
[Fix] Optimize the evaluation process for OCR-Reasoning #1413

Open
shannanyinxiang wants to merge 2 commits into open-compass:main from shannanyinxiang:main

Conversation

@shannanyinxiang

Fix 1: Delete the hardcoded "nproc=1" in the evaluation of OCR_Reasoning

Delete the hardcoded "nproc=1" in the evaluation of OCR_Reasoning to accelerate the LLM judging process.
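
A minimal sketch of what fix 1 amounts to, assuming the evaluate routine uses VLMEvalKit's track_progress_rich helper and now takes the worker count from judge_kwargs instead of a literal 1 (the variable names below are illustrative, not a verbatim diff from this PR):

```python
from vlmeval.smp import track_progress_rich  # assumed import path

# Before: judge requests ran single-threaded regardless of user settings.
# scores = track_progress_rich(judge_one_sample, tasks, nproc=1, keys=indices)

# After: honor the parallelism requested by the caller (e.g. via judge_kwargs),
# with a modest fallback. judge_one_sample, tasks and indices are placeholders.
nproc = judge_kwargs.pop('nproc', 4)  # 4 is an illustrative default, not necessarily the PR's value
scores = track_progress_rich(judge_one_sample, tasks, nproc=nproc, keys=indices)
```

With the hardcoded value gone, judging throughput scales with whatever nproc the user passes to the run.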

Fix 2: Increase max_tokens of the LLM judge for OCR-Reasoning

The OCR-Reasoning evaluation involves scoring of reasoning processes, where the LLM judge is prompted to output explanatory text before its final rating (https://github.com/open-compass/VLMEvalKit/blob/2c25371d602909ae3d6d395185aff1bc9493262d/vlmeval/dataset/utils/ocr_reasoning.py#L7):

judge_prompts = '''Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Reference Answer]\n{ref_answer_1}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]". Again, you must output a score by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".''' # noqa e501

Therefore, the original 1024 max_tokens setting is insufficient, consistently leading to unintended response truncation.
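
Because the judge must finish its explanation before emitting the trailing "Rating: [[n]]" marker, a truncated reply usually contains no parsable score at all. Below is a minimal sketch of fix 2 under that assumption, using build_judge and an illustrative limit of 4096 (the exact value adopted by the PR is not restated here); the rating extraction is shown only to illustrate why truncation is fatal, not as the repository's actual parser:

```python
import re

from vlmeval.dataset.utils import build_judge  # assumed import path

# Hypothetical sketch: give the judge enough output budget for its explanation
# plus the final "Rating: [[n]]" marker. 4096 is illustrative, not the PR's value.
judge_kwargs['max_tokens'] = max(judge_kwargs.get('max_tokens', 1024), 4096)
judge = build_judge(**judge_kwargs)

def judge_one_sample(question, reference, prediction):
    """Score one sample; returns the parsed rating, or None if the reply
    was truncated before (or otherwise lacks) the [[rating]] marker."""
    prompt = judge_prompts.format(question=question, ref_answer_1=reference, answer=prediction)
    response = judge.generate(prompt)
    match = re.search(r'\[\[(\d+(?:\.\d+)?)\]\]', response)
    return float(match.group(1)) if match else None
```

If max_tokens is too small, the search above finds nothing and the sample simply cannot be scored, which is the failure mode the larger limit removes.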

@shannanyinxiang changed the title from "Optimize the evaluation process for the OCR-Reasoning dataset" to "[Fix] Optimize the evaluation process for the OCR-Reasoning dataset" on Jan 24, 2026
@shannanyinxiang changed the title from "[Fix] Optimize the evaluation process for the OCR-Reasoning dataset" to "[Fix] Optimize the evaluation process for OCR-Reasoning" on Jan 24, 2026