
[Fix] Optimize the evaluation process for OCR-Reasoning #1413

Open
shannanyinxiang wants to merge 2 commits into open-compass:main from shannanyinxiang:main

Conversation

@shannanyinxiang

Fix 1: Delete the hardcoded "nproc=1" in the evaluation of OCR_Reasoning

Delete the hardcoded "nproc=1" in the evaluation of OCR_Reasoning to accelerate the LLM judging process.
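
A minimal sketch of what fix 1 amounts to, assuming the evaluate routine uses VLMEvalKit's track_progress_rich helper and now takes the worker count from judge_kwargs instead of a literal 1 (the variable names below are illustrative, not a verbatim diff from this PR):

```python
from vlmeval.smp import track_progress_rich  # assumed import path

# Before: judge requests ran single-threaded regardless of user settings.
# scores = track_progress_rich(judge_one_sample, tasks, nproc=1, keys=indices)

# After: honor the parallelism requested by the caller (e.g. via judge_kwargs),
# with a modest fallback. judge_one_sample, tasks and indices are placeholders.
nproc = judge_kwargs.pop('nproc', 4)  # 4 is an illustrative default, not necessarily the PR's value
scores = track_progress_rich(judge_one_sample, tasks, nproc=nproc, keys=indices)
```

With the hardcoded value gone, judging throughput scales with whatever nproc the user passes to the run.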

Fix 2: Increase max_tokens of the LLM judge for OCR-Reasoning

The OCR-Reasoning evaluation involves scoring of reasoning processes, where the LLM judge is prompted to output explanatory text before its final rating (https://github.com/open-compass/VLMEvalKit/blob/2c25371d602909ae3d6d395185aff1bc9493262d/vlmeval/dataset/utils/ocr_reasoning.py#L7):

judge_prompts = '''Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer and the assistant's answer. Begin your evaluation by comparing the assistant's answer with the reference answer. Identify and correct any mistakes. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[The Start of Reference Answer]\n{ref_answer_1}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]". Again, you must output a score by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".''' # noqa e501

Therefore, the original 1024 max_tokens setting is insufficient, consistently leading to unintended response truncation.
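
Because the judge must finish its explanation before emitting the trailing "Rating: [[n]]" marker, a truncated reply usually contains no parsable score at all. Below is a minimal sketch of fix 2 under that assumption, using build_judge and an illustrative limit of 4096 (the exact value adopted by the PR is not restated here); the rating extraction is shown only to illustrate why truncation is fatal, not as the repository's actual parser:

```python
import re

from vlmeval.dataset.utils import build_judge  # assumed import path

# Hypothetical sketch: give the judge enough output budget for its explanation
# plus the final "Rating: [[n]]" marker. 4096 is illustrative, not the PR's value.
judge_kwargs['max_tokens'] = max(judge_kwargs.get('max_tokens', 1024), 4096)
judge = build_judge(**judge_kwargs)

def judge_one_sample(question, reference, prediction):
    """Score one sample; returns the parsed rating, or None if the reply
    was truncated before (or otherwise lacks) the [[rating]] marker."""
    prompt = judge_prompts.format(question=question, ref_answer_1=reference, answer=prediction)
    response = judge.generate(prompt)
    match = re.search(r'\[\[(\d+(?:\.\d+)?)\]\]', response)
    return float(match.group(1)) if match else None
```

If max_tokens is too small, the search above finds nothing and the sample simply cannot be scored, which is the failure mode the larger limit removes.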

@shannanyinxiang changed the title from "Optimize the evaluation process for the OCR-Reasoning dataset" to "[Fix] Optimize the evaluation process for the OCR-Reasoning dataset" on Jan 24, 2026
@shannanyinxiang changed the title from "[Fix] Optimize the evaluation process for the OCR-Reasoning dataset" to "[Fix] Optimize the evaluation process for OCR-Reasoning" on Jan 24, 2026