Qwen3-4B-Instruct-2507 performs very poorly on NLI #1757

### Description

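Qwen3-4B-Instruct-2507 performs very poorly on an NLI task. Even small changes to the prompt, such as editing a few punctuation marks, can produce a large difference in the final result. For comparison, Qwen3-8B (with thinking turned off) does not have this problem.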

### Reproduction

Using the standard OpenAI prompt format:
```python
prompt = [{'role': 'system', 'content': '\n<role>\nYou are a rigorous Logic & Sufficiency Evaluator. Your task is to determine if the provided `<context>` contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>` without any external knowledge or hallucination. You are inherently cautious and MUST default to \'NO\' if there is any ambiguity, missing detail, or uncertainty.\n</role>\n\n<sop>\nYou must adhere to the following STANDARD OPERATING PROCEDURE (SOP) when determining whether the provided `<context>` contains sufficient information to directly resolve the `<target_problem>` (**Execute this internally before answering**):\n1. **Thoroughly Understand the Target Problem**: Carefully read and comprehend the provided `<target_problem>`, paying attention to all explicit requirements and implicit nuances.\n2. **Deeply Analyze the Context**: Meticulously review the entire `<context>`, identifying all relevant facts, data points, and observations that pertain to the `<target_problem>`.\n3. **Evaluate Sufficiency**: Rigorously evaluate whether the information in the `<context>` is **sufficient, complete, and direct evidence** to resolve the `<target_problem>`. Consider the following:\n   - **Completeness**: Does the `<context>` provide all necessary information to address every aspect of the `<target_problem>`?\n   - **Clarity**: Is the information in the `<context>` clear and unambiguous enough to draw definitive conclusions?\n   - **Direct Relevance**: Are the facts and observations in the `<context>` directly applicable to solving the `<target_problem>` without requiring additional assumptions or external knowledge?\n4. **Decide on Sufficiency**: If you find that the `<context>` fully satisfies all criteria for sufficiency, you may respond "YES". However, if there are any uncertainties, gaps, or ambiguities, you must respond "NO".\n</sop>\n\n<output_format>\nYou MUST respond with a single word: "YES" if the `<context>` contains sufficient information to directly resolve the `<target_problem>`, or "NO" if it does not. You MUST NOT provide any additional explanation or commentary.\n</output_format>\n'}, {'role': 'user', 'content': '\n<target_problem>\n核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致\n</target_problem>\n\n<context>\n客户已上传以下材料：\n- 客户资料表单 (Application Form)\n- 营业执照 (Business License)\n- 法人身份证 (Legal Representative ID Card)\n- 法人手持身份证照片 (Handheld ID Card Photo)\n- 股东身份证 (Shareholder ID Card)\n- 企业基础信息 (Structured Company Information)[\n  {\n    "action": "PLAN",\n    "action_description": "Generate a detailed plan.",\n    "step_content": {\n      "tasks": [\n        "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n        "验证营业执照的有效期、注册地址、经营范围是否在有效范围内",\n        "比对法人身份证信息与手持身份证照片，确认身份信息一致且照片为本人真实持证状态",\n        "核实法人身份信息与企业登记信息是否匹配，确认其为法定代表人",\n        "提取股东信息，确认股东身份是否真实，比对股东身份证与企业登记信息是否一致",\n        "通过企业基础信息数据，识别企业股权结构，识别最终受益所有人（UBO）",\n        "分析UBO的国籍、职业、是否涉及高风险国家或职业（如政治人物、司法人员、高风险行业）",\n        "比对客户申报的业务活动与经营范围是否匹配，是否存在不一致或虚构业务",\n        "评估客户资金来源、用途与账户交易记录是否形成逻辑闭环，是否存在异常现金流",\n        "根据企业所属行业、注册地、实际控制人背景、交易行为等要素，进行洗钱与制裁风险初步评估",\n        "依据风险评估结果，对客户划分风险等级（如低、中、高风险）并记录依据",\n        "检查开户流程是否符合KYC、CDD、AML等监管要求，确保信息留存完整且可追溯",\n        "撰写合规筛查报告，明确客户合法性、透明性、风险等级及建议（如是否可开户、是否需加强监控等）"\n      ],\n      "node_type": "PLAN"\n    }\n  },\n  {\n    "action": "PROPOSE",\n    "action_description": "Propose a task.",\n    "step_content": {\n      "task": "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n      "node_type": "TASK"\n    }\n  }\n]\n</context>\n\n<instruction>\nBased on the provided `<context>`, determine whether it contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>`, strictly adhering to the `<sop>`.\n</instruction>\n'}]
```
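
For anyone trying to reproduce this, here is a minimal sketch of how the request could be sent with the settings used in this report (temperature 0, a single output token). It assumes an OpenAI-compatible chat completions server; the base URL, the served model name, and the helper names `build_request`, `send`, and `first_token_candidates` are hypothetical and not part of the original setup. Requesting `logprobs` is an addition for diagnosis: it shows how close YES and NO are on the first token.

```python
import json
import urllib.request


def build_request(messages, model, top_logprobs=5):
    """Build an OpenAI-style chat completions payload for a one-token verdict."""
    return {
        "model": model,          # assumption: whatever name the server exposes
        "messages": messages,
        "temperature": 0,        # greedy decoding, as in the report
        "max_tokens": 1,         # only the first token is generated
        "logprobs": True,        # diagnostic addition, not in the original setup
        "top_logprobs": top_logprobs,
    }


def send(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to a hypothetical local OpenAI-compatible endpoint."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def first_token_candidates(response):
    """Return {token: logprob} for the candidates of the first generated token."""
    first = response["choices"][0]["logprobs"]["content"][0]
    return {item["token"]: item["logprob"] for item in first["top_logprobs"]}
```

With a server running, something like `first_token_candidates(send(build_request(prompt, model="Qwen3-4B-Instruct-2507")))` would list the top candidates. If YES and NO turn out to have nearly equal log-probabilities, that would be consistent with the answer flipping on small prompt edits.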

temperature = 0, max_tokens = 1, so the model is asked to output a single token.
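The 4B model most likely outputs YES (wrong), while the 8B model outputs NO (correct). After a slight change to the prompt (for example, wrapping the content inside the context in triple-backtick fences), the 4B model may output NO (correct). Its output is very unstable and often incorrect.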
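
The perturbation described above can be sketched roughly as follows. This is only an illustration under assumptions: `fence_context` and `agreement` are hypothetical helpers, and the exact edit used in the original experiment may differ. The first function wraps the body of the `<context>` element in the user message in a triple-backtick fence; the second gives a crude stability score over the answers collected from several prompt variants.

```python
import re
from collections import Counter

FENCE = "`" * 3  # three backticks, built this way to keep this snippet fence-safe


def fence_context(user_content):
    """Wrap the body of the first <context>...</context> pair in a code fence."""
    def wrap(match):
        body = match.group(2).strip("\n")
        return f"{match.group(1)}\n{FENCE}\n{body}\n{FENCE}\n{match.group(3)}"

    return re.sub(
        r"(<context>)(.*?)(</context>)",
        wrap,
        user_content,
        count=1,
        flags=re.DOTALL,
    )


def agreement(answers):
    """Fraction of answers equal to the most common answer (1.0 = fully stable)."""
    if not answers:
        raise ValueError("answers must not be empty")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)
```

For example, applying `fence_context` to `prompt[1]['content']` gives one variant of the user message, and `agreement(["YES", "NO", "YES", "YES"])` evaluates to 0.75. Comparing this score between the 4B and 8B models over the same set of variants would put a number on the instability reported here.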


### Logs

```shell

```

### Environment Information

NA

### Known Issue

- [ ] The issue hasn't been already addressed in Documentation, Issues, and Discussions.
