Very poor performance when using Qwen3-4B-Instruct-2507 for NLI. #1757

@weiminw

Description

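Performance is very poor when using Qwen3-4B-Instruct-2507 for NLI. Changing little more than a few punctuation marks in the prompt can cause large differences in the final result. By comparison, Qwen3-8B (with thinking disabled) does not have this problem.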

Reproduction

Using the standard OpenAI prompt format:
prompt = [{'role': 'system', 'content': '\n<role>\nYou are a rigorous Logic & Sufficiency Evaluator. Your task is to determine if the provided `<context>` contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>` without any external knowledge or hallucination. You are inherently cautious and MUST default to \'NO\' if there is any ambiguity, missing detail, or uncertainty.\n</role>\n\n<sop>\nYou must adhere to the following STANDARD OPERATING PROCEDURE (SOP) when determining whether the provided `<context>` contains sufficient information to directly resolve the `<target_problem>` (**Execute this internally before answering**):\n1. **Thoroughly Understand the Target Problem**: Carefully read and comprehend the provided `<target_problem>`, paying attention to all explicit requirements and implicit nuances.\n2. **Deeply Analyze the Context**: Meticulously review the entire `<context>`, identifying all relevant facts, data points, and observations that pertain to the `<target_problem>`.\n3. **Evaluate Sufficiency**: Rigorously evaluate whether the information in the `<context>` is **sufficient, complete, and direct evidence** to resolve the `<target_problem>`. Consider the following:\n - **Completeness**: Does the `<context>` provide all necessary information to address every aspect of the `<target_problem>`?\n - **Clarity**: Is the information in the `<context>` clear and unambiguous enough to draw definitive conclusions?\n - **Direct Relevance**: Are the facts and observations in the `<context>` directly applicable to solving the `<target_problem>` without requiring additional assumptions or external knowledge?\n4. **Decide on Sufficiency**: If you find that the `<context>` fully satisfies all criteria for sufficiency, you may respond "YES". 
However, if there are any uncertainties, gaps, or ambiguities, you must respond "NO".\n</sop>\n\n<output_format>\nYou MUST respond with a single word: "YES" if the `<context>` contains sufficient information to directly resolve the `<target_problem>`, or "NO" if it does not. You MUST NOT provide any additional explanation or commentary.\n</output_format>\n'}, {'role': 'user', 'content': '\n<target_problem>\n核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致\n</target_problem>\n\n<context>\n客户已上传以下材料:\n- 客户资料表单 (Application Form)\n- 营业执照 (Business License)\n- 法人身份证 (Legal Representative ID Card)\n- 法人手持身份证照片 (Handheld ID Card Photo)\n- 股东身份证 (Shareholder ID Card)\n- 企业基础信息 (Structured Company Information)[\n {\n "action": "PLAN",\n "action_description": "Generate a detailed plan.",\n "step_content": {\n "tasks": [\n "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n "验证营业执照的有效期、注册地址、经营范围是否在有效范围内",\n "比对法人身份证信息与手持身份证照片,确认身份信息一致且照片为本人真实持证状态",\n "核实法人身份信息与企业登记信息是否匹配,确认其为法定代表人",\n "提取股东信息,确认股东身份是否真实,比对股东身份证与企业登记信息是否一致",\n "通过企业基础信息数据,识别企业股权结构,识别最终受益所有人(UBO)",\n "分析UBO的国籍、职业、是否涉及高风险国家或职业(如政治人物、司法人员、高风险行业)",\n "比对客户申报的业务活动与经营范围是否匹配,是否存在不一致或虚构业务",\n "评估客户资金来源、用途与账户交易记录是否形成逻辑闭环,是否存在异常现金流",\n "根据企业所属行业、注册地、实际控制人背景、交易行为等要素,进行洗钱与制裁风险初步评估",\n "依据风险评估结果,对客户划分风险等级(如低、中、高风险)并记录依据",\n "检查开户流程是否符合KYC、CDD、AML等监管要求,确保信息留存完整且可追溯",\n "撰写合规筛查报告,明确客户合法性、透明性、风险等级及建议(如是否可开户、是否需加强监控等)"\n ],\n "node_type": "PLAN"\n }\n },\n {\n "action": "PROPOSE",\n "action_description": "Propose a task.",\n "step_content": {\n "task": "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n "node_type": "TASK"\n }\n }\n]\n</context>\n\n<instruction>\nBased on the provided `<context>`, determine whether it contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>`, strictly adhering to the `<sop>`.\n</instruction>\n'}]

temperature = 0, max_tokens = 1; only a single output token is requested.
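For anyone trying to reproduce this, the sampling settings above could be sent roughly as follows. This is a minimal sketch, not the exact client code used in the report: it assumes the model is served behind an OpenAI-compatible `/chat/completions` endpoint, and the `base_url` default, the model name string, and the helper names `build_request` and `send` are all hypothetical.

```python
import json
import urllib.request


def build_request(messages, model, temperature=0, max_tokens=1):
    """Build a chat-completions payload with the settings from the report."""
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,  # greedy decoding
        "max_tokens": max_tokens,    # only one output token is requested
    }


def send(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to an OpenAI-compatible server (base_url is an assumption)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard chat-completions response shape: first choice, message content.
    return body["choices"][0]["message"]["content"]
```

Calling `send(build_request(prompt, "Qwen/Qwen3-4B-Instruct-2507"))` with the `prompt` list above would then return the single-token answer, assuming such a server is running.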
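The 4B model most likely outputs YES (wrong), while the 8B model outputs NO (correct). With a slight change to the prompt (for example, wrapping the content inside the context in triple-backtick fences), the 4B model may output NO (correct). Its output is very unstable and frequently incorrect.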
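The sensitivity described here can be checked systematically by running the same prompt in several surface variants and comparing the answers. The sketch below is one possible way to do that, under stated assumptions: `fence_context`, `compare_variants`, and `is_stable` are hypothetical helper names, and `complete` stands for any callable that takes a message list and returns the model's text (it is not part of any library).

```python
import copy
import re

FENCE = "`" * 3  # triple backtick, built programmatically

# Matches the body between "<context>\n" and "\n</context>" in the user message.
_CONTEXT = re.compile(r"(<context>\n)(.*?)(\n</context>)", re.DOTALL)


def fence_context(messages):
    """Return a copy of messages with the user <context> body wrapped in fences."""
    out = copy.deepcopy(messages)
    for msg in out:
        if msg["role"] == "user":
            msg["content"] = _CONTEXT.sub(
                lambda m: f"{m.group(1)}{FENCE}\n{m.group(2)}\n{FENCE}{m.group(3)}",
                msg["content"],
            )
    return out


def compare_variants(complete, variants):
    """Run each named prompt variant and collect the normalized answer."""
    return {name: complete(msgs).strip().upper() for name, msgs in variants.items()}


def is_stable(results):
    """True when every variant produced the same answer."""
    return len(set(results.values())) == 1
```

For example, `compare_variants(complete, {"original": prompt, "fenced": fence_context(prompt)})` would give one answer per variant, and `is_stable` on that result would flag the kind of flip reported for the 4B model.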

Logs

Environment Information

NA

Known Issue

  • The issue hasn't been already addressed in Documentation, Issues, and Discussions.

Metadata

Labels

enhancement: New feature or request
