-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Description
使用Qwen3-4B-Instruct-2507做NLI时性能极差. 差不多在prompt中修改一些标点符号等,都可能引起最后的结果巨大差异. 对比了一下Qwen3-8B(关闭了Think) 却没有这个问题.
Reproduction
使用标准的Openai prompt格式:
prompt = [{'role': 'system', 'content': '\n<role>\nYou are a rigorous Logic & Sufficiency Evaluator. Your task is to determine if the provided `<context>` contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>` without any external knowledge or hallucination. You are inherently cautious and MUST default to \'NO\' if there is any ambiguity, missing detail, or uncertainty.\n</role>\n\n<sop>\nYou must adhere to the following STANDARD OPERATING PROCEDURE (SOP) when determining whether the provided `<context>` contains sufficient information to directly resolve the `<target_problem>` (**Execute this internally before answering**):\n1. **Thoroughly Understand the Target Problem**: Carefully read and comprehend the provided `<target_problem>`, paying attention to all explicit requirements and implicit nuances.\n2. **Deeply Analyze the Context**: Meticulously review the entire `<context>`, identifying all relevant facts, data points, and observations that pertain to the `<target_problem>`.\n3. **Evaluate Sufficiency**: Rigorously evaluate whether the information in the `<context>` is **sufficient, complete, and direct evidence** to resolve the `<target_problem>`. Consider the following:\n - **Completeness**: Does the `<context>` provide all necessary information to address every aspect of the `<target_problem>`?\n - **Clarity**: Is the information in the `<context>` clear and unambiguous enough to draw definitive conclusions?\n - **Direct Relevance**: Are the facts and observations in the `<context>` directly applicable to solving the `<target_problem>` without requiring additional assumptions or external knowledge?\n4. **Decide on Sufficiency**: If you find that the `<context>` fully satisfies all criteria for sufficiency, you may respond "YES". However, if there are any uncertainties, gaps, or ambiguities, you must respond "NO".\n</sop>\n\n<output_format>\nYou MUST respond with a single word: "YES" if the `<context>` contains sufficient information to directly resolve the `<target_problem>`, or "NO" if it does not. You MUST NOT provide any additional explanation or commentary.\n</output_format>\n'}, {'role': 'user', 'content': '\n<target_problem>\n核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致\n</target_problem>\n\n<context>\n客户已上传以下材料:\n- 客户资料表单 (Application Form)\n- 营业执照 (Business License)\n- 法人身份证 (Legal Representative ID Card)\n- 法人手持身份证照片 (Handheld ID Card Photo)\n- 股东身份证 (Shareholder ID Card)\n- 企业基础信息 (Structured Company Information)[\n {\n "action": "PLAN",\n "action_description": "Generate a detailed plan.",\n "step_content": {\n "tasks": [\n "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n "验证营业执照的有效期、注册地址、经营范围是否在有效范围内",\n "比对法人身份证信息与手持身份证照片,确认身份信息一致且照片为本人真实持证状态",\n "核实法人身份信息与企业登记信息是否匹配,确认其为法定代表人",\n "提取股东信息,确认股东身份是否真实,比对股东身份证与企业登记信息是否一致",\n "通过企业基础信息数据,识别企业股权结构,识别最终受益所有人(UBO)",\n "分析UBO的国籍、职业、是否涉及高风险国家或职业(如政治人物、司法人员、高风险行业)",\n "比对客户申报的业务活动与经营范围是否匹配,是否存在不一致或虚构业务",\n "评估客户资金来源、用途与账户交易记录是否形成逻辑闭环,是否存在异常现金流",\n "根据企业所属行业、注册地、实际控制人背景、交易行为等要素,进行洗钱与制裁风险初步评估",\n "依据风险评估结果,对客户划分风险等级(如低、中、高风险)并记录依据",\n "检查开户流程是否符合KYC、CDD、AML等监管要求,确保信息留存完整且可追溯",\n "撰写合规筛查报告,明确客户合法性、透明性、风险等级及建议(如是否可开户、是否需加强监控等)"\n ],\n "node_type": "PLAN"\n }\n },\n {\n "action": "PROPOSE",\n "action_description": "Propose a task.",\n "step_content": {\n "task": "核对客户资料表单与营业执照上的企业名称、统一社会信用代码是否一致",\n "node_type": "TASK"\n }\n }\n]\n</context>\n\n<instruction>\nBased on the provided `<context>`, determine whether it contains **sufficient, complete, and direct evidence** to resolve the `<target_problem>`, strictly adhering to the `<sop>`.\n</instruction>\n'}]
temperature = 0, max_tokens=1 , 只要求输出一个Token.
4B输出大概率是YES (错的), 8B输出NO(正确的). 稍微改动里面的prompt(比如在context里面将内容用```符号包裹), 4B则有可能输出NO(正确的)非常不稳定,且不正确,
Logs
Environment Information
NA
Known Issue
- The issue hasn't been already addressed in Documentation, Issues, and Discussions.