
fix: normalize test names with strip() before pass/fail comparison#77

Open
pedropnaves wants to merge 1 commit into scaleapi:main from pedropnaves:fix/test-name-normalization-in-eval

Conversation


@pedropnaves pedropnaves commented Feb 28, 2026

Problem

The scoring logic in swe_bench_pro_eval.py uses exact string equality when checking whether required tests passed:

passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
f2p = set(eval(raw_sample["fail_to_pass"]))
p2p = set(eval(raw_sample["pass_to_pass"]))
result = (f2p | p2p) <= passed_tests

For some instances, run_script.sh uses sed to prefix test descriptions with the source filename. For certain test titles, this transformation introduces a trailing space in the name that parser.py writes to output.json. Because the fail_to_pass entries in the dataset do not have this trailing space, the comparison fails and the instance is marked unresolved even though every test passed.

Confirmed affected test (instance instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan):

# In output["tests"]  (has trailing space):
"...getSortedSetRange() should work with big arrays (length > 100) "

# In fail_to_pass  (no trailing space):
"...getSortedSetRange() should work with big arrays (length > 100)"

Result: the instance scores False despite 714/714 tests passing.
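The failure mode is easy to reproduce in isolation. The following minimal sketch (with hypothetical test names) mirrors the set-subset check from swe_bench_pro_eval.py:

```python
# Minimal reproduction of the scoring bug: a single trailing space in a
# reported test name breaks the exact-match subset comparison.
passed_tests = {
    "getSortedSetRange() should work with big arrays (length > 100) ",  # trailing space
    "some other test that passed",
}

# Required tests from the dataset (no trailing space).
f2p = {"getSortedSetRange() should work with big arrays (length > 100)"}
p2p = {"some other test that passed"}

# Same comparison as the eval script: every required test must appear
# verbatim in passed_tests.
result = (f2p | p2p) <= passed_tests
print(result)  # False, even though every required test actually passed
```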

Fix

Strip surrounding whitespace from test names on both sides before comparing:

passed_tests = {x["name"].strip() for x in output["tests"] if x["status"] == "PASSED"}
f2p = {t.strip() for t in eval(raw_sample["fail_to_pass"])}
p2p = {t.strip() for t in eval(raw_sample["pass_to_pass"])}

This is a safe, non-breaking change: legitimate test names do not have meaningful leading or trailing whitespace.
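Applied to the same hypothetical names as the reproduction above, the normalization restores the expected result:

```python
# With .strip() applied to both sides, the trailing space no longer matters.
raw_passed = [
    "getSortedSetRange() should work with big arrays (length > 100) ",  # trailing space
    "some other test that passed",
]
passed_tests = {name.strip() for name in raw_passed}

f2p = {t.strip() for t in ["getSortedSetRange() should work with big arrays (length > 100)"]}
p2p = {t.strip() for t in ["some other test that passed"]}

result = (f2p | p2p) <= passed_tests
print(result)  # True: the instance is now correctly scored as resolved
```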

Related

See issue #76 for a second, related data-quality bug in the fail_to_pass field of the HuggingFace dataset (truncated test names due to unescaped embedded double-quotes) that requires a dataset update to fully resolve.

Greptile Summary

This PR fixes a scoring bug where instances were incorrectly marked as unresolved despite all tests passing. The root cause is that run_script.sh uses sed to prefix test descriptions with filenames, and this transformation can introduce trailing whitespace in the test names written to output.json. Since the fail_to_pass / pass_to_pass entries in the dataset lack this trailing space, the exact string comparison fails.

The fix adds .strip() to normalize whitespace on both sides of the comparison — on output["tests"] names, and on the fail_to_pass / pass_to_pass entries. This is a minimal, safe, and correct change.

  • Confirmed root cause: sed in run_script.sh (lines 37-39) can introduce trailing spaces in Mocha test titles
  • parser.py propagates these spaces into output.json via fullTitle
  • Only one comparison site in the codebase, and this PR addresses it completely

Confidence Score: 5/5

  • This PR is safe to merge — it's a minimal, non-breaking fix to a confirmed scoring bug.
  • The change is a 3-line diff adding .strip() calls to normalize whitespace before string comparison. The root cause is well-documented and confirmed. No new logic, no new dependencies, no behavioral change for test names that don't have extraneous whitespace.
  • No files require special attention.

Important Files Changed

swe_bench_pro_eval.py: Adds .strip() to normalize test names on both sides of the pass/fail comparison (lines 555-557), fixing false negatives caused by trailing whitespace from sed-based test name prefixing in run_script.sh.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_script.sh: sed prefixes test names with filename] --> B[Mocha outputs JSON with fullTitle]
    B --> C[parser.py constructs name: file + fullTitle]
    C --> D["output.json: test name may have trailing space"]
    E["Dataset: fail_to_pass / pass_to_pass entries (no trailing space)"]
    D --> F{"Compare: f2p ∪ p2p ⊆ passed_tests"}
    E --> F
    F -- "Before fix: exact match fails" --> G[Instance marked unresolved ❌]
    F -- "After fix: .strip() normalizes both sides" --> H[Instance correctly resolved ✅]

Last reviewed commit: ab721d5

