
fix: normalize test names with strip() before pass/fail comparison#77

Open
pedropnaves wants to merge 1 commit into scaleapi:main from pedropnaves:fix/test-name-normalization-in-eval

Conversation


@pedropnaves pedropnaves commented Feb 28, 2026

Problem

The scoring logic in swe_bench_pro_eval.py uses exact string equality when checking whether required tests passed:

passed_tests = {x["name"] for x in output["tests"] if x["status"] == "PASSED"}
f2p = set(eval(raw_sample["fail_to_pass"]))
p2p = set(eval(raw_sample["pass_to_pass"]))
result = (f2p | p2p) <= passed_tests

For some instances, run_script.sh uses sed to prefix test descriptions with the source filename. For certain test titles, this transformation introduces a trailing space in the name that parser.py writes to output.json. Because the fail_to_pass entries in the dataset do not have this trailing space, the comparison fails and the instance is marked unresolved even though every test passed.

Confirmed affected test (instance instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan):

# In output["tests"]  (has trailing space):
"...getSortedSetRange() should work with big arrays (length > 100) "

# In fail_to_pass  (no trailing space):
"...getSortedSetRange() should work with big arrays (length > 100)"

Result: the instance scores False despite 714/714 tests passing.
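The failure mode is easy to reproduce in isolation. The following minimal sketch (with hypothetical test names) mirrors the set-subset check from swe_bench_pro_eval.py:

```python
# Minimal reproduction of the scoring bug: a single trailing space in a
# reported test name breaks the exact-match subset comparison.
passed_tests = {
    "getSortedSetRange() should work with big arrays (length > 100) ",  # trailing space
    "some other test that passed",
}

# Required tests from the dataset (no trailing space).
f2p = {"getSortedSetRange() should work with big arrays (length > 100)"}
p2p = {"some other test that passed"}

# Same comparison as the eval script: every required test must appear
# verbatim in passed_tests.
result = (f2p | p2p) <= passed_tests
print(result)  # False, even though every required test actually passed
```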

Fix

Strip surrounding whitespace from test names on both sides before comparing:

passed_tests = {x["name"].strip() for x in output["tests"] if x["status"] == "PASSED"}
f2p = {t.strip() for t in eval(raw_sample["fail_to_pass"])}
p2p = {t.strip() for t in eval(raw_sample["pass_to_pass"])}

This is a safe, non-breaking change: legitimate test names do not have meaningful leading or trailing whitespace.
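Applied to the same hypothetical names as the reproduction above, the normalization restores the expected result:

```python
# With .strip() applied to both sides, the trailing space no longer matters.
raw_passed = [
    "getSortedSetRange() should work with big arrays (length > 100) ",  # trailing space
    "some other test that passed",
]
passed_tests = {name.strip() for name in raw_passed}

f2p = {t.strip() for t in ["getSortedSetRange() should work with big arrays (length > 100)"]}
p2p = {t.strip() for t in ["some other test that passed"]}

result = (f2p | p2p) <= passed_tests
print(result)  # True: the instance is now correctly scored as resolved
```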

Related

See issue #76 for a second, related data-quality bug in the fail_to_pass field of the HuggingFace dataset (truncated test names due to unescaped embedded double-quotes) that requires a dataset update to fully resolve.

Greptile Summary

This PR fixes a scoring bug where instances were incorrectly marked as unresolved despite all tests passing. The root cause is that run_script.sh uses sed to prefix test descriptions with filenames, and this transformation can introduce trailing whitespace in the test names written to output.json. Since the fail_to_pass / pass_to_pass entries in the dataset lack this trailing space, the exact string comparison fails.

The fix adds .strip() to normalize whitespace on both sides of the comparison — on output["tests"] names, and on the fail_to_pass / pass_to_pass entries. This is a minimal, safe, and correct change.

  • Confirmed root cause: sed in run_script.sh (lines 37-39) can introduce trailing spaces in Mocha test titles
  • parser.py propagates these spaces into output.json via fullTitle
  • Only one comparison site in the codebase, and this PR addresses it completely

Confidence Score: 5/5

  • This PR is safe to merge — it's a minimal, non-breaking fix to a confirmed scoring bug.
  • The change is a 3-line diff adding .strip() calls to normalize whitespace before string comparison. The root cause is well-documented and confirmed. No new logic, no new dependencies, no behavioral change for test names that don't have extraneous whitespace.
  • No files require special attention.

Important Files Changed

swe_bench_pro_eval.py: Adds .strip() to normalize test names on both sides of the pass/fail comparison (lines 555-557), fixing false negatives caused by trailing whitespace from sed-based test name prefixing in run_script.sh.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_script.sh: sed prefixes test names with filename] --> B[Mocha outputs JSON with fullTitle]
    B --> C[parser.py constructs name: file + fullTitle]
    C --> D["output.json: test name may have trailing space"]
    E["Dataset: fail_to_pass / pass_to_pass entries (no trailing space)"]
    D --> F{"Compare: f2p ∪ p2p ⊆ passed_tests"}
    E --> F
    F -- "Before fix: exact match fails" --> G[Instance marked unresolved ❌]
    F -- "After fix: .strip() normalizes both sides" --> H[Instance correctly resolved ✅]

Last reviewed commit: ab721d5

