fix: normalize test names with strip() before pass/fail comparison (#77)
Open

pedropnaves wants to merge 1 commit into scaleapi:main from
Conversation
Problem

The scoring logic in swe_bench_pro_eval.py uses exact string equality when checking whether required tests passed.

The run_script.sh instances use sed to prefix test descriptions with the source filename. For certain test titles this transformation introduces a trailing space in the name written to output.json by parser.py. Because the fail_to_pass entries in the dataset do not have this trailing space, the comparison fails and the instance is marked unresolved even though every test passed.

Confirmed affected test (instance instance_NodeBB__NodeBB-00c70ce7b0541cfc94afe567921d7668cdc8f4ac-vnan):

Result: the instance scores False despite 714/714 tests passing.

Fix
Strip surrounding whitespace from test names on both sides before comparing:
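The actual diff is not reproduced in this page; as a minimal sketch of the normalized comparison (function and variable names here are hypothetical, not the scorer's actual identifiers):

```python
def instance_resolved(passed_tests, fail_to_pass, pass_to_pass):
    """Return True when every required test name appears in the passed set.

    Names are stripped on both sides so that trailing whitespace
    introduced by the sed-based prefixing in run_script.sh cannot
    cause a false negative.
    """
    passed = {name.strip() for name in passed_tests}
    required = {name.strip() for name in fail_to_pass + pass_to_pass}
    return required <= passed


# A trailing space in the parsed name no longer breaks the match:
print(instance_resolved(
    passed_tests=["test/posts.js should edit a post "],  # trailing space
    fail_to_pass=["test/posts.js should edit a post"],
    pass_to_pass=[],
))  # prints True
```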
This is a safe, non-breaking change: legitimate test names do not have meaningful leading or trailing whitespace.
Related

See issue #76 for a second, related data-quality bug in the fail_to_pass field of the HuggingFace dataset (truncated test names due to unescaped embedded double quotes) that requires a dataset update to fully resolve.

Greptile Summary
This PR fixes a scoring bug where instances were incorrectly marked as unresolved despite all tests passing. The root cause is that run_script.sh uses sed to prefix test descriptions with filenames, and this transformation can introduce trailing whitespace in the test names written to output.json. Since the fail_to_pass/pass_to_pass entries in the dataset lack this trailing space, the exact string comparison fails.

The fix adds .strip() to normalize whitespace on both sides of the comparison: on output["tests"] names, and on the fail_to_pass/pass_to_pass entries. This is a minimal, safe, and correct change.

- sed in run_script.sh (lines 37-39) can introduce trailing spaces in Mocha test titles
- parser.py propagates these spaces into output.json via fullTitle

Confidence Score: 5/5
The change adds only .strip() calls to normalize whitespace before string comparison. The root cause is well-documented and confirmed. No new logic, no new dependencies, no behavioral change for test names that don't have extraneous whitespace.

Important Files Changed
swe_bench_pro_eval.py: added .strip() to normalize test names on both sides of the pass/fail comparison (lines 555-557), fixing false negatives caused by trailing whitespace from sed-based test name prefixing in run_script.sh.

Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[run_script.sh: sed prefixes test names with filename] --> B[Mocha outputs JSON with fullTitle]
    B --> C[parser.py constructs name: file + fullTitle]
    C --> D["output.json: test name may have trailing space"]
    E["Dataset: fail_to_pass / pass_to_pass entries (no trailing space)"]
    D --> F{"Compare: f2p ∪ p2p ⊆ passed_tests"}
    E --> F
    F -- "Before fix: exact match fails" --> G[Instance marked unresolved ❌]
    F -- "After fix: .strip() normalizes both sides" --> H[Instance correctly resolved ✅]
```

Last reviewed commit: ab721d5
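The mismatch in the flowchart can be reproduced in a few lines (the file and title strings below are illustrative, not taken from the actual dataset):

```python
# Illustrative reproduction of the trailing-space mismatch.
# The sed prefixing in run_script.sh can leave a trailing space in the
# Mocha title, which parser.py then writes verbatim to output.json.
parsed_name = "test/posts.js should edit a post "   # as parsed from output.json
dataset_name = "test/posts.js should edit a post"   # as listed in fail_to_pass

print(parsed_name == dataset_name)                  # False: exact match fails
print(parsed_name.strip() == dataset_name.strip())  # True after normalizing
```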