Fix ZeroDivisionError and NaN propagation in reward model metrics #3779
Problem
The reward model metric functions in `model/model_training/metrics.py` have three related bugs that can crash training or silently corrupt evaluation results:

1. ZeroDivisionError in `kendall_tau()` and `spearmanr()`: Both functions divide by `bsize` at the end without checking whether it is zero. When all labels in a batch are padding (`-100`), no label groups are found, `bsize` stays at 0, and the division raises `ZeroDivisionError`.
2. NaN propagation: When a label group contains fewer than 2 ranked items, `scipy.stats.kendalltau` and `scipy.stats.spearmanr` return `NaN` (correlation is undefined for single-element arrays). This `NaN` is added to the running sum and silently corrupts the final metric value, making it `NaN` for the entire evaluation step.
3. Empty array in `reward_accuracy()`: If no valid label groups are found, `pos_scores` and `neg_scores` remain empty lists. Calling `np.mean` on an empty array produces a `RuntimeWarning` and returns `NaN`.

Fix
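- Skip label groups with fewer than 2 ranked items in `kendall_tau()` and `spearmanr()`.
- Skip correlation results that are `NaN`, and only increment `bsize` for valid results.
- Return `0.0` instead of dividing by zero when `bsize` is 0.
- Return `0.0` from `reward_accuracy()` when no scores were collected.

Testing
These edge cases arise during reward model evaluation when, for example, all labels in a batch are padding (`-100`).

The fix ensures that metric computation completes without errors and returns deterministic fallback values (`0.0`) rather than crashing or returning `NaN`.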
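
The guard pattern described above can be sketched roughly as follows. This is not the code from `metrics.py`, which is not shown here: `mean_correlation`, `mean_or_zero`, and `toy_kendall_tau` are hypothetical names, the input is assumed to already be split into per-group `(scores, ranks)` pairs, and the toy correlation is a pure-Python stand-in so the sketch runs without SciPy.

```python
import math
from itertools import combinations
from typing import Callable, Iterable, Sequence, Tuple

# Assumed input shape: one (predicted scores, reference ranks) pair per label group.
Group = Tuple[Sequence[float], Sequence[float]]


def toy_kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Hypothetical stand-in for a library correlation (tau-a, ties count as zero)."""
    pairs = list(combinations(range(len(x)), 2))
    if not pairs:
        return math.nan  # correlation is undefined for fewer than 2 items
    score = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / len(pairs)


def mean_correlation(
    groups: Iterable[Group],
    corr_fn: Callable[[Sequence[float], Sequence[float]], float] = toy_kendall_tau,
) -> float:
    """Average corr_fn over groups, skipping undefined results."""
    total, bsize = 0.0, 0
    for scores, ranks in groups:
        if len(scores) < 2:
            continue  # too few ranked items for a correlation
        tau = corr_fn(scores, ranks)
        if math.isnan(tau):
            continue  # keep NaN out of the running sum
        total += tau
        bsize += 1  # count only valid results
    # Fallback instead of ZeroDivisionError when nothing was valid.
    return total / bsize if bsize > 0 else 0.0


def mean_or_zero(values: Sequence[float]) -> float:
    """Mean with a 0.0 fallback for an empty sequence."""
    return sum(values) / len(values) if values else 0.0
```

Counting only valid results in `bsize` keeps the average over groups that actually produced a correlation, so a skipped group neither adds `NaN` to the sum nor dilutes the mean. Whether `0.0` is the right fallback is a design choice: it keeps logging and checkpoint selection from breaking, but it is indistinguishable from a genuinely uncorrelated batch.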