@Mr-Neutr0n

Problem

The reward model metric functions in model/model_training/metrics.py have three related bugs that can cause training to crash or silently produce corrupt evaluation results:

  1. ZeroDivisionError in kendall_tau() and spearmanr(): Both functions divide by bsize at the end without checking whether it is zero. When all labels in a batch are padding (-100), no label groups are found, bsize stays at 0, and the division raises ZeroDivisionError.

  2. NaN propagation: When a label group contains fewer than 2 ranked items, scipy.stats.kendalltau and scipy.stats.spearmanr return NaN (correlation is undefined for single-element arrays). This NaN gets added to the running sum and silently corrupts the final metric value, making it NaN for the entire evaluation step.

  3. Empty array in reward_accuracy(): If no valid label groups are found, pos_scores and neg_scores remain empty lists, and calling np.mean on an empty array emits a RuntimeWarning and returns NaN. (Both NaN behaviors are reproduced in the snippet below.)
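For reference, both NaN behaviors are easy to reproduce in isolation (a minimal illustration with scipy and numpy, not code from metrics.py):

```python
import warnings

import numpy as np
from scipy import stats

# A single-element group: correlation is undefined, scipy returns NaN.
tau, _ = stats.kendalltau([1.0], [1.0])
print(tau)  # nan

# np.mean over an empty array emits a RuntimeWarning and yields NaN.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    print(np.mean(np.array([])))  # nan
```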

Fix

  • Skip label groups with fewer than 2 items in kendall_tau() and spearmanr() (see the sketch after this list).
  • Only accumulate results that are not NaN, and only increment bsize for valid results.
  • Return 0.0 instead of dividing by zero when bsize is 0.
  • Return zeroed metrics dict early in reward_accuracy() when no scores were collected.
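A sketch of the guarded accumulation pattern (illustrative only: the function name, the pre-grouped input shape, and the exact signature in metrics.py are assumptions):

```python
import numpy as np
from scipy import stats

def kendall_tau_safe(label_groups):
    """label_groups: iterable of (predicted_ranks, true_ranks) pairs, one
    pair per prompt's group of ranked responses (hypothetical shape)."""
    score, bsize = 0.0, 0
    for pred_ranks, true_ranks in label_groups:
        # Correlation is undefined for fewer than 2 ranked items.
        if len(pred_ranks) < 2:
            continue
        tau, _ = stats.kendalltau(pred_ranks, true_ranks)
        # scipy can still return NaN (e.g. for constant inputs); only
        # accumulate valid results, and only then increment bsize.
        if np.isnan(tau):
            continue
        score += tau
        bsize += 1
    # Deterministic fallback instead of ZeroDivisionError when bsize is 0.
    return score / bsize if bsize > 0 else 0.0
```

The spearmanr() variant would apply the same guards, with stats.spearmanr in place of stats.kendalltau.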

Testing

These edge cases arise during reward model evaluation when:

  • A batch contains only padding tokens (all labels are -100)
  • A label group has only a single ranked item (e.g., only one response for a given prompt)

The fix ensures that metric computation completes without errors and returns deterministic fallback values (0.0) rather than crashing or returning NaN.
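The reward_accuracy() guard follows the same idea (a sketch; the metric key names and the paired-score assumption are mine, not necessarily the repo's exact dict):

```python
import numpy as np

def reward_accuracy_safe(pos_scores, neg_scores):
    # Return zeroed metrics early instead of letting np.mean turn
    # empty arrays into NaN when no valid label groups were found.
    if len(pos_scores) == 0 or len(neg_scores) == 0:
        return {"pos_score": 0.0, "neg_score": 0.0,
                "score_diff": 0.0, "accuracy": 0.0}
    # Assumes one positive and one negative score per comparison pair.
    pos, neg = np.asarray(pos_scores), np.asarray(neg_scores)
    return {
        "pos_score": float(pos.mean()),
        "neg_score": float(neg.mean()),
        "score_diff": float(pos.mean() - neg.mean()),
        "accuracy": float((pos > neg).mean()),
    }
```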

The kendall_tau() and spearmanr() functions divide by bsize without
checking if it is zero. This causes a ZeroDivisionError when all
labels are padding (-100) or when the input batch has no valid
label groups.

Additionally, when a label group has fewer than 2 ranked items,
scipy's kendalltau/spearmanr return NaN, which silently propagates
through the accumulated score and corrupts the final metric value.

Changes:
- Skip label groups with fewer than 2 items (correlation is
  undefined for single-element arrays)
- Only increment bsize for groups that produce a valid (non-NaN)
  correlation result
- Return 0.0 instead of dividing by zero when bsize is 0
- Guard reward_accuracy() against empty score arrays, which would
  cause np.mean to return NaN on an empty array
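Using the hypothetical kendall_tau_safe sketch above, the two edge cases from this list reduce to deterministic fallbacks:

```python
# All labels are padding (-100): no label groups are extracted at all.
assert kendall_tau_safe([]) == 0.0
# A group with a single ranked item is skipped instead of adding NaN.
assert kendall_tau_safe([([1.0], [1.0])]) == 0.0
```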