Conversation

@joyliu-q

Problem

During eval rollouts, thousands of reward model requests fire simultaneously, overwhelming the remote endpoint.

Unless the endpoint is unusually resilient, this causes:

  • Connection drops and timeouts
  • Cascading failures as retries compound the load
  • Training runs failing unnecessarily on transient network issues

Solution

Added robustness improvements to remote_rm:

Connection pooling & concurrency limiting

  • Reuse a global aiohttp.ClientSession instead of creating one per request
  • Added a semaphore to limit concurrent requests to 64 (prevents a thundering herd)
  • Configured the connection pool and timeouts (total=120s, connect=10s); see the sketch below
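
A minimal sketch of the pooling setup, assuming Python 3.10+. Names like `get_session` and `MAX_CONCURRENT_REQUESTS` are illustrative, not the actual `remote_rm` identifiers:

```python
import asyncio
import aiohttp

MAX_CONCURRENT_REQUESTS = 64  # cap on in-flight requests (thundering-herd guard)

_session: aiohttp.ClientSession | None = None
_semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

def get_session() -> aiohttp.ClientSession:
    """Lazily create one shared session so every request reuses the same pool."""
    global _session
    if _session is None or _session.closed:
        _session = aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(limit=MAX_CONCURRENT_REQUESTS),
            timeout=aiohttp.ClientTimeout(total=120, connect=10),
        )
    return _session
```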

Smart retry logic

  • Only retry transient errors (connection errors, timeouts, 5xx, 429)
  • Fail fast on client errors (4xx) and response parsing errors (e.g., missing "score" key)
  • Exponential backoff with jitter: min(2^attempt, 30) + random() seconds (see the sketch below)
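
A sketch of the retry loop, reusing `get_session` and `_semaphore` from above. `query_reward_model`, `TRANSIENT_STATUSES`, and `MAX_RETRIES` are assumed names, and the retry budget is not stated in this PR:

```python
import asyncio
import random
import aiohttp

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}  # retry-worthy HTTP codes
MAX_RETRIES = 5  # assumed value; not specified in this PR

async def query_reward_model(url: str, payload: dict) -> float:
    session = get_session()
    last_exc: Exception = RuntimeError("retry budget exhausted")
    for attempt in range(MAX_RETRIES):
        try:
            async with _semaphore:  # bound concurrency across all callers
                async with session.post(url, json=payload) as resp:
                    if resp.status not in TRANSIENT_STATUSES:
                        resp.raise_for_status()  # other 4xx: raises, never retried
                        data = await resp.json()
                        return data["score"]     # missing key: KeyError, fails fast
                    last_exc = RuntimeError(f"transient HTTP {resp.status}")
        except (aiohttp.ClientConnectionError, asyncio.TimeoutError) as exc:
            last_exc = exc  # connection drop or timeout: transient, retry
        if attempt < MAX_RETRIES - 1:
            # Exponential backoff with jitter, capped at 30s per wait
            await asyncio.sleep(min(2 ** attempt, 30) + random.random())
    raise last_exc
```

Fail-fast behavior falls out of the exception hierarchy: `raise_for_status()` raises `aiohttp.ClientResponseError`, which is not caught by the `except` clause above, so 4xx and parse errors propagate immediately.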

Cleanup

  • Registered an atexit handler to close the session on process exit (see the sketch below)
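
A best-effort version of the cleanup hook; again a sketch of `_session` from the pooling code above, not the exact handler:

```python
import asyncio
import atexit

@atexit.register
def _close_session() -> None:
    """Best-effort close of the shared session when the process exits."""
    if _session is not None and not _session.closed:
        try:
            # The training loop's event loop is usually gone by exit time,
            # so run the close coroutine on a short-lived throwaway loop.
            asyncio.new_event_loop().run_until_complete(_session.close())
        except Exception:
            pass  # cleanup is best-effort; the OS reclaims sockets regardless
```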

Testing

  • Ran training against a flaky endpoint; retries handled connection drops gracefully
  • Verified that 4xx errors fail fast without wasting the retry budget

@joyliu-q changed the title from "Make the Remote Reward Model more resillient instead of failing on connection drops or endpoint shakiness" to "Add retries to the Remote Reward Model, do not fail on connection drops or endpoint instability" on Feb 12, 2026