Conversation

@joyliu-q

Problem

During eval rollouts, thousands of reward model requests fire simultaneously, overwhelming the remote endpoint.

Unless the endpoint is unusually resilient, this causes:

  • Connection drops and timeouts
  • Cascading failures as retries compound the load
  • Training runs failing unnecessarily on transient network issues

Solution

Added robustness improvements to remote_rm:

Connection pooling & concurrency limiting

  • Reuse a global aiohttp.ClientSession instead of creating one per request
  • Added a semaphore to limit concurrent requests to 64 (prevents a thundering herd)
  • Configured the connection pool and timeouts (total=120s, connect=10s); see the sketch below
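
A minimal sketch of the pooling setup, assuming Python 3.10+. Names like `get_session` and `MAX_CONCURRENT_REQUESTS` are illustrative, not the actual `remote_rm` identifiers:

```python
import asyncio
import aiohttp

MAX_CONCURRENT_REQUESTS = 64  # cap on in-flight requests (thundering-herd guard)

_session: aiohttp.ClientSession | None = None
_semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

def get_session() -> aiohttp.ClientSession:
    """Lazily create one shared session so every request reuses the same pool."""
    global _session
    if _session is None or _session.closed:
        _session = aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(limit=MAX_CONCURRENT_REQUESTS),
            timeout=aiohttp.ClientTimeout(total=120, connect=10),
        )
    return _session
```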

Smart retry logic

  • Only retry transient errors (connection errors, timeouts, 5xx, 429)
  • Fail fast on client errors (4xx) and response parsing errors (e.g., missing "score" key)
  • Exponential backoff with jitter: min(2^attempt, 30) + random() seconds (see the sketch below)
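
A sketch of the retry loop, reusing `get_session` and `_semaphore` from above. `query_reward_model`, `TRANSIENT_STATUSES`, and `MAX_RETRIES` are assumed names, and the retry budget is not stated in this PR:

```python
import asyncio
import random
import aiohttp

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}  # retry-worthy HTTP codes
MAX_RETRIES = 5  # assumed value; not specified in this PR

async def query_reward_model(url: str, payload: dict) -> float:
    session = get_session()
    last_exc: Exception = RuntimeError("retry budget exhausted")
    for attempt in range(MAX_RETRIES):
        try:
            async with _semaphore:  # bound concurrency across all callers
                async with session.post(url, json=payload) as resp:
                    if resp.status not in TRANSIENT_STATUSES:
                        resp.raise_for_status()  # other 4xx: raises, never retried
                        data = await resp.json()
                        return data["score"]     # missing key: KeyError, fails fast
                    last_exc = RuntimeError(f"transient HTTP {resp.status}")
        except (aiohttp.ClientConnectionError, asyncio.TimeoutError) as exc:
            last_exc = exc  # connection drop or timeout: transient, retry
        if attempt < MAX_RETRIES - 1:
            # Exponential backoff with jitter, capped at 30s per wait
            await asyncio.sleep(min(2 ** attempt, 30) + random.random())
    raise last_exc
```

Fail-fast behavior falls out of the exception hierarchy: `raise_for_status()` raises `aiohttp.ClientResponseError`, which is not caught by the `except` clause above, so 4xx and parse errors propagate immediately.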

Cleanup

  • Registered an atexit handler to close the session on process exit (see the sketch below)
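
A best-effort version of the cleanup hook; again a sketch of `_session` from the pooling code above, not the exact handler:

```python
import asyncio
import atexit

@atexit.register
def _close_session() -> None:
    """Best-effort close of the shared session when the process exits."""
    if _session is not None and not _session.closed:
        try:
            # The training loop's event loop is usually gone by exit time,
            # so run the close coroutine on a short-lived throwaway loop.
            asyncio.new_event_loop().run_until_complete(_session.close())
        except Exception:
            pass  # cleanup is best-effort; the OS reclaims sockets regardless
```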

Testing

  • Ran training against a flaky endpoint; retries handled connection drops gracefully
  • Verified that 4xx errors fail fast without wasting the retry budget

@joyliu-q changed the title from "Make the Remote Reward Model more resillient instead of failing on connection drops or endpoint shakiness" to "Add retries to the Remote Reward Model, do not fail on connection drops or endpoint instability" on Feb 12, 2026