fix: disable TCP keepalive by default to avoid connection timeouts#276
Conversation
Keepalive probes caused connection timeout errors in offline and high-concurrency modes. Dead connections are already handled by connection_lost/eof_received callbacks in the protocol layer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request disables TCP keepalive for the inference endpoint client to mitigate connection timeout errors observed in high-concurrency environments. A review comment points out that disabling keepalive may lead to "zombie" connections during silent network drops, as the current implementation lacks timeouts in the read methods; it is recommended to implement application-level timeouts or use TCP_USER_TIMEOUT as a safeguard.
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
Disabling TCP keepalive can lead to "zombie" connections in the pool if a connection is silently dropped by a network middlebox (e.g., a firewall or NAT gateway). While connection_lost and eof_received handle active closures (FIN/RST), they cannot detect silent drops. Since the protocol's read_headers and read_body methods lack timeouts, a request using such a connection could hang indefinitely. It is recommended to ensure an application-level timeout is implemented elsewhere or to consider a non-zero TCP_USER_TIMEOUT (on Linux) as a safeguard.
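The safeguard suggested in this comment can be sketched in Python: bound every read with an application-level timeout via asyncio.wait_for, and (on Linux) set a non-zero TCP_USER_TIMEOUT so the kernel aborts a connection whose transmitted data goes unacknowledged. The helper names below are hypothetical illustrations, not the client's actual API.

```python
import asyncio
import socket


def apply_user_timeout(sock: socket.socket, timeout_ms: int = 30_000) -> None:
    """On Linux, abort the connection if sent data stays unacknowledged
    for timeout_ms; a no-op on platforms without TCP_USER_TIMEOUT."""
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)


async def read_with_timeout(reader: asyncio.StreamReader,
                            timeout_s: float = 60.0) -> bytes:
    """Application-level timeout: a request on a silently dropped
    connection raises TimeoutError instead of hanging indefinitely."""
    return await asyncio.wait_for(reader.readline(), timeout=timeout_s)
```

With either guard in place, a connection silently dropped by a middlebox is surfaced as an error the pool can act on, rather than an indefinite hang in read_headers/read_body.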
Pull request overview
Disables TCP keepalive probes by default in the endpoint HTTP client to reduce connection timeout errors observed in offline and high-concurrency modes (issue #202), relying instead on protocol/connection lifecycle handling in the client.
Changes:
- Set `_SocketConfig.SO_KEEPALIVE` default to `0` (disabled).
- Gate application of `TCP_KEEP*` tuning behind `SO_KEEPALIVE` being enabled.
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
SO_KEEPALIVE: int = 0   # Disabled
TCP_KEEPIDLE: int = 1   # Probe after 1s idle (disabled)
TCP_KEEPCNT: int = 5    # 5 failed probes = dead (disabled)
TCP_KEEPINTVL: int = 1  # 1s between probes (disabled)
attempt to close #202
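For context on relying on protocol callbacks rather than keepalive probes: in asyncio, eof_received fires on a clean FIN and connection_lost on RST or other errors, which is where a client can evict dead connections from its pool. A minimal sketch (the pool here is a hypothetical stand-in for the client's connection pool); note that, as the review points out, neither callback fires on a silent network drop.

```python
import asyncio


class EndpointProtocol(asyncio.Protocol):
    """Sketch: evict a pooled connection when the peer actively closes it."""

    def __init__(self, pool: set):
        self.pool = pool

    def connection_made(self, transport) -> None:
        self.transport = transport
        self.pool.add(self)

    def eof_received(self):
        # Peer sent FIN: drop from the pool; returning None (falsy)
        # lets the transport close itself.
        self.pool.discard(self)
        return None

    def connection_lost(self, exc) -> None:
        # Covers RST and transport errors; a silent middlebox drop
        # will NOT trigger this callback.
        self.pool.discard(self)
```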
What does this PR do?
Type of change
Related issues
Testing
Checklist