Skip to content

fix: disable TCP keepalive by default to avoid connection timeouts#276

Open
viraatc wants to merge 1 commit intomainfrom
feat/viraatc-drop-keepalive
Open

fix: disable TCP keepalive by default to avoid connection timeouts#276
viraatc wants to merge 1 commit intomainfrom
feat/viraatc-drop-keepalive

Conversation

@viraatc
Copy link
Copy Markdown
Collaborator

@viraatc viraatc commented Apr 10, 2026

Keepalive probes caused connection timeout errors in offline and high-concurrency modes. Dead connections are already handled by connection_lost/eof_received callbacks in the protocol layer.

attempt to close #202

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Keepalive probes caused connection timeout errors in offline and
high-concurrency modes. Dead connections are already handled by
connection_lost/eof_received callbacks in the protocol layer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@viraatc viraatc requested review from a team and Copilot April 10, 2026 01:09
@github-actions github-actions bot requested a review from arekay-nv April 10, 2026 01:09
@github-actions
Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request disables TCP keepalive for the inference endpoint client to mitigate connection timeout errors observed in high-concurrency environments. A review comment points out that disabling keepalive may lead to "zombie" connections during silent network drops, as the current implementation lacks timeouts in the read methods; it is recommended to implement application-level timeouts or use TCP_USER_TIMEOUT as a safeguard.

Comment on lines +55 to +57
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Disabling TCP keepalive can lead to "zombie" connections in the pool if a connection is silently dropped by a network middlebox (e.g., a firewall or NAT gateway). While connection_lost and eof_received handle active closures (FIN/RST), they cannot detect silent drops. Since the protocol's read_headers and read_body methods lack timeouts, a request using such a connection could hang indefinitely. It is recommended to ensure an application-level timeout is implemented elsewhere or to consider a non-zero TCP_USER_TIMEOUT (on Linux) as a safeguard.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Disables TCP keepalive probes by default in the endpoint HTTP client to reduce connection timeout errors observed in offline and high-concurrency modes (issue #202), relying instead on protocol/connection lifecycle handling in the client.

Changes:

  • Set _SocketConfig.SO_KEEPALIVE default to 0 (disabled).
  • Gate application of TCP_KEEP* tuning behind SO_KEEPALIVE being enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +55 to +61
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
SO_KEEPALIVE: int = 0 # Disabled
TCP_KEEPIDLE: int = 1 # Probe after 1s idle (disabled)
TCP_KEEPCNT: int = 5 # 5 failed probes = dead (disabled)
TCP_KEEPINTVL: int = 1 # 1s between probes (disabled)
Comment on lines +59 to +61
TCP_KEEPIDLE: int = 1 # Probe after 1s idle (disabled)
TCP_KEEPCNT: int = 5 # 5 failed probes = dead (disabled)
TCP_KEEPINTVL: int = 1 # 1s between probes (disabled)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"max_throughput" mode causes connection timeouts

2 participants