fix: disable TCP keepalive by default to avoid connection timeouts#276
Conversation
Keepalive probes caused connection timeout errors in offline and high-concurrency modes. Dead connections are already handled by connection_lost/eof_received callbacks in the protocol layer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code Review
This pull request disables TCP keepalive for the inference endpoint client to mitigate connection timeout errors observed in high-concurrency environments. A review comment points out that disabling keepalive may lead to "zombie" connections during silent network drops, as the current implementation lacks timeouts in the read methods; it is recommended to implement application-level timeouts or use TCP_USER_TIMEOUT as a safeguard.
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
Disabling TCP keepalive can lead to "zombie" connections in the pool if a connection is silently dropped by a network middlebox (e.g., a firewall or NAT gateway). While connection_lost and eof_received handle active closures (FIN/RST), they cannot detect silent drops. Since the protocol's read_headers and read_body methods lack timeouts, a request using such a connection could hang indefinitely. It is recommended to ensure an application-level timeout is implemented elsewhere or to consider a non-zero TCP_USER_TIMEOUT (on Linux) as a safeguard.
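The safeguard suggested in this comment can be sketched in Python: bound every read with an application-level timeout via asyncio.wait_for, and (on Linux) set a non-zero TCP_USER_TIMEOUT so the kernel aborts a connection whose transmitted data goes unacknowledged. The helper names below are hypothetical illustrations, not the client's actual API.

```python
import asyncio
import socket


def apply_user_timeout(sock: socket.socket, timeout_ms: int = 30_000) -> None:
    """On Linux, abort the connection if sent data stays unacknowledged
    for timeout_ms; a no-op on platforms without TCP_USER_TIMEOUT."""
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)


async def read_with_timeout(reader: asyncio.StreamReader,
                            timeout_s: float = 60.0) -> bytes:
    """Application-level timeout: a request on a silently dropped
    connection raises TimeoutError instead of hanging indefinitely."""
    return await asyncio.wait_for(reader.readline(), timeout=timeout_s)
```

With either guard in place, a connection silently dropped by a middlebox is surfaced as an error the pool can act on, rather than an indefinite hang in read_headers/read_body.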
Pull request overview
Disables TCP keepalive probes by default in the endpoint HTTP client to reduce connection timeout errors observed in offline and high-concurrency modes (issue #202), relying instead on protocol/connection lifecycle handling in the client.
Changes:
- Set `_SocketConfig.SO_KEEPALIVE` default to `0` (disabled).
- Gate application of `TCP_KEEP*` tuning behind `SO_KEEPALIVE` being enabled.
# NOTE(vir):
# we hit lots of connection timed out errors in offline and high-concurrency modes,
# disabling since we handle dead-connections in http.py connection_lost/eof_received
SO_KEEPALIVE: int = 0   # Disabled
TCP_KEEPIDLE: int = 1   # Probe after 1s idle (disabled)
TCP_KEEPCNT: int = 5    # 5 failed probes = dead (disabled)
TCP_KEEPINTVL: int = 1  # 1s between probes (disabled)
attempt to close #202
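For context on relying on protocol callbacks rather than keepalive probes: in asyncio, eof_received fires on a clean FIN and connection_lost on RST or other errors, which is where a client can evict dead connections from its pool. A minimal sketch (the pool here is a hypothetical stand-in for the client's connection pool); note that, as the review points out, neither callback fires on a silent network drop.

```python
import asyncio


class EndpointProtocol(asyncio.Protocol):
    """Sketch: evict a pooled connection when the peer actively closes it."""

    def __init__(self, pool: set):
        self.pool = pool

    def connection_made(self, transport) -> None:
        self.transport = transport
        self.pool.add(self)

    def eof_received(self):
        # Peer sent FIN: drop from the pool; returning None (falsy)
        # lets the transport close itself.
        self.pool.discard(self)
        return None

    def connection_lost(self, exc) -> None:
        # Covers RST and transport errors; a silent middlebox drop
        # will NOT trigger this callback.
        self.pool.discard(self)
```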
What does this PR do?
Type of change
Related issues
Testing
Checklist