Fix connect() crash on malformed CLAUDE_CODE_STREAM_CLOSE_TIMEOUT#763
Fix connect() crash on malformed CLAUDE_CODE_STREAM_CLOSE_TIMEOUT#763shuofengzhang wants to merge 2 commits intoanthropics:mainfrom
Conversation
|
@claude review |
| def _parse_timeout_ms_from_env( | ||
| env_var: str, default_timeout_ms: float = 60000.0 | ||
| ) -> float: | ||
| """Parse timeout milliseconds from environment, falling back safely. | ||
|
|
||
| Accepts integer or float-like strings. Invalid or non-finite values return | ||
| the provided default. | ||
| """ | ||
| raw = os.environ.get(env_var) | ||
| if raw is None: | ||
| return default_timeout_ms | ||
|
|
||
| try: | ||
| parsed = float(raw) | ||
| except (TypeError, ValueError): | ||
| return default_timeout_ms | ||
|
|
||
| if not math.isfinite(parsed): | ||
| return default_timeout_ms | ||
|
|
||
| return parsed | ||
|
|
||
|
|
There was a problem hiding this comment.
🔴 The _parse_timeout_ms_from_env() helper is only applied in ClaudeSDKClient.connect() but the identical vulnerable int() pattern remains in src/claude_agent_sdk/_internal/client.py:103-105 (the query() one-shot path). Setting CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60s will still crash query() with a ValueError. The fix should use _parse_timeout_ms_from_env in both locations.
Extended reasoning...
What the bug is
This PR introduces a _parse_timeout_ms_from_env() helper in src/claude_agent_sdk/client.py that safely parses the CLAUDE_CODE_STREAM_CLOSE_TIMEOUT environment variable, falling back to a default when the value is malformed (e.g., "60s", "inf", or other non-numeric strings). The helper is correctly applied in ClaudeSDKClient.connect() at line 178. However, the identical vulnerable pattern was not updated in the separate query() one-shot code path at src/claude_agent_sdk/_internal/client.py:103-105.
The specific code path
In src/claude_agent_sdk/_internal/client.py, lines 103-105:
initialize_timeout_ms = int(
os.environ.get("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT", "60000")
)This is the code path used when calling the top-level query() function (the one-shot, non-streaming API). The comment on line 102 even says # Match ClaudeSDKClient.connect(), confirming the intent is for these two locations to stay in sync.
Step-by-step proof
- A user sets
CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60sin their environment (a plausible mistake, mixing up seconds vs milliseconds notation). - They call
query(prompt="Hello")which enters the one-shot code path in_internal/client.py. - At line 103,
os.environ.get("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT", "60000")returns"60s". int("60s")raisesValueError: invalid literal for int() with base 10: 60s.- The SDK crashes before it can even start the query, with no fallback.
Meanwhile, the same environment variable is now handled gracefully in ClaudeSDKClient.connect() thanks to this PR — creating an inconsistency where the streaming path works but the one-shot path crashes.
Impact
This is the exact same class of bug the PR aims to fix. Users who use the query() function (which is the simpler, recommended API for one-off prompts) remain vulnerable to the crash, while only users of the streaming ClaudeSDKClient are protected. The PR description states the goal is to avoid "startup failures in real automation workflows where environment variables can be inherited, templated, or accidentally misformatted" — but that goal is only half-achieved.
How to fix
Import and use the _parse_timeout_ms_from_env helper from client.py in _internal/client.py, replacing the int(os.environ.get(...)) call on lines 103-105 with:
from claude_agent_sdk.client import _parse_timeout_ms_from_env
initialize_timeout_ms = _parse_timeout_ms_from_env(
"CLAUDE_CODE_STREAM_CLOSE_TIMEOUT"
)Alternatively, move the helper to a shared utility module to avoid circular imports, or duplicate it in _internal/client.py.
|
|
||
| def test_connect_with_invalid_timeout_env_falls_back_to_default(self): | ||
| """Malformed timeout env values should not break connect().""" | ||
|
|
||
| async def _test(): | ||
| with ( | ||
| patch.dict( | ||
| os.environ, | ||
| {"CLAUDE_CODE_STREAM_CLOSE_TIMEOUT": "60s"}, | ||
| clear=False, | ||
| ), | ||
| patch( | ||
| "claude_agent_sdk._internal.transport.subprocess_cli.SubprocessCLITransport" | ||
| ) as mock_transport_class, | ||
| ): | ||
| mock_transport = create_mock_transport() | ||
| mock_transport_class.return_value = mock_transport | ||
|
|
||
| client = ClaudeSDKClient() | ||
| await client.connect() | ||
|
|
||
| assert client._query is not None | ||
| # Default is 60000ms, and initialize timeout enforces a 60s minimum. | ||
| assert client._query._initialize_timeout == 60.0 | ||
|
|
||
| await client.disconnect() | ||
|
|
||
| anyio.run(_test) |
There was a problem hiding this comment.
🟡 Nit: The assertion client._query._initialize_timeout == 60.0 does not actually verify that the fallback value is correct. Because the downstream code applies max(..., 60.0), any parsed value <= 60000 (including 0 or -1 from a broken fallback) would produce the same 60.0 result. Consider adding a companion test with a valid high value like "120000" to confirm parsing works, then testing "60s" to show it falls back.
Extended reasoning...
What the issue is
The test test_connect_with_invalid_timeout_env_falls_back_to_default sets CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="60s" and asserts client._query._initialize_timeout == 60.0. The intent is to verify that malformed env values fall back to the default of 60000ms (which becomes 60.0s). However, the assertion cannot distinguish a correct fallback from a broken one.
Why the assertion is vacuous for fallback verification
The production code path is:
initialize_timeout_ms = _parse_timeout_ms_from_env("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT") # default: 60000.0
initialize_timeout = max(initialize_timeout_ms / 1000.0, 60.0)Walk through the concrete scenarios:
- Correct fallback (returns 60000.0):
max(60000.0 / 1000.0, 60.0)=max(60.0, 60.0)= 60.0 ✓ - Broken fallback returning 0:
max(0.0 / 1000.0, 60.0)=max(0.0, 60.0)= 60.0 ✓ - Broken fallback returning -1:
max(-0.001, 60.0)= 60.0 ✓ - Broken fallback returning 30000:
max(30.0, 60.0)= 60.0 ✓
All of these pass the assertion. The max(..., 60.0) clamp masks any incorrect fallback value that happens to be <= 60000.
What the test does verify correctly
The test does successfully verify its primary stated goal: "Malformed timeout env values should not break connect()." The fact that connect() completes without raising ValueError is the real regression test, and that part works.
Impact
This is a test quality issue, not a production bug. If someone accidentally changed the default in _parse_timeout_ms_from_env from 60000.0 to 0, the production code would still behave correctly (the max() clamp ensures at least 60s), but the test would give false confidence that the fallback value itself is correct.
Suggested improvement
Add a test with a valid high value to confirm parsing works end-to-end:
- Set
CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="120000"→ assert_initialize_timeout == 120.0(proves parsing works and the value exceeds the 60s floor) - Then set
CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="60s"→ assert_initialize_timeout == 60.0(now this assertion is meaningful by contrast)
Alternatively, test _parse_timeout_ms_from_env directly to verify it returns 60000.0 for invalid input.
What changed
_parse_timeout_ms_from_env()insrc/claude_agent_sdk/client.pyto defensively parse timeout env values.ClaudeSDKClient.connect()to use the helper when readingCLAUDE_CODE_STREAM_CLOSE_TIMEOUT.tests/test_streaming_client.py::TestClaudeSDKClientEdgeCases::test_connect_with_invalid_timeout_env_falls_back_to_defaultWhy
ClaudeSDKClient.connect()previously usedint(os.environ[...])for timeout parsing."60s"), connection setup raisesValueErrorbefore the SDK can initialize.Insight / Why this matters
Practical gain / Why this matters
ClaudeSDKClientin scripts/services get more reliable startup behavior.Testing
.venv/bin/pytest -q tests/test_streaming_client.py::TestClaudeSDKClientEdgeCases::test_connect_with_invalid_timeout_env_falls_back_to_default.venv/bin/pytest -qValidation evidence:
CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60sand verifiesconnect()succeeds and fallback timeout is applied.Risk analysis: