Skip to content

Fix connect() crash on malformed CLAUDE_CODE_STREAM_CLOSE_TIMEOUT#763

Open
shuofengzhang wants to merge 2 commits intoanthropics:mainfrom
shuofengzhang:fix-invalid-timeout-env-parsing
Open

Fix connect() crash on malformed CLAUDE_CODE_STREAM_CLOSE_TIMEOUT#763
shuofengzhang wants to merge 2 commits intoanthropics:mainfrom
shuofengzhang:fix-invalid-timeout-env-parsing

Conversation

@shuofengzhang
Copy link
Copy Markdown

What changed

  • Added _parse_timeout_ms_from_env() in src/claude_agent_sdk/client.py to defensively parse timeout env values.
  • Updated ClaudeSDKClient.connect() to use the helper when reading CLAUDE_CODE_STREAM_CLOSE_TIMEOUT.
  • Added a regression test:
    • tests/test_streaming_client.py::TestClaudeSDKClientEdgeCases::test_connect_with_invalid_timeout_env_falls_back_to_default

Why

  • ClaudeSDKClient.connect() previously used int(os.environ[...]) for timeout parsing.
  • If the env var is malformed (for example "60s"), connection setup raises ValueError before the SDK can initialize.
  • This change keeps behavior identical for valid numeric values, while safely falling back to defaults for malformed/non-finite input.

Insight / Why this matters

  • Root cause: strict integer parsing on an externally provided environment variable.
  • Why easy to miss: most local/dev setups leave this variable unset or numeric, so malformed inherited env values are rarely tested.
  • Tradeoff: we now tolerate malformed values instead of hard-failing early. This slightly reduces strictness but improves operational robustness in CI/shell environments.
  • Impact: avoids startup failures in real automation workflows where environment variables can be inherited, templated, or accidentally misformatted.

Practical gain / Why this matters

  • Users embedding ClaudeSDKClient in scripts/services get more reliable startup behavior.
  • A single malformed env value no longer crashes initialization.
  • The change is narrow and rollback-safe (parsing path only, no protocol/transport changes).

Testing

  • .venv/bin/pytest -q tests/test_streaming_client.py::TestClaudeSDKClientEdgeCases::test_connect_with_invalid_timeout_env_falls_back_to_default
  • .venv/bin/pytest -q

Validation evidence:

  • New regression test sets CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60s and verifies connect() succeeds and fallback timeout is applied.

Risk analysis:

  • Low risk. Only timeout env parsing is touched.
  • No behavior change for valid numeric values.
  • If maintainers prefer strict failure semantics, this can be reverted by removing the helper and test in a single commit.

@km-anthropic
Copy link
Copy Markdown
Collaborator

@claude review

Comment on lines +24 to +46
def _parse_timeout_ms_from_env(
env_var: str, default_timeout_ms: float = 60000.0
) -> float:
"""Parse timeout milliseconds from environment, falling back safely.

Accepts integer or float-like strings. Invalid or non-finite values return
the provided default.
"""
raw = os.environ.get(env_var)
if raw is None:
return default_timeout_ms

try:
parsed = float(raw)
except (TypeError, ValueError):
return default_timeout_ms

if not math.isfinite(parsed):
return default_timeout_ms

return parsed


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The _parse_timeout_ms_from_env() helper is only applied in ClaudeSDKClient.connect() but the identical vulnerable int() pattern remains in src/claude_agent_sdk/_internal/client.py:103-105 (the query() one-shot path). Setting CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60s will still crash query() with a ValueError. The fix should use _parse_timeout_ms_from_env in both locations.

Extended reasoning...

What the bug is

This PR introduces a _parse_timeout_ms_from_env() helper in src/claude_agent_sdk/client.py that safely parses the CLAUDE_CODE_STREAM_CLOSE_TIMEOUT environment variable, falling back to a default when the value is malformed (e.g., "60s", "inf", or other non-numeric strings). The helper is correctly applied in ClaudeSDKClient.connect() at line 178. However, the identical vulnerable pattern was not updated in the separate query() one-shot code path at src/claude_agent_sdk/_internal/client.py:103-105.

The specific code path

In src/claude_agent_sdk/_internal/client.py, lines 103-105:

initialize_timeout_ms = int(
    os.environ.get("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT", "60000")
)

This is the code path used when calling the top-level query() function (the one-shot, non-streaming API). The comment on line 102 even says # Match ClaudeSDKClient.connect(), confirming the intent is for these two locations to stay in sync.

Step-by-step proof

  1. A user sets CLAUDE_CODE_STREAM_CLOSE_TIMEOUT=60s in their environment (a plausible mistake, mixing up seconds vs milliseconds notation).
  2. They call query(prompt="Hello") which enters the one-shot code path in _internal/client.py.
  3. At line 103, os.environ.get("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT", "60000") returns "60s".
  4. int("60s") raises ValueError: invalid literal for int() with base 10: 60s.
  5. The SDK crashes before it can even start the query, with no fallback.

Meanwhile, the same environment variable is now handled gracefully in ClaudeSDKClient.connect() thanks to this PR — creating an inconsistency where the streaming path works but the one-shot path crashes.

Impact

This is the exact same class of bug the PR aims to fix. Users who use the query() function (which is the simpler, recommended API for one-off prompts) remain vulnerable to the crash, while only users of the streaming ClaudeSDKClient are protected. The PR description states the goal is to avoid "startup failures in real automation workflows where environment variables can be inherited, templated, or accidentally misformatted" — but that goal is only half-achieved.

How to fix

Import and use the _parse_timeout_ms_from_env helper from client.py in _internal/client.py, replacing the int(os.environ.get(...)) call on lines 103-105 with:

from claude_agent_sdk.client import _parse_timeout_ms_from_env

initialize_timeout_ms = _parse_timeout_ms_from_env(
    "CLAUDE_CODE_STREAM_CLOSE_TIMEOUT"
)

Alternatively, move the helper to a shared utility module to avoid circular imports, or duplicate it in _internal/client.py.

Comment on lines 1206 to +1233

def test_connect_with_invalid_timeout_env_falls_back_to_default(self):
"""Malformed timeout env values should not break connect()."""

async def _test():
with (
patch.dict(
os.environ,
{"CLAUDE_CODE_STREAM_CLOSE_TIMEOUT": "60s"},
clear=False,
),
patch(
"claude_agent_sdk._internal.transport.subprocess_cli.SubprocessCLITransport"
) as mock_transport_class,
):
mock_transport = create_mock_transport()
mock_transport_class.return_value = mock_transport

client = ClaudeSDKClient()
await client.connect()

assert client._query is not None
# Default is 60000ms, and initialize timeout enforces a 60s minimum.
assert client._query._initialize_timeout == 60.0

await client.disconnect()

anyio.run(_test)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: The assertion client._query._initialize_timeout == 60.0 does not actually verify that the fallback value is correct. Because the downstream code applies max(..., 60.0), any parsed value <= 60000 (including 0 or -1 from a broken fallback) would produce the same 60.0 result. Consider adding a companion test with a valid high value like "120000" to confirm parsing works, then testing "60s" to show it falls back.

Extended reasoning...

What the issue is

The test test_connect_with_invalid_timeout_env_falls_back_to_default sets CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="60s" and asserts client._query._initialize_timeout == 60.0. The intent is to verify that malformed env values fall back to the default of 60000ms (which becomes 60.0s). However, the assertion cannot distinguish a correct fallback from a broken one.

Why the assertion is vacuous for fallback verification

The production code path is:

initialize_timeout_ms = _parse_timeout_ms_from_env("CLAUDE_CODE_STREAM_CLOSE_TIMEOUT")  # default: 60000.0
initialize_timeout = max(initialize_timeout_ms / 1000.0, 60.0)

Walk through the concrete scenarios:

  • Correct fallback (returns 60000.0): max(60000.0 / 1000.0, 60.0) = max(60.0, 60.0) = 60.0
  • Broken fallback returning 0: max(0.0 / 1000.0, 60.0) = max(0.0, 60.0) = 60.0
  • Broken fallback returning -1: max(-0.001, 60.0) = 60.0
  • Broken fallback returning 30000: max(30.0, 60.0) = 60.0

All of these pass the assertion. The max(..., 60.0) clamp masks any incorrect fallback value that happens to be <= 60000.

What the test does verify correctly

The test does successfully verify its primary stated goal: "Malformed timeout env values should not break connect()." The fact that connect() completes without raising ValueError is the real regression test, and that part works.

Impact

This is a test quality issue, not a production bug. If someone accidentally changed the default in _parse_timeout_ms_from_env from 60000.0 to 0, the production code would still behave correctly (the max() clamp ensures at least 60s), but the test would give false confidence that the fallback value itself is correct.

Suggested improvement

Add a test with a valid high value to confirm parsing works end-to-end:

  1. Set CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="120000" → assert _initialize_timeout == 120.0 (proves parsing works and the value exceeds the 60s floor)
  2. Then set CLAUDE_CODE_STREAM_CLOSE_TIMEOUT="60s" → assert _initialize_timeout == 60.0 (now this assertion is meaningful by contrast)

Alternatively, test _parse_timeout_ms_from_env directly to verify it returns 60000.0 for invalid input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants