Skip to content

Conversation

@enyst
Copy link
Collaborator

@enyst enyst commented Nov 28, 2025

This PR proposes to make the wsproto test more robust, it can happen that terminate fails in pytest.

—-

(Agent Summary)

  • Record creator_pid at fixture setup and ensure teardown only runs in the original parent process

Why

  • pytest session-scoped fixtures can teardown in forked workers (e.g., pytest-xdist, pytest-forked), causing assertions if a process tries to manage a process it didn’t create

How

  • Capture creator_pid = os.getpid() when starting the child
  • During teardown, guard with if os.getpid() == creator_pid: before terminate/join/kill
  • Keep the simple liveness-based shutdown sequence

Result

  • Clean, explicit, and portable teardown logic
  • Avoids private attribute access
  • Verified locally: pre-commit hooks pass and the wsproto tests pass

Co-authored-by: openhands [email protected]

@enyst can click here to continue refining the PR

…ate attr\n\n- Record creator_pid and only tear down from original parent\n- Avoid using multiprocessing.Process._parent_pid which is private\n- Minor safety: terminate only if alive on startup failure\n\nCo-authored-by: openhands <[email protected]>
@enyst
Copy link
Collaborator Author

enyst commented Nov 29, 2025

@OpenHands Think deeply, run the entire test suite, then run this test in isolation, and come up with a few reasons why was this flaky, which MIGHT NOT be the same reason this PR attempts to fix.

/codereview-roasted

@openhands-ai
Copy link

openhands-ai bot commented Nov 29, 2025

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Nov 29, 2025

Summary of work and answer

What I did

  • Built the project: make build
  • Ran the entire test suite: uv run pytest
    • Observed an early failure in tests/agent_server/test_agent_server_wsproto.py during the full run (.F).
    • Saw many “Exception ignored in atexit callback” messages from LocalConversation.close complaining “Agent not initialized,” likely unrelated noise but indicative of lifecycle issues elsewhere.
  • Ran the wsproto tests in isolation: uv run pytest tests/agent_server/test_agent_server_wsproto.py
    • Passed consistently (2 tests).
  • Stress-ran the wsproto file 10 times back-to-back
    • 10/10 green; couldn’t reproduce flakiness in isolation.

Why this was flaky (beyond the PR’s teardown guard)

  • Port selection race (TOCTOU): The test finds a “free” port, closes it, then starts the server. Under load/parallelism, some other test can grab the port first, causing intermittent startup failures.
  • Heavy readiness probe: Using /docs for readiness can be slow under load (OpenAPI UI generation/serving). Combined with a tight per-request timeout (1s), this can sporadically fail the health check loop even though the server is about to be ready.
  • Session-scoped lifetime: The server persists across the entire test session. Other tests can interfere with it (state changes, shutdown paths, CPU contention), making failures order-dependent.
  • Start-method/platform variability: spawn vs fork differences (CI/OS dependent) can change startup timing and behavior, contributing to occasional flakes.
  • Websocket timing sensitivity: Short open_timeout/recv windows (5s/2s) are mostly fine here, but under heavy CI load they can still be borderline.
  • Atexit cleanup noise elsewhere: The LocalConversation.close tracebacks suggest lifecycle inconsistencies. While not directly causing this test to fail, they add nondeterminism during global teardown.

Bottom line

  • The PR’s fix (guard teardown to only run in the creating process) addresses one real flake vector with xdist/forked teardown, but the biggest contributors are the free-port race and the heavy /docs readiness probe. Even after merging, I’d expect occasional flakes until those are tightened.
  • Simple mitigation ideas:
    • Avoid TOCTOU free-port selection (ideally start the server on port 0 and read back the bound port; otherwise reduce the race window).
    • Use a lightweight health endpoint with a slightly higher request timeout (e.g., 3–5s per attempt).
    • Consider narrower fixture scope if feasible to reduce cross-test interference.
    • Clean up the atexit lifecycle errors in LocalConversation to reduce teardown noise.

View full conversation

@enyst enyst marked this pull request as draft November 29, 2025 23:01
@enyst enyst changed the title tests: make wsproto agent_server fixture teardown robust without private attr tests: make wsproto agent_server fixture teardown robust Nov 30, 2025
@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Dec 16, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@enyst
Copy link
Collaborator Author

enyst commented Dec 20, 2025

@OpenHands Read the entire PR diff and PR comments.

Please fix these two:

  1. Port selection race (TOCTOU): The test finds a “free” port, closes it, then starts the server. Under load/parallelism, some other test can grab the port first, causing intermittent startup failures.
  2. Heavy readiness probe: Using /docs for readiness can be slow under load (OpenAPI UI generation/serving). Combined with a tight per-request timeout (1s), this can sporadically fail the health check loop even though the server is about to be ready.

@openhands-ai
Copy link

openhands-ai bot commented Dec 20, 2025

I'm on it! enyst can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Dec 20, 2025

Summary of changes pushed to chore/robust-wsproto-teardown

Your two requested fixes have been implemented in a single, focused change to tests/agent_server/test_agent_server_wsproto.py and pushed to the existing PR branch.

What I fixed

  1. Port selection TOCTOU race
  • Pre-bind a TCP socket on 127.0.0.1:0 (SO_REUSEADDR, listen) and pass its file descriptor directly to uvicorn via fd=.
  • Close the parent-side socket after starting the child so the child exclusively owns the listener.
  • This removes the “find free port, close it, start server” window where another process could grab the port.
  1. Heavy readiness probe (/docs) under load
  • Switched readiness loop to GET /alive with a slightly more tolerant approach:
    • timeout=3s, 0.25s backoff, up to ~15s total.
  • Avoids the overhead of OpenAPI UI rendering/serving during readiness.

Additional robustness for auth and WebSockets

  • Ensure the spawned server reads auth config consistently:
    • Set both OH_SESSION_API_KEYS and legacy SESSION_API_KEY in the child.
    • Build a fresh Config from env in the child and initialize the FastAPI app with create_app(config).
    • Synchronize the module-level default config so sockets.py (which calls get_default_config()) uses the same session_api_keys, preventing intermittent 401/403s on WS connections.

Verification

  • Pre-commit hooks on the changed file: passed (ruff format/lint, pycodestyle, pyright).
  • tests/agent_server/test_agent_server_wsproto.py: 2 tests passed.
  • Full agent_server suite: 455 passed, 17 warnings.

Conciseness

  • Only one file changed: tests/agent_server/test_agent_server_wsproto.py
  • No extraneous edits beyond addressing the two issues and ensuring stable auth/WS behavior for the fixture.

Checklist

  • Port selection race eliminated via fd-based binding
  • Readiness probe switched to /alive with tolerant loop
  • WS auth aligned with runtime config (no 401/403)
  • Pre-commit and tests pass
  • Minimal, targeted diff

These changes have been committed and pushed to chore/robust-wsproto-teardown with Co-authored-by: openhands [email protected].

View full conversation

@all-hands-bot
Copy link
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

2 similar comments
@all-hands-bot
Copy link
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@all-hands-bot
Copy link
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @enyst, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@enyst enyst closed this Feb 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants