fix: serialize tmux session creation to prevent race conditions#1889

Merged
xingyaoww merged 13 commits into main from fix/tmux-session-race-condition
Feb 4, 2026
@neubig
Contributor

@neubig neubig commented Feb 3, 2026

Problem

When multiple parallel agents share the same Docker workspace (same container), they all attempt to create tmux sessions on the same tmux server simultaneously. The libtmux library has a race condition where new_session() creates a session but then fails to find it immediately, throwing TmuxObjectDoesNotExist.

This causes agents to fail with:

Failed to create tmux session after 3 attempts

Solution

Use file-based locking (fcntl.flock()) to serialize tmux session creation within the container. This ensures only one session is created at a time, eliminating the race condition.
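A minimal sketch of this approach, assuming a POSIX system where `fcntl.flock()` is available. The helper name `serialized_tmux_creation` and the temporary lock path in the usage lines are illustrative and not taken from the PR:

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def serialized_tmux_creation(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    lock_fd = None
    try:
        # Open (or create) the lock file; its contents are never used.
        lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)
        # Blocks until no other open file description holds the lock.
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        yield lock_fd
    finally:
        if lock_fd is not None:
            # Unlock explicitly for clarity, then release the descriptor.
            fcntl.flock(lock_fd, fcntl.LOCK_UN)
            os.close(lock_fd)


# Hypothetical usage: real code would call server.new_session(...) inside the block.
demo_lock_path = os.path.join(tempfile.mkdtemp(), "demo-tmux-session.lock")
with serialized_tmux_creation(demo_lock_path) as fd:
    lock_was_held = fd >= 0
```

The lock is advisory, so it only serializes callers that go through the same helper; processes that create tmux sessions without taking the lock are not affected by it.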

Changes

  • Added _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
  • Added _TMUX_THREAD_LOCK = threading.Lock() for intra-process safety
  • Wrapped session creation with both locks (file lock for cross-process, thread lock for same-process)

Testing

Tested with 20 parallel conversations in a single Docker workspace - all tmux sessions created successfully (0 failures).
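The test above ran real conversations in Docker. As a much smaller stand-in, the following sketch checks the same property (a critical section serialized by `flock()`) with 20 threads that each open their own descriptor. The counter file and the function name `create_fake_session` are hypothetical and only simulate session creation:

```python
import fcntl
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

workdir = tempfile.mkdtemp()
lock_path = os.path.join(workdir, "stress.lock")
counter_path = os.path.join(workdir, "counter.txt")
with open(counter_path, "w") as f:
    f.write("0")


def create_fake_session(_):
    # Each worker opens its own descriptor, as each agent would.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        # Unsynchronized read-modify-write: only correct while the lock is held.
        with open(counter_path) as f:
            value = int(f.read())
        with open(counter_path, "w") as f:
            f.write(str(value + 1))
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(create_fake_session, range(20)))

with open(counter_path) as f:
    sessions_created = int(f.read())
```

If the lock were missing, concurrent workers could read the same counter value and lose updates, so a final count below 20 would indicate a serialization failure.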


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:aedb881-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-aedb881-python \
  ghcr.io/openhands/agent-server:aedb881-python

All tags pushed for this build

ghcr.io/openhands/agent-server:aedb881-golang-amd64
ghcr.io/openhands/agent-server:aedb881-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:aedb881-golang-arm64
ghcr.io/openhands/agent-server:aedb881-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:aedb881-java-amd64
ghcr.io/openhands/agent-server:aedb881-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:aedb881-java-arm64
ghcr.io/openhands/agent-server:aedb881-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:aedb881-python-amd64
ghcr.io/openhands/agent-server:aedb881-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:aedb881-python-arm64
ghcr.io/openhands/agent-server:aedb881-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:aedb881-golang
ghcr.io/openhands/agent-server:aedb881-java
ghcr.io/openhands/agent-server:aedb881-python

About Multi-Architecture Support

  • Each variant tag (e.g., aedb881-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., aedb881-python-amd64) are also available if needed

When many parallel agents start simultaneously, they all attempt to
create tmux sessions at once. The libtmux library has race conditions
where new_session() can create a session but then fail to find it
immediately, throwing TmuxObjectDoesNotExist.

This fix adds a global threading.Lock to serialize tmux session
creation, eliminating the race condition.

Co-authored-by: openhands <[email protected]>
…nization

The threading.Lock() only works within a single process. When uvicorn
spawns multiple workers or when there are multiple Python processes,
each has its own lock.

This change uses fcntl.flock() on a shared lock file, which works
across processes since they all share the same filesystem.
…ialization

Use fcntl.flock() for cross-process synchronization of tmux session creation.
This prevents race conditions when multiple processes (e.g., uvicorn workers)
try to create tmux sessions simultaneously.

The file lock is combined with a threading lock to handle both:
- Multi-process concurrency (file lock)
- Multi-thread concurrency within same process (thread lock)
@neubig neubig force-pushed the fix/tmux-session-race-condition branch from 2b32a22 to 8e576fb Compare February 3, 2026 20:50
Serialize tmux session creation using fcntl.flock() to prevent race
conditions when multiple agents share the same Docker workspace.

The libtmux library has a race condition where new_session() can create
a session but then fail to find it immediately. This fix ensures only
one session is created at a time within the container.

Co-authored-by: openhands <[email protected]>
@neubig neubig force-pushed the fix/tmux-session-race-condition branch from 8e576fb to eb1b0a1 Compare February 3, 2026 20:57
@neubig neubig requested a review from xingyaoww February 3, 2026 20:58
@neubig neubig marked this pull request as ready for review February 3, 2026 20:58
@neubig neubig changed the title DRAFT: fix: serialize tmux session creation to prevent race conditions fix: serialize tmux session creation to prevent race conditions Feb 3, 2026
@github-actions
Contributor

github-actions bot commented Feb 3, 2026

Coverage

Coverage Report
File  Stmts  Miss  Cover  Missing
openhands-tools/openhands/tools/terminal/terminal
   tmux_terminal.py  113  31  72%  51, 58, 83–86, 90–91, 93, 134, 138, 142, 153, 164, 178, 190–197, 205–206, 208–209, 211, 213–215
TOTAL  17461  8743  49%

Collaborator

@all-hands-bot all-hands-bot left a comment

Good approach to fixing the race condition! The dual-lock strategy (thread + file) is sound. Found a few issues that should be addressed before merging.


# Lock for serializing tmux session creation to prevent race conditions in libtmux
# Uses both file lock (cross-process) and thread lock (same-process)
_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
Collaborator

🟠 Important: Lock file permissions could cause issues in multi-user environments.

If user A creates this file first, user B may not have permission to access it. Consider using a user-specific lock file path:

Suggested change
- _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
+ _TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getenv('USER', 'default')}.lock"

Alternatively, ensure the lock file has permissive permissions (0o666) when created.
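One caveat when relying on permissive permissions: the mode passed to `os.open()` is filtered through the process umask, so requesting 0o666 commonly yields 0o644. The sketch below applies the mode explicitly with `os.fchmod()`, assuming the caller owns the file. The helper name `open_shared_lock_file` is hypothetical:

```python
import os
import stat
import tempfile


def open_shared_lock_file(path):
    """Open a lock file and try to make it readable and writable by all users."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        # os.open() applies the umask to the requested mode, so set it explicitly.
        os.fchmod(fd, 0o666)
    except PermissionError:
        # Another user owns the file; keep whatever permissions it already has.
        pass
    return fd


demo_path = os.path.join(tempfile.mkdtemp(), "shared.lock")
fd = open_shared_lock_file(demo_path)
mode = stat.S_IMODE(os.fstat(fd).st_mode)
os.close(fd)
```

This does not help a second user who cannot open the file at all; a per-user lock path avoids that case entirely.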

Collaborator

^this could be real

Also if we use file-based lock, why do we need thread lock? Are both of them necessary at the same time?

Comment on lines 67 to 68
lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
try:
Collaborator

🟠 Important: Potential file descriptor leak if an exception occurs between os.open() and the try block.

Move the os.open() inside the try block to ensure the fd is always closed:

Suggested change
- lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
- try:
+ try:
+     lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)

for attempt in range(max_retries):
    # Serialize tmux session creation to prevent libtmux race conditions
    with _TMUX_THREAD_LOCK:
        lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
Collaborator

🟡 Suggestion: Consider setting explicit permissions on the lock file for better multi-user compatibility.

You could add mode=0o666 or use os.chmod() after creation to ensure all users can access the lock file.

Comment on lines 65 to 98
# Serialize tmux session creation to prevent libtmux race conditions
with _TMUX_THREAD_LOCK:
lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
try:
self.session = self.server.new_session(
session_name=session_name,
start_directory=self.work_dir,
kill_session=True,
x=1000,
y=1000,
)
break
except TmuxObjectDoesNotExist as e:
last_error = e
if attempt < max_retries - 1:
logger.warning(
f"Tmux session creation failed (attempt {attempt + 1}/"
f"{max_retries}), retrying in {retry_delay}s: {e}"
)
time.sleep(retry_delay)
retry_delay *= 2 # Exponential backoff
else:
raise RuntimeError(
f"Failed to create tmux session after {max_retries} attempts"
) from last_error
fcntl.flock(lock_fd, fcntl.LOCK_EX)
max_retries = 3
retry_delay = 0.5
last_error = None
for attempt in range(max_retries):
try:
self.session = self.server.new_session(
session_name=session_name,
start_directory=self.work_dir,
kill_session=True,
x=1000,
y=1000,
)
break
except TmuxObjectDoesNotExist as e:
last_error = e
if attempt < max_retries - 1:
logger.warning(
f"Tmux session creation failed (attempt {attempt + 1}/"
f"{max_retries}), retrying in {retry_delay}s: {e}"
)
time.sleep(retry_delay)
retry_delay *= 2
else:
raise RuntimeError(
f"Failed to create tmux session after {max_retries} attempts"
) from last_error
finally:
fcntl.flock(lock_fd, fcntl.LOCK_UN)
os.close(lock_fd)
Collaborator

🟡 Suggestion: The retry logic might be redundant now with proper locking in place.

The retries were originally added to work around the race condition that the locks now prevent. Consider removing the retry logic once this is proven stable in production, as it adds complexity and delay for genuine errors.

That said, keeping it as a defensive measure is reasonable if you want extra safety.

last_error = None
for attempt in range(max_retries):
    # Serialize tmux session creation to prevent libtmux race conditions
    with _TMUX_THREAD_LOCK:
Collaborator

🟢 Nit: Lock ordering is correct (thread lock before file lock). Good practice to avoid potential deadlocks.

@neubig
Contributor Author

neubig commented Feb 3, 2026

@OpenHands reflect all the review comments, and try to keep the overall PR changes as concise as possible

@openhands-ai

openhands-ai bot commented Feb 3, 2026

I'm on it! neubig can track my progress at all-hands.dev

- Move os.open() inside try block to prevent fd leak on exception
- Add mode=0o666 for multi-user compatibility
@openhands-ai

openhands-ai bot commented Feb 3, 2026

I've addressed the review comments on PR #1889 and pushed the changes. Here's a summary:

Review Comments Addressed

  1. ✅ Important: Potential file descriptor leak - Moved os.open() inside the try block and added lock_fd = None initialization to ensure the fd is always properly closed even if an exception occurs.

  2. ✅ Important: Lock file permissions for multi-user environments - Added mode=0o666 to the os.open() call to ensure all users can access the lock file.

  3. ℹ️ Suggestion: Retry logic redundancy - Kept the retry logic as a defensive measure per the reviewer's note that "keeping it as a defensive measure is reasonable."

  4. ✅ Nit: Lock ordering - No change needed; the reviewer confirmed the lock ordering (thread lock before file lock) is correct.

Changes Made

The commit adds only 5 lines and removes 3 lines, keeping the changes minimal and focused:

  • lock_fd = None initialization before try block
  • os.open() moved inside try block with 0o666 mode
  • Added if lock_fd is not None: check in finally block

The changes have been pushed to the fix/tmux-session-race-condition branch.

@enyst
Collaborator

enyst commented Feb 4, 2026

@OpenHands Do a /codereview-roasted on this PR. Do not exaggerate, investigate tmux and understand the problem. You could also try to replicate it in your environment.

Post your feedback as a comment on the PR. Note that it will be rendered as markdown.

@openhands-ai
Copy link

openhands-ai bot commented Feb 4, 2026

I'm on it! enyst can track my progress at all-hands.dev

Collaborator

enyst commented Feb 4, 2026

/codereview-roasted

Taste rating: 🟡 Acceptable — fixes a real race, but the locking is a bit heavy‑handed and could be safer.

Linus‑style analysis

[CRITICAL ISSUES]

  • None found. The core idea (serialize new_session() because libtmux races) is pragmatic and doesn’t break userspace.

[IMPROVEMENT OPPORTUNITIES]

  • [openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Lines 65–100] Lock scope is wider than needed: the file lock is held across retries and backoff sleeps. With three attempts and delays of 0.5s and 1.0s, one slow or failing creator can block every other session for up to ~1.5s of sleep time in total. Consider holding the lock only for the actual new_session() call (release before sleep, reacquire before retry) so other processes aren’t stalled unnecessarily.
  • [openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Line 69] FD inheritance risk: os.open(..., os.O_CREAT | os.O_RDWR) does not set O_CLOEXEC. If libtmux spawns subprocesses (it does), the lock fd can leak into children and keep the lock held longer than expected. Use os.O_CLOEXEC or os.set_inheritable(lock_fd, False) after open.
  • [openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Line 23] Global lock namespace: the lock file is global (/tmp/openhands-tmux-session.lock) regardless of tmux socket/server. If we ever run multiple tmux servers in one container (custom socket names, tests, or user isolation), this serializes unrelated servers. Consider namespacing by tmux socket path or username to avoid unnecessary contention.
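The narrower lock scope suggested in the first point could look roughly like the sketch below. It is not the PR's code: the function name `create_with_narrow_lock`, the module-level constants, and the use of `LookupError` as a stand-in for libtmux's `TmuxObjectDoesNotExist` are all assumptions made to keep the example self-contained.

```python
import fcntl
import os
import tempfile
import time

MAX_RETRIES = 3
INITIAL_RETRY_DELAY = 0.5  # seconds; doubled after each failed attempt


def create_with_narrow_lock(lock_path, create_session, sleep=time.sleep):
    """Call create_session() under the file lock, sleeping only after releasing it."""
    delay = INITIAL_RETRY_DELAY
    last_error = None
    for attempt in range(MAX_RETRIES):
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            return create_session()
        except LookupError as e:  # stand-in for TmuxObjectDoesNotExist
            last_error = e
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
        if attempt < MAX_RETRIES - 1:
            sleep(delay)  # the lock is no longer held here
            delay *= 2
    raise RuntimeError(
        f"Failed to create tmux session after {MAX_RETRIES} attempts"
    ) from last_error


# Hypothetical creator that fails twice, then succeeds.
attempts = []
sleeps = []


def flaky_create():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("Could not find object")
    return "session-ok"


result = create_with_narrow_lock(
    os.path.join(tempfile.mkdtemp(), "narrow.lock"), flaky_create, sleep=sleeps.append
)
```

The trade-off is that other waiters may grab the lock during the backoff, which improves throughput but gives up the "one creator at a time until it finishes" ordering that holding the lock across retries provides.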

[STYLE NOTES]

  • [openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Lines 21–24] The comment is fine, but the behavior would be clearer if the retry constants were lifted to module constants alongside the lock name for visibility.

Repro note
I tried to reproduce the libtmux race locally by spawning 20–60 parallel new_session() calls via multiprocessing and didn’t hit failures. That doesn’t disprove the issue — just means it’s load/latency sensitive and likely depends on the container’s tmux server state.

Verdict: Worth merging, but please consider tightening the lock scope and preventing FD inheritance to avoid hidden stalls.

Key insight: This is the right fix direction, but the lock should be as small and as safe as possible — otherwise we trade a rare race for steady contention.
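On the FD inheritance point, it may be worth checking the actual behavior before changing the code: since Python 3.4 (PEP 446), descriptors returned by `os.open()` are documented as non-inheritable, meaning they are closed in child processes across exec. A small check, with an illustrative temporary path:

```python
import os
import tempfile

lock_path = os.path.join(tempfile.mkdtemp(), "inherit.lock")
fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)

# Since Python 3.4 (PEP 446), os.open() returns a non-inheritable descriptor.
inheritable_by_default = os.get_inheritable(fd)

# Making the intent explicit is harmless and documents the requirement.
os.set_inheritable(fd, False)
inheritable_after = os.get_inheritable(fd)

os.close(fd)
```

A child created by a plain `fork()` without exec would still share the descriptor, so this only covers subprocesses that are launched through exec.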

@openhands-ai

This comment was marked as duplicate.

@xingyaoww
Collaborator

@OpenHands do a /codereview-roasted and consider the comment I raised: #1889 (comment)

@openhands-ai

openhands-ai bot commented Feb 4, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

# Lock for serializing tmux session creation to prevent race conditions in libtmux
# Uses both file lock (cross-process) and thread lock (same-process)
_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
_TMUX_THREAD_LOCK = threading.Lock()
Collaborator

/codereview-roasted

Taste rating: 🟡 Acceptable — fixes a real race condition, but the locking mechanism is over-engineered.

Linus-Style Analysis

[CRITICAL ISSUES]

  • None found. The core problem (libtmux race between new-session and list-sessions) is real and the fix is pragmatic.

[IMPROVEMENT OPPORTUNITIES]

1. [tmux_terminal.py, Lines 24, 66] The thread lock is redundant

xingyaoww's question is spot-on. According to flock(2) semantics:

"An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another file descriptor."

This means if thread A holds an exclusive flock() on fd1, thread B trying to acquire an exclusive flock() on fd2 (same file, separately opened descriptor) will block. Because each session creation opens its own descriptor, the file lock alone handles both inter-process AND intra-process (cross-thread) synchronization.
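This behavior can be checked within a single process by opening the same file twice and requesting a non-blocking lock on the second descriptor. The sketch assumes a local filesystem on Linux; the path is illustrative:

```python
import fcntl
import os
import tempfile

lock_path = os.path.join(tempfile.mkdtemp(), "flock-demo.lock")
fd_a = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)
fd_b = os.open(lock_path, os.O_RDWR)  # second, independent open of the same file

fcntl.flock(fd_a, fcntl.LOCK_EX)
try:
    # LOCK_NB makes the call fail immediately instead of blocking.
    fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_lock_denied = False
except BlockingIOError:
    second_lock_denied = True
finally:
    fcntl.flock(fd_a, fcntl.LOCK_UN)

# Once the first lock is released, the second descriptor can take it.
fcntl.flock(fd_b, fcntl.LOCK_EX | fcntl.LOCK_NB)
fcntl.flock(fd_b, fcntl.LOCK_UN)
os.close(fd_a)
os.close(fd_b)
```

The guarantee depends on separate `open()` calls. Descriptors duplicated with `dup()` or inherited through `fork()` refer to the same open file description and therefore share one lock.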

The thread lock adds:

  • Extra complexity (nested locking)
  • A module-level global that could cause issues with multiprocessing (fork inherits the lock state)
  • No actual benefit

Suggested simplification:

# Remove _TMUX_THREAD_LOCK entirely
# Just use the file lock:
lock_fd = None
try:
    lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR, 0o666)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    # ... session creation logic ...
finally:
    if lock_fd is not None:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)

2. [tmux_terminal.py, Line 23] Lock file path could conflict across users

The previous comment about multi-user environments is valid. While mode=0o666 helps with permissions, a shared /tmp/openhands-tmux-session.lock means:

  • User A's agent and User B's agent serialize against each other unnecessarily
  • They're using different tmux servers (different UIDs), so there's no actual race between them

Consider namespacing by UID:

_TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getuid()}.lock"
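If namespacing by tmux socket is also wanted (as raised in the earlier review), one possible shape is sketched below. The function name `tmux_lock_path`, its parameters, and the example socket path are hypothetical; only the UID-based filename comes from the suggestion above:

```python
import hashlib
import os
import tempfile


def tmux_lock_path(socket_path=None, base_dir=None):
    """Build a per-user lock path, optionally distinguished by tmux socket."""
    base_dir = base_dir or tempfile.gettempdir()
    name = f"openhands-tmux-session-{os.getuid()}"
    if socket_path:
        # A short stable digest keeps the filename safe for arbitrary socket paths.
        digest = hashlib.sha256(socket_path.encode()).hexdigest()[:8]
        name += f"-{digest}"
    return os.path.join(base_dir, name + ".lock")


default_path = tmux_lock_path(base_dir="/tmp")
custom_path = tmux_lock_path("/tmp/tmux-1000/custom", base_dir="/tmp")
```

With no socket argument the result matches the UID-only path, so callers that use the default tmux server keep contending on a single lock.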

3. [tmux_terminal.py, Lines 71-96] Lock scope is wider than necessary

The lock is held during retry sleeps (0.5s, then 1s with three attempts). If the race condition is truly in libtmux's new_session() → from_session_id() sequence, you only need to serialize the new_session() call itself, not the retries.

However, I'd argue this is acceptable because:

  • Holding the lock during retries prevents thundering herd
  • If one process is failing, others waiting is probably the right behavior
  • The retry logic might be unnecessary entirely if the lock works

[STYLE NOTES]

4. Consider whether retries are still needed

With proper serialization, the race condition should be eliminated. The retry logic might be dead code now. Consider:

  • Keeping it as defense-in-depth (acceptable)
  • Removing it to simplify (preferred if you trust the lock)
  • Adding a comment explaining why it's kept

VERDICT

Worth merging with minor changes: The fix addresses a real production issue. Suggested improvements:

  1. Remove the thread lock — it's redundant and adds complexity
  2. Namespace the lock file by UID — prevents unnecessary cross-user serialization

KEY INSIGHT

The thread lock is solving a problem that flock() already solves. When in doubt, trust the kernel's locking primitives over userspace reimplementations.

Collaborator

@OpenHands please do these:

Remove the thread lock — it's redundant and adds complexity
Namespace the lock file by UID — prevents unnecessary cross-user serialization

I'm on it! xingyaoww can track my progress at all-hands.dev

I've addressed both requested changes and pushed them to the PR branch:

Changes Made

Removed the thread lock - Removed threading.Lock() and the with _TMUX_THREAD_LOCK: wrapper since fcntl.flock() already provides both cross-process AND intra-process synchronization (per flock(2) semantics).

Namespaced the lock file by UID - Changed the lock file path from /tmp/openhands-tmux-session.lock to /tmp/openhands-tmux-session-{os.getuid()}.lock to prevent unnecessary cross-user serialization.

Summary of Changes

  • Removed import threading
  • Removed _TMUX_THREAD_LOCK = threading.Lock()
  • Changed _TMUX_LOCK_FILE to include UID: f"/tmp/openhands-tmux-session-{os.getuid()}.lock"
  • Removed the with _TMUX_THREAD_LOCK: wrapper, keeping only the file lock

The changes have been pushed to the fix/tmux-session-race-condition branch and will update PR #1889.

@openhands-ai

openhands-ai bot commented Feb 4, 2026

I've completed the /codereview-roasted for PR #1889, addressing xingyaoww's comment about whether both the thread lock and file lock are necessary.

Summary

I posted a detailed code review comment on the PR at: #1889 (comment)

Key Findings

Taste Rating: 🟡 Acceptable — The fix addresses a real race condition in libtmux, but the locking mechanism is over-engineered.

Main Points:

  1. The thread lock is redundant - xingyaoww's question is correct. According to flock(2) semantics, file locks block even when different threads in the same process open separate file descriptors to the same file. The file lock alone handles both inter-process AND intra-process (cross-thread) synchronization.

  2. Lock file path should be namespaced by UID - A shared /tmp/openhands-tmux-session.lock causes unnecessary serialization between different users who are using different tmux servers anyway.

  3. Lock scope during retries is acceptable - While the lock is held during retry sleeps, this prevents thundering herd and is reasonable behavior.

Recommendations:

  • Remove _TMUX_THREAD_LOCK entirely (it adds complexity with no benefit)
  • Change lock file to f"/tmp/openhands-tmux-session-{os.getuid()}.lock"

Verdict: ✅ Worth merging with the suggested minor simplifications.

No code changes were made to the branch - this was a review-only task as requested.

- Remove redundant thread lock (fcntl.flock provides both cross-process
  and intra-process synchronization)
- Namespace lock file by UID to prevent unnecessary cross-user serialization
Collaborator

@xingyaoww xingyaoww left a comment

This version LGTM (I asked OH to make some changes) - @neubig feel free to merge if it also looks good to you

@xingyaoww xingyaoww added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026
@github-actions
Contributor

github-actions bot commented Feb 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-04 15:10:40 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 24.2s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 21.7s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 9.4s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 30.6s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 12.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 24.8s $0.02
01_standalone_sdk/11_async.py ✅ PASS 34.6s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 13.9s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.2s $0.02
01_standalone_sdk/14_context_condenser.py ✅ PASS 4m 14s $0.53
01_standalone_sdk/17_image_input.py ✅ PASS 15.0s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 26.0s $0.02
01_standalone_sdk/19_llm_routing.py ✅ PASS 11.3s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 13.5s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 8.8s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 16.8s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 52.1s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 15s $0.38
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.56
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 23.7s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 39.8s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 57.7s $0.02
01_standalone_sdk/30_tom_agent.py ❌ FAIL (Exit code 1) 2.4s --
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 4m 10s $0.31
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 15.7s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 9.7s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 40.4s $0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 40.8s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 27.5s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 5m 11s --
02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 12.7s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 45.0s $0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 39s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 26.0s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 9s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 8.9s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 5.4s $0.01

❌ Some tests failed

Total: 37 | Passed: 30 | Failed: 7 | Total Cost: $2.34

Failed examples:

  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

@xingyaoww
Collaborator

@OpenHands please investigate #1889 (comment) and tell me WHY the example failed for 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py?

@openhands-ai

openhands-ai bot commented Feb 4, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Feb 4, 2026

Why the example 02_convo_with_docker_sandboxed_server.py failed

The failure is NOT related to the tmux session race condition fix in this PR. It's a schema mismatch between the SDK client and the agent server.

Root Cause

The error from the logs is:

HTTP request failed (422 Unprocessable Content): {'detail': [
  {'type': 'extra_forbidden', 'loc': ['body', 'content', 0, 'TextContent', 'enable_truncation'], 
   'msg': 'Extra inputs are not permitted', 'input': True},
  ...
]}

What's happening:

  1. The example script uses a hardcoded image: ghcr.io/openhands/agent-server:latest-python (line 50)
  2. This latest-python tag points to the latest released version of the agent server
  3. The SDK code being tested is from the PR branch, which includes a new field enable_truncation in TextContent
  4. When the SDK client sends a message with enable_truncation: True, the older agent server rejects it as an unknown field (HTTP 422)

Why other examples work

  • 04_convo_with_api_sandboxed_server.py uses GITHUB_SHA to dynamically select the PR-specific image:
    server_image_sha = os.getenv("GITHUB_SHA") or "main"
    server_image = f"ghcr.io/openhands/agent-server:{server_image_sha[:7]}-python-amd64"
  • 07_convo_with_cloud_workspace.py passed because it uses a cloud workspace that likely has the updated server

The fix needed (not in this PR's scope)

The Docker-based examples (02_convo_with_docker_sandboxed_server.py, 03_browser_use_with_docker_sandboxed_server.py, 05_vscode_with_docker_sandboxed_server.py, 08_convo_with_apptainer_sandboxed_server.py) should be updated to use the PR-specific image tag similar to how 04_convo_with_api_sandboxed_server.py does it, or use an environment variable to override the image.
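A rough sketch of what such image selection could look like. The function name `pick_server_image`, the `AGENT_SERVER_IMAGE` override variable, the fallback to `latest-python`, and the made-up SHA in the usage lines are assumptions for illustration; only the tag pattern is taken from the snippet quoted above:

```python
import os


def pick_server_image(env=None):
    """Choose the agent-server image: explicit override, PR build, or latest release."""
    env = os.environ if env is None else env
    override = env.get("AGENT_SERVER_IMAGE")  # hypothetical override variable
    if override:
        return override
    sha = env.get("GITHUB_SHA")
    if sha:
        # Same tag pattern as quoted from 04_convo_with_api_sandboxed_server.py.
        return f"ghcr.io/openhands/agent-server:{sha[:7]}-python-amd64"
    return "ghcr.io/openhands/agent-server:latest-python"


# The SHA below is a placeholder, not a real commit.
ci_image = pick_server_image({"GITHUB_SHA": "aedb881deadbeef"})
local_image = pick_server_image({})
```

Passing the environment as a dictionary keeps the selection logic testable without touching the real process environment.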

Summary

This is a pre-existing test infrastructure issue where some examples use latest-python instead of the PR-specific image tag. The tmux race condition fix in this PR is unrelated to this failure.

xingyaoww and others added 2 commits February 4, 2026 23:18
When running in CI (GITHUB_SHA is set), use the PR-specific agent-server
image tag to ensure the SDK client and server are in sync. This prevents
schema mismatch errors like 'enable_truncation' field not being recognized
by the server.

The examples now use the same pattern as 04_convo_with_api_sandboxed_server.py.

Co-authored-by: openhands <[email protected]>
@xingyaoww xingyaoww removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026
@xingyaoww xingyaoww added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026 — with OpenHands AI
@github-actions
Contributor

github-actions bot commented Feb 4, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-02-04 15:46:02 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 27.6s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 40.8s $0.03
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 10.5s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 30.7s $0.03
01_standalone_sdk/09_pause_example.py ✅ PASS 19.4s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 31.6s $0.04
01_standalone_sdk/11_async.py ✅ PASS 34.8s $0.04
01_standalone_sdk/12_custom_secrets.py ✅ PASS 15.7s $0.02
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 22.6s $0.03
01_standalone_sdk/14_context_condenser.py ❌ FAIL (Timed out after 600 seconds) 10m 0s --
01_standalone_sdk/17_image_input.py ✅ PASS 12.7s $0.02
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 17.1s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 16.5s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 11.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 10.7s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 24.6s $0.02
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 18s $0.02
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 5m 31s $0.45
01_standalone_sdk/25_agent_delegation.py ❌ FAIL (Timed out after 600 seconds) 10m 0s $0.49
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 19.9s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 26.5s $0.03
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.8s $0.03
01_standalone_sdk/30_tom_agent.py ❌ FAIL (Exit code 1) 2.1s --
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 3m 46s $0.29
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 17.4s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 8.1s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 53.3s $0.05
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 39.1s --
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 12.7s --
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ❌ FAIL (Exit code 1) 1m 51s --
02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py ❌ FAIL (Exit code 1) 16.3s --
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 28.1s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL (Exit code 1) 2m 42s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 25.9s $0.02
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 55.9s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 14.6s $0.02
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 6.5s $0.01

❌ Some tests failed

Total: 37 | Passed: 29 | Failed: 8 | Total Cost: $1.84

Failed examples:

  • examples/01_standalone_sdk/14_context_condenser.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
  • examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
  • examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1


openhands-ai bot commented Feb 4, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1889 at branch `fix/tmux-session-race-condition`

Feel free to include any additional details that might help me get this PR into a better state.

Collaborator

Investigation: Why 02_convo_with_docker_sandboxed_server.py Failed

Root Cause Analysis

The failure is NOT related to the schema mismatch issue I fixed earlier. After my fix, the examples now correctly use the PR-specific agent-server image (ghcr.io/openhands/agent-server:aedb881-python-amd64).

The actual failure is the tmux race condition that this PR is supposed to fix:

[DOCKER] Tmux session creation failed (attempt 1/3), retrying in 0.5s: Could not find object
[DOCKER] Tmux session creation failed (attempt 2/3), retrying in 1.0s: Could not find object
[DOCKER] RuntimeError: Failed to create tmux session after 3 attempts
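
The retry pattern visible in these log lines (three attempts, waiting 0.5s and then 1.0s) can be sketched roughly as follows. This is a minimal illustration and not the SDK's actual code: `retry_with_backoff`, `create`, and the default values are hypothetical names chosen only to mirror the log output.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    create: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `create` up to `attempts` times, doubling the delay after each failure.

    Hypothetical helper: names and defaults only mirror the log lines above.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except Exception as exc:  # a real caller would catch a narrower type
            if attempt == attempts:
                raise RuntimeError(
                    f"Failed to create tmux session after {attempts} attempts"
                ) from exc
            print(
                f"Tmux session creation failed (attempt {attempt}/{attempts}), "
                f"retrying in {delay}s: {exc}"
            )
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable when attempts >= 1")
```

A loop like this only helps with transient errors. If the underlying failure is deterministic, every attempt fails the same way and the delays just postpone the final `RuntimeError`.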

Key Observations

  1. The file-based locking IS in the image - The PR's changes are included in the merge commit aedb881

  2. The race condition is happening within a single session creation - There's only ONE conversation being created, but all 3 retry attempts fail. This suggests the issue is not cross-process contention (which the file lock addresses), but rather an issue within libtmux itself or the Docker environment.

  3. Local agent server works, Docker fails:

    • 01_convo_with_local_agent_server.py - ✅ PASSED
    • 02_convo_with_docker_sandboxed_server.py - ❌ FAILED
  4. Timing analysis from logs:

    • 15:29:36,632 - TmuxTerminal detected
    • 15:29:36,642 - Attempt 1 fails (10ms later)
    • 15:29:37,153 - Attempt 2 fails (511ms later)
    • 15:29:38,164 - Attempt 3 fails (1011ms later)

    All 3 attempts fail consistently, suggesting a fundamental issue rather than a timing race.
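
The gaps quoted in the timing list can be checked with a few lines of Python. The timestamps are copied from the list above. The helper name `gaps_ms` is an illustrative assumption.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timing analysis above (HH:MM:SS,mmm).
STAMPS = ["15:29:36,632", "15:29:36,642", "15:29:37,153", "15:29:38,164"]


def gaps_ms(stamps: list[str]) -> list[int]:
    """Return the gap in whole milliseconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S,%f") for s in stamps]
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(times, times[1:])]


print(gaps_ms(STAMPS))  # [10, 511, 1011]
```

If the 0.5s and 1.0s retry delays from the log are subtracted from the last two gaps, each attempt appears to fail about 10 to 11 ms after it starts, which is consistent with a deterministic failure rather than a timing-dependent one.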

Hypotheses

  1. Docker-specific tmux issue: The tmux server in the Docker container might not be fully initialized or might have different behavior than on the host.

  2. libtmux internal race: The new_session() method creates a session then immediately queries for it. Even with external locking, this internal race can still occur.

  3. Environment differences: Something in the Docker container's environment might be affecting tmux behavior.

Next Steps

I'm going to:

  1. Pull the Docker image locally and investigate tmux behavior inside the container
  2. Review git commits since the 1.10.0 release to identify any changes that could have introduced this issue

Collaborator

🔍 Root Cause Found: Zombie Processes in Docker Container

I've reproduced and identified the root cause of the tmux session creation failures.

The Problem

When running the agent-server in Docker, tmux server processes become zombies and are not properly reaped:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
openhan+       1  0.9  0.0   2856  1920 ?        Ss   15:57   0:00 /usr/local/bin/openhands-agent-server --host 0.0.0.0 --port 8000
openhan+      90  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
openhan+     100  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>

Why This Happens

  1. PID 1 is the agent-server binary - In Docker, the main process runs as PID 1
  2. PID 1 has special responsibilities - It must handle SIGCHLD signals to reap zombie child processes
  3. The agent-server binary doesn't reap children - When tmux spawns server processes, they become zombies after exiting
  4. libtmux race condition - When new_session() creates a session, the tmux server process may exit and become a zombie before libtmux can query it, causing TmuxObjectDoesNotExist

Evidence

  • The tmux session IS created (visible via tmux list-sessions)
  • But libtmux can't find it immediately after creation
  • Multiple zombie [tmux: server] <defunct> processes accumulate
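
As a rough aid for spotting this state in `ps aux` output, the sketch below picks out rows whose STAT column starts with `Z`. The function name `find_zombies` is hypothetical, and the sample text is the process listing quoted earlier in this comment.

```python
def find_zombies(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every row whose STAT column starts with 'Z'."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")
    zombies = []
    for line in lines[1:]:
        # COMMAND may contain spaces, so stop splitting at the last column.
        fields = line.split(None, cmd_col)
        if len(fields) > cmd_col and fields[stat_col].startswith("Z"):
            zombies.append((int(fields[pid_col]), fields[cmd_col].rstrip()))
    return zombies


SAMPLE = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
openhan+       1  0.9  0.0   2856  1920 ?        Ss   15:57   0:00 /usr/local/bin/openhands-agent-server --host 0.0.0.0 --port 8000
openhan+      90  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
openhan+     100  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
"""

for pid, command in find_zombies(SAMPLE):
    print(pid, command)
```

Run against the sample, this reports PIDs 90 and 100, the two defunct tmux server processes.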

Solution Options

  1. Use the --init flag when running Docker - This adds a proper init process (tini) as PID 1:

    docker run --init -d ... ghcr.io/openhands/agent-server:...
  2. Add tini to the Docker image - Include tini in the Dockerfile and use it as the entrypoint

  3. Handle SIGCHLD in the agent-server - Add signal handling to reap zombie processes
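
For option 3, a minimal sketch of SIGCHLD-based reaping in Python is shown below. It assumes the server process can install a signal handler from its main thread. `reap_children` and `install_reaper` are hypothetical names, not part of the agent-server.

```python
import os
import signal


def reap_children(signum=None, frame=None) -> list[int]:
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append(pid)
    return reaped


def install_reaper() -> None:
    """Run reap_children whenever a child exits (call from the main thread)."""
    signal.signal(signal.SIGCHLD, reap_children)
```

One caveat with this approach: `os.waitpid(-1, ...)` collects any exited child, so it can interfere with code in the same process that waits on its own children (for example via `subprocess`). Running a dedicated init such as tini as PID 1 avoids that coupling.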

Immediate Fix for Examples

The Docker-based examples should use the --init flag. I'll update the DockerWorkspace to include this.

Note

This is NOT a bug in the tmux locking code in this PR. The file-based locking is working correctly. The issue is that the Docker container environment doesn't properly handle child process cleanup, which causes the libtmux race condition to manifest even for single session creation.

@xingyaoww
Collaborator

Seems like this is not the root fix, but I think this is a net improvement, so merging it.

@xingyaoww xingyaoww merged commit 20291bd into main Feb 4, 2026
47 of 49 checks passed
@xingyaoww xingyaoww deleted the fix/tmux-session-race-condition branch February 4, 2026 16:01

Labels

test-examples Run all applicable "examples/" files. Expensive operation.

5 participants