fix: serialize tmux session creation to prevent race conditions by neubig · Pull Request #1889 · OpenHands/software-agent-sdk

neubig · 2026-02-03T18:25:32Z

Problem

When multiple parallel agents share the same Docker workspace (same container), they all attempt to create tmux sessions on the same tmux server simultaneously. The libtmux library has a race condition where new_session() creates a session but then fails to find it immediately, throwing TmuxObjectDoesNotExist.

This causes agents to fail with:

Failed to create tmux session after 3 attempts

Solution

Use file-based locking (fcntl.flock()) to serialize tmux session creation within the container. This ensures only one session is created at a time, eliminating the race condition.
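The approach can be sketched as a small context manager. This is a minimal illustration rather than the PR's code: `LOCK_PATH`, `session_creation_lock`, and `create_session_serialized` are hypothetical names, and the callable passed in stands in for the libtmux `new_session()` call.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical path for this sketch; the PR uses a fixed file under /tmp.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-tmux-session.lock")


@contextlib.contextmanager
def session_creation_lock(path: str = LOCK_PATH):
    """Hold an exclusive flock() on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # waits until no other holder remains
        yield fd
    finally:
        # Closing the descriptor would also release the lock; unlocking
        # first just makes the intent explicit.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def create_session_serialized(create_session):
    """Run `create_session` (any zero-argument callable) under the lock."""
    with session_creation_lock():
        return create_session()
```

Every process in the container that opens the same path queues on the same lock, so at most one caller is inside `create_session()` at a time.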

Changes

Added _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
Added _TMUX_THREAD_LOCK = threading.Lock() for intra-process safety
Wrapped session creation with both locks (file lock for cross-process, thread lock for same-process)

Testing

Tested with 20 parallel conversations in a single Docker workspace - all tmux sessions created successfully (0 failures).

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:aedb881-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-aedb881-python \
  ghcr.io/openhands/agent-server:aedb881-python

All tags pushed for this build

ghcr.io/openhands/agent-server:aedb881-golang-amd64
ghcr.io/openhands/agent-server:aedb881-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:aedb881-golang-arm64
ghcr.io/openhands/agent-server:aedb881-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:aedb881-java-amd64
ghcr.io/openhands/agent-server:aedb881-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:aedb881-java-arm64
ghcr.io/openhands/agent-server:aedb881-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:aedb881-python-amd64
ghcr.io/openhands/agent-server:aedb881-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:aedb881-python-arm64
ghcr.io/openhands/agent-server:aedb881-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:aedb881-golang
ghcr.io/openhands/agent-server:aedb881-java
ghcr.io/openhands/agent-server:aedb881-python

About Multi-Architecture Support

Each variant tag (e.g., aedb881-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., aedb881-python-amd64) are also available if needed

When many parallel agents start simultaneously, they all attempt to create tmux sessions at once. The libtmux library has race conditions where new_session() can create a session but then fail to find it immediately, throwing TmuxObjectDoesNotExist. This fix adds a global threading.Lock to serialize tmux session creation, eliminating the race condition. Co-authored-by: openhands <[email protected]>

…nization The threading.Lock() only works within a single process. When uvicorn spawns multiple workers or when there are multiple Python processes, each has its own lock. This change uses fcntl.flock() on a shared lock file, which works across processes since they all share the same filesystem.

…ialization

Use fcntl.flock() for cross-process synchronization of tmux session creation. This prevents race conditions when multiple processes (e.g., uvicorn workers) try to create tmux sessions simultaneously. The file lock is combined with a threading lock to handle both:
- Multi-process concurrency (file lock)
- Multi-thread concurrency within same process (thread lock)

Serialize tmux session creation using fcntl.flock() to prevent race conditions when multiple agents share the same Docker workspace. The libtmux library has a race condition where new_session() can create a session but then fail to find it immediately. This fix ensures only one session is created at a time within the container. Co-authored-by: openhands <[email protected]>

github-actions · 2026-02-03T21:00:27Z

Coverage Report

File	Stmts	Miss	Cover	Missing
openhands-tools/openhands/tools/terminal/terminal
tmux_terminal.py	113	31	72%	51, 58, 83–86, 90–91, 93, 134, 138, 142, 153, 164, 178, 190–197, 205–206, 208–209, 211, 213–215
TOTAL	17461	8743	49%

all-hands-bot

Good approach to fixing the race condition! The dual-lock strategy (thread + file) is sound. Found a few issues that should be addressed before merging.

all-hands-bot · 2026-02-03T21:02:01Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py


+# Lock for serializing tmux session creation to prevent race conditions in libtmux
+# Uses both file lock (cross-process) and thread lock (same-process)
+_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"


🟠 Important: Lock file permissions could cause issues in multi-user environments.

If user A creates this file first, user B may not have permission to access it. Consider using a user-specific lock file path:

Suggested change

-_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
+_TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getenv('USER', 'default')}.lock"

Alternatively, ensure the lock file has permissive permissions (0o666) when created.

^this could be real

Also if we use file-based lock, why do we need thread lock? Are both of them necessary at the same time?

all-hands-bot · 2026-02-03T21:02:01Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py

+            lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
            try:


🟠 Important: Potential file descriptor leak if an exception occurs between os.open() and the try block.

Move the os.open() inside the try block to ensure the fd is always closed:

Suggested change

-            lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
-            try:
+            try:
+                lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)

all-hands-bot · 2026-02-03T21:02:01Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py

-        for attempt in range(max_retries):
+        # Serialize tmux session creation to prevent libtmux race conditions
+        with _TMUX_THREAD_LOCK:
+            lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)


🟡 Suggestion: Consider setting explicit permissions on the lock file for better multi-user compatibility.

You could add mode=0o666 or use os.chmod() after creation to ensure all users can access the lock file.
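One caveat on `mode=0o666`: the mode passed to `os.open()` is filtered through the process umask, so under a common umask of `0o022` the file is created as `0o644`. The sketch below uses a throwaway file (not the real lock path) to show the difference between relying on the mode argument and following up with `os.chmod()`.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.lock")
    old_umask = os.umask(0o022)  # force a typical umask for the demonstration
    try:
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
        os.close(fd)
        # The requested 0o666 has been masked by the umask at creation time.
        mode_after_open = stat.S_IMODE(os.stat(path).st_mode)

        # chmod() is not subject to the umask, so this sets the exact bits.
        os.chmod(path, 0o666)
        mode_after_chmod = stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(old_umask)  # restore the caller's umask
```

Only the file's owner (or root) may `chmod()` it, so in a multi-user setup the first creator would have to widen the permissions.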

all-hands-bot · 2026-02-03T21:02:01Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py

+        # Serialize tmux session creation to prevent libtmux race conditions
+        with _TMUX_THREAD_LOCK:
+            lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
            try:
-                self.session = self.server.new_session(
-                    session_name=session_name,
-                    start_directory=self.work_dir,
-                    kill_session=True,
-                    x=1000,
-                    y=1000,
-                )
-                break
-            except TmuxObjectDoesNotExist as e:
-                last_error = e
-                if attempt < max_retries - 1:
-                    logger.warning(
-                        f"Tmux session creation failed (attempt {attempt + 1}/"
-                        f"{max_retries}), retrying in {retry_delay}s: {e}"
-                    )
-                    time.sleep(retry_delay)
-                    retry_delay *= 2  # Exponential backoff
-        else:
-            raise RuntimeError(
-                f"Failed to create tmux session after {max_retries} attempts"
-            ) from last_error
+                fcntl.flock(lock_fd, fcntl.LOCK_EX)
+                max_retries = 3
+                retry_delay = 0.5
+                last_error = None
+                for attempt in range(max_retries):
+                    try:
+                        self.session = self.server.new_session(
+                            session_name=session_name,
+                            start_directory=self.work_dir,
+                            kill_session=True,
+                            x=1000,
+                            y=1000,
+                        )
+                        break
+                    except TmuxObjectDoesNotExist as e:
+                        last_error = e
+                        if attempt < max_retries - 1:
+                            logger.warning(
+                                f"Tmux session creation failed (attempt {attempt + 1}/"
+                                f"{max_retries}), retrying in {retry_delay}s: {e}"
+                            )
+                            time.sleep(retry_delay)
+                            retry_delay *= 2
+                else:
+                    raise RuntimeError(
+                        f"Failed to create tmux session after {max_retries} attempts"
+                    ) from last_error
+            finally:
+                fcntl.flock(lock_fd, fcntl.LOCK_UN)
+                os.close(lock_fd)


🟡 Suggestion: The retry logic might be redundant now with proper locking in place.

The retries were originally added to work around the race condition that the locks now prevent. Consider removing the retry logic once this is proven stable in production, as it adds complexity and delay for genuine errors.

That said, keeping it as a defensive measure is reasonable if you want extra safety.

all-hands-bot · 2026-02-03T21:02:01Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py

-        last_error = None
-        for attempt in range(max_retries):
+        # Serialize tmux session creation to prevent libtmux race conditions
+        with _TMUX_THREAD_LOCK:


🟢 Nit: Lock ordering is correct (thread lock before file lock). Good practice to avoid potential deadlocks.

neubig · 2026-02-03T22:21:53Z

@OpenHands reflect all the review comments, and try to keep the overall PR changes as concise as possible

openhands-ai · 2026-02-03T22:22:02Z

I'm on it! neubig can track my progress at all-hands.dev

- Move os.open() inside try block to prevent fd leak on exception
- Add mode=0o666 for multi-user compatibility

openhands-ai · 2026-02-03T22:24:07Z

I've addressed the review comments on PR #1889 and pushed the changes. Here's a summary:

Review Comments Addressed

✅ Important: Potential file descriptor leak - Moved os.open() inside the try block and added lock_fd = None initialization to ensure the fd is always properly closed even if an exception occurs.
✅ Important: Lock file permissions for multi-user environments - Added mode=0o666 to the os.open() call to ensure all users can access the lock file.
ℹ️ Suggestion: Retry logic redundancy - Kept the retry logic as a defensive measure per the reviewer's note that "keeping it as a defensive measure is reasonable."
✅ Nit: Lock ordering - No change needed; the reviewer confirmed the lock ordering (thread lock before file lock) is correct.

Changes Made

The commit adds only 5 lines and removes 3 lines, keeping the changes minimal and focused:

lock_fd = None initialization before try block
os.open() moved inside try block with 0o666 mode
Added if lock_fd is not None: check in finally block

The changes have been pushed to the fix/tmux-session-race-condition branch.


enyst · 2026-02-04T11:09:18Z

@OpenHands Do a /codereview-roasted on this PR. Do not exaggerate, investigate tmux and understand the problem. You could also try to replicate it in your environment.

Post your feedback as a comment on the PR. Note that it will be rendered as markdown.

openhands-ai · 2026-02-04T11:09:27Z

I'm on it! enyst can track my progress at all-hands.dev

enyst · 2026-02-04T11:12:57Z

/codereview-roasted

Taste rating: 🟡 Acceptable — fixes a real race, but the locking is a bit heavy‑handed and could be safer.

Linus‑style analysis

[CRITICAL ISSUES]

None found. The core idea (serialize new_session() because libtmux races) is pragmatic and doesn’t break userspace.

[IMPROVEMENT OPPORTUNITIES]

[openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Lines 65–100] Lock scope is wider than needed: the file lock is held across retries and backoff sleeps. That means one slow or failing creator can block every other session for up to ~1.5s of backoff in total (0.5s + 1.0s), plus the time spent in the attempts themselves. Consider only holding the lock for the actual new_session() call (release before sleep, reacquire before retry) so other processes aren’t stalled unnecessarily.
[openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Line 69] FD inheritance risk: os.open(..., os.O_CREAT | os.O_RDWR) does not set O_CLOEXEC. If libtmux spawns subprocesses (it does), the lock fd can leak into children and keep the lock held longer than expected. Use os.O_CLOEXEC or os.set_inheritable(lock_fd, False) after open.
[openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Line 23] Global lock namespace: the lock file is global (/tmp/openhands-tmux-session.lock) regardless of tmux socket/server. If we ever run multiple tmux servers in one container (custom socket names, tests, or user isolation), this serializes unrelated servers. Consider namespacing by tmux socket path or username to avoid unnecessary contention.
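The FD-inheritance point can be checked directly. Since Python 3.4 (PEP 446), descriptors returned by `os.open()` are created non-inheritable, so the explicit `os.set_inheritable(fd, False)` call may turn out to be a no-op on current interpreters. The sketch below opens a throwaway file, not the real lock file, and records what the running interpreter reports.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        # What the interpreter gives us without any extra flags.
        inheritable_by_default = os.get_inheritable(fd)

        # Clearing the flag explicitly is harmless and documents the intent.
        os.set_inheritable(fd, False)
        inheritable_after = os.get_inheritable(fd)
    finally:
        os.close(fd)
```

Note that a plain `fork()` without `exec()` still shares the open file description, and therefore the lock, with the child regardless of this flag.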

[STYLE NOTES]

[openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py, Lines 21–24] The comment is fine, but the behavior would be clearer if the retry constants were lifted to module constants alongside the lock name for visibility.

Repro note
I tried to reproduce the libtmux race locally by spawning 20–60 parallel new_session() calls via multiprocessing and didn’t hit failures. That doesn’t disprove the issue — just means it’s load/latency sensitive and likely depends on the container’s tmux server state.

Verdict: ✅ Worth merging, but please consider tightening the lock scope and preventing FD inheritance to avoid hidden stalls.

Key insight: This is the right fix direction, but the lock should be as small and as safe as possible — otherwise we trade a rare race for steady contention.
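A narrower lock scope could look roughly like the following sketch, which takes the lock for each attempt and sleeps with it released. The names `create_with_narrow_lock` and `retry_on` are hypothetical; in the real code the retried exception would be `TmuxObjectDoesNotExist` and `create` would wrap `self.server.new_session(...)`.

```python
import fcntl
import os
import tempfile
import time


def create_with_narrow_lock(create, lock_path, max_retries=3, retry_delay=0.5,
                            retry_on=(Exception,)):
    """Call `create()` under an exclusive flock; back off with the lock released."""
    last_error = None
    for attempt in range(max_retries):
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o666)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)
            return create()
        except retry_on as e:
            last_error = e
        finally:
            os.close(fd)  # closing the descriptor releases the lock
        if attempt < max_retries - 1:
            time.sleep(retry_delay)  # other creators can proceed meanwhile
            retry_delay *= 2
    raise RuntimeError(
        f"Failed to create tmux session after {max_retries} attempts"
    ) from last_error


# Usage with a stand-in for new_session() that fails twice, then succeeds.
attempts = []


def flaky_create():
    attempts.append(1)
    if len(attempts) < 3:
        raise LookupError("Could not find object")
    return "session-ok"


with tempfile.TemporaryDirectory() as tmp:
    result = create_with_narrow_lock(
        flaky_create,
        os.path.join(tmp, "demo.lock"),
        retry_delay=0.01,
        retry_on=(LookupError,),
    )
```

The trade-off is that waiting processes may interleave with a failing creator's retries, which is exactly what holding the lock across the whole loop prevents.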

xingyaoww · 2026-02-04T14:34:48Z

@OpenHands do a /codereview-roasted, and consider this comment I raised: #1889 (comment)

openhands-ai · 2026-02-04T14:34:58Z

I'm on it! xingyaoww can track my progress at all-hands.dev

xingyaoww · 2026-02-04T14:37:44Z

openhands-tools/openhands/tools/terminal/terminal/tmux_terminal.py

+# Lock for serializing tmux session creation to prevent race conditions in libtmux
+# Uses both file lock (cross-process) and thread lock (same-process)
+_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
+_TMUX_THREAD_LOCK = threading.Lock()


/codereview-roasted

Taste rating: 🟡 Acceptable — fixes a real race condition, but the locking mechanism is over-engineered.

Linus-Style Analysis

[CRITICAL ISSUES]

None found. The core problem (libtmux race between new-session and list-sessions) is real and the fix is pragmatic.

[IMPROVEMENT OPPORTUNITIES]

1. [tmux_terminal.py, Lines 24, 66] The thread lock is redundant

xingyaoww's question is spot-on. According to flock(2) semantics:

"An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another file descriptor."

This means if thread A holds an exclusive flock() on fd1, thread B trying to acquire an exclusive flock() on fd2 (same file, separately opened descriptor) will block. The file lock alone handles both inter-process AND inter-thread synchronization, provided each caller opens its own descriptor, as this code does.
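This behavior is easy to confirm on Linux. The sketch below is illustrative only (the helper name `second_fd_is_blocked` is made up): one thread holds the lock while a second thread opens its own descriptor and tries a non-blocking acquire, so the check cannot hang.

```python
import fcntl
import os
import tempfile
import threading


def second_fd_is_blocked(path: str) -> bool:
    """Return True if a thread using its own descriptor cannot take the lock."""
    holder_fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    outcome = {}

    def contender():
        fd = os.open(path, os.O_RDWR)  # separate open file description
        try:
            # LOCK_NB makes flock() fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            outcome["blocked"] = False
        except BlockingIOError:
            outcome["blocked"] = True
        finally:
            os.close(fd)

    try:
        fcntl.flock(holder_fd, fcntl.LOCK_EX)
        worker = threading.Thread(target=contender)
        worker.start()
        worker.join()
    finally:
        fcntl.flock(holder_fd, fcntl.LOCK_UN)
        os.close(holder_fd)
    return outcome["blocked"]


with tempfile.TemporaryDirectory() as tmp:
    blocked = second_fd_is_blocked(os.path.join(tmp, "demo.lock"))
```

The guarantee depends on each thread calling `os.open()` itself. Threads that share one descriptor, or a `dup()` of it, refer to the same open file description and would not exclude each other.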

The thread lock adds:

Extra complexity (nested locking)

A module-level global that could cause issues with multiprocessing (fork inherits the lock state)

No actual benefit

Suggested simplification:

# Remove _TMUX_THREAD_LOCK entirely
# Just use the file lock:
lock_fd = None
try:
    lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR, 0o666)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    # ... session creation logic ...
finally:
    if lock_fd is not None:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)

2. [tmux_terminal.py, Line 23] Lock file path could conflict across users

The previous comment about multi-user environments is valid. While mode=0o666 helps with permissions, a shared /tmp/openhands-tmux-session.lock means:

User A's agent and User B's agent serialize against each other unnecessarily

They're using different tmux servers (different UIDs), so there's no actual race between them

Consider namespacing by UID:

_TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getuid()}.lock"

3. [tmux_terminal.py, Lines 71-96] Lock scope is wider than necessary

The lock is held during retry sleeps (0.5s, then 1s; with max_retries = 3 the doubled 2s delay is never slept). If the race condition is truly in libtmux's new_session() → from_session_id() sequence, you only need to serialize the new_session() call itself, not the retries.

However, I'd argue this is acceptable because:

Holding the lock during retries prevents thundering herd

If one process is failing, others waiting is probably the right behavior

The retry logic might be unnecessary entirely if the lock works

[STYLE NOTES]

4. Consider whether retries are still needed

With proper serialization, the race condition should be eliminated. The retry logic might be dead code now. Consider:

Keeping it as defense-in-depth (acceptable)

Removing it to simplify (preferred if you trust the lock)

Adding a comment explaining why it's kept

VERDICT

✅ Worth merging with minor changes: The fix addresses a real production issue. Suggested improvements:

Remove the thread lock — it's redundant and adds complexity

Namespace the lock file by UID — prevents unnecessary cross-user serialization

KEY INSIGHT

The thread lock is solving a problem that flock() already solves. When in doubt, trust the kernel's locking primitives over userspace reimplementations.

@OpenHands please do these:

Remove the thread lock — it's redundant and adds complexity
Namespace the lock file by UID — prevents unnecessary cross-user serialization

I'm on it! xingyaoww can track my progress at all-hands.dev

I've addressed both requested changes and pushed them to the PR branch:

Changes Made

✅ Removed the thread lock - Removed threading.Lock() and the with _TMUX_THREAD_LOCK: wrapper since fcntl.flock() already provides both cross-process AND intra-process synchronization (per flock(2) semantics).

✅ Namespaced the lock file by UID - Changed the lock file path from /tmp/openhands-tmux-session.lock to /tmp/openhands-tmux-session-{os.getuid()}.lock to prevent unnecessary cross-user serialization.

Summary of Changes

Removed import threading

Removed _TMUX_THREAD_LOCK = threading.Lock()

Changed _TMUX_LOCK_FILE to include UID: f"/tmp/openhands-tmux-session-{os.getuid()}.lock"

Removed the with _TMUX_THREAD_LOCK: wrapper, keeping only the file lock

The changes have been pushed to the fix/tmux-session-race-condition branch and will update PR #1889.


openhands-ai · 2026-02-04T14:38:17Z

I've completed the /codereview-roasted for PR #1889, addressing xingyaoww's comment about whether both the thread lock and file lock are necessary.

Summary

I posted a detailed code review comment on the PR at: #1889 (comment)

Key Findings

Taste Rating: 🟡 Acceptable — The fix addresses a real race condition in libtmux, but the locking mechanism is over-engineered.

Main Points:

The thread lock is redundant - xingyaoww's question is correct. According to flock(2) semantics, file locks block even when different threads in the same process open separate file descriptors to the same file. The file lock alone handles both inter-process AND inter-thread synchronization.
Lock file path should be namespaced by UID - A shared /tmp/openhands-tmux-session.lock causes unnecessary serialization between different users who are using different tmux servers anyway.
Lock scope during retries is acceptable - While the lock is held during retry sleeps, this prevents thundering herd and is reasonable behavior.

Recommendations:

Remove _TMUX_THREAD_LOCK entirely (it adds complexity with no benefit)
Change lock file to f"/tmp/openhands-tmux-session-{os.getuid()}.lock"

Verdict: ✅ Worth merging with the suggested minor simplifications.

No code changes were made to the branch - this was a review-only task as requested.


- Remove redundant thread lock (fcntl.flock provides both cross-process and intra-process synchronization)
- Namespace lock file by UID to prevent unnecessary cross-user serialization

xingyaoww

This version LGTM (i asked OH to make some changes) - @neubig feel free to merge if it also looks good to you

github-actions · 2026-02-04T14:55:33Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-02-04 15:10:40 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	24.2s	$0.03
01_standalone_sdk/03_activate_skill.py	✅ PASS	21.7s	$0.03
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	9.4s	$0.01
01_standalone_sdk/07_mcp_integration.py	✅ PASS	30.6s	$0.03
01_standalone_sdk/09_pause_example.py	✅ PASS	12.1s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	24.8s	$0.02
01_standalone_sdk/11_async.py	✅ PASS	34.6s	$0.03
01_standalone_sdk/12_custom_secrets.py	✅ PASS	13.9s	$0.01
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	21.2s	$0.02
01_standalone_sdk/14_context_condenser.py	✅ PASS	4m 14s	$0.53
01_standalone_sdk/17_image_input.py	✅ PASS	15.0s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	26.0s	$0.02
01_standalone_sdk/19_llm_routing.py	✅ PASS	11.3s	$0.02
01_standalone_sdk/20_stuck_detector.py	✅ PASS	13.5s	$0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	8.8s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	16.8s	$0.01
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	52.1s	$0.01
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	5m 15s	$0.38
01_standalone_sdk/25_agent_delegation.py	❌ FAIL Timed out after 600 seconds	10m 0s	$0.56
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	23.7s	$0.02
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	39.8s	$0.02
01_standalone_sdk/29_llm_streaming.py	✅ PASS	57.7s	$0.02
01_standalone_sdk/30_tom_agent.py	❌ FAIL Exit code 1	2.4s	--
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	4m 10s	$0.31
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	15.7s	$0.02
01_standalone_sdk/34_critic_example.py	✅ PASS	9.7s	$0.00
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	40.4s	$0.03
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	40.8s	--
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	27.5s	--
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	❌ FAIL Exit code 1	5m 11s	--
02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	12.7s	--
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	45.0s	$0.03
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	❌ FAIL Exit code 1	2m 39s	--
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	26.0s	$0.02
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	1m 9s	$0.06
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	8.9s	$0.01
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	5.4s	$0.01

❌ Some tests failed

Total: 37 | Passed: 30 | Failed: 7 | Total Cost: $2.34

Failed examples:

examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1


xingyaoww · 2026-02-04T15:12:50Z

@OpenHands please investigate #1889 (comment) and tell me WHY the example failed for 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py?

openhands-ai · 2026-02-04T15:13:01Z

I'm on it! xingyaoww can track my progress at all-hands.dev

openhands-ai · 2026-02-04T15:15:38Z

Why the example `02_convo_with_docker_sandboxed_server.py` failed

The failure is NOT related to the tmux session race condition fix in this PR. It's a schema mismatch between the SDK client and the agent server.

Root Cause

The error from the logs is:

HTTP request failed (422 Unprocessable Content): {'detail': [
  {'type': 'extra_forbidden', 'loc': ['body', 'content', 0, 'TextContent', 'enable_truncation'], 
   'msg': 'Extra inputs are not permitted', 'input': True},
  ...
]}

What's happening:

The example script uses a hardcoded image: ghcr.io/openhands/agent-server:latest-python (line 50)
This latest-python tag points to the latest released version of the agent server
The SDK code being tested is from the PR branch, which includes a new field enable_truncation in TextContent
When the SDK client sends a message with enable_truncation: True, the older agent server rejects it as an unknown field (HTTP 422)

How other examples select their image

04_convo_with_api_sandboxed_server.py uses GITHUB_SHA to dynamically select the PR-specific image:

server_image_sha = os.getenv("GITHUB_SHA") or "main"
server_image = f"ghcr.io/openhands/agent-server:{server_image_sha[:7]}-python-amd64"

07_convo_with_cloud_workspace.py passed because it uses a cloud workspace that likely has the updated server

The fix needed (not in this PR's scope)

The Docker-based examples (02_convo_with_docker_sandboxed_server.py, 03_browser_use_with_docker_sandboxed_server.py, 05_vscode_with_docker_sandboxed_server.py, 08_convo_with_apptainer_sandboxed_server.py) should be updated to use the PR-specific image tag similar to how 04_convo_with_api_sandboxed_server.py does it, or use an environment variable to override the image.
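One possible shape for that change is sketched below. It is an assumption, not the code that was committed: `resolve_server_image` and the `AGENT_SERVER_IMAGE` override variable are hypothetical names, while the tag format and the `latest-python` default are taken from the examples discussed above.

```python
import os

# Image the Docker-based examples currently hardcode.
DEFAULT_IMAGE = "ghcr.io/openhands/agent-server:latest-python"


def resolve_server_image(env=None) -> str:
    """Pick the agent-server image: explicit override, then CI SHA, then default."""
    env = os.environ if env is None else env

    override = env.get("AGENT_SERVER_IMAGE")  # hypothetical variable name
    if override:
        return override

    sha = env.get("GITHUB_SHA")
    if sha:
        # Same tag pattern as 04_convo_with_api_sandboxed_server.py.
        return f"ghcr.io/openhands/agent-server:{sha[:7]}-python-amd64"

    return DEFAULT_IMAGE
```

Outside CI neither variable is set, so local runs keep using the released image.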

Summary

This is a pre-existing test infrastructure issue where some examples use latest-python instead of the PR-specific image tag. The tmux race condition fix in this PR is unrelated to this failure.


When running in CI (GITHUB_SHA is set), use the PR-specific agent-server image tag to ensure the SDK client and server are in sync. This prevents schema mismatch errors like 'enable_truncation' field not being recognized by the server. The examples now use the same pattern as 04_convo_with_api_sandboxed_server.py. Co-authored-by: openhands <[email protected]>

github-actions · 2026-02-04T15:24:41Z

🔄 Running Examples with `openhands/claude-haiku-4-5-20251001`

Generated: 2026-02-04 15:46:02 UTC

Example	Status	Duration	Cost
01_standalone_sdk/02_custom_tools.py	✅ PASS	27.6s	$0.03
01_standalone_sdk/03_activate_skill.py	✅ PASS	40.8s	$0.03
01_standalone_sdk/05_use_llm_registry.py	✅ PASS	10.5s	$0.01
01_standalone_sdk/07_mcp_integration.py	✅ PASS	30.7s	$0.03
01_standalone_sdk/09_pause_example.py	✅ PASS	19.4s	$0.01
01_standalone_sdk/10_persistence.py	✅ PASS	31.6s	$0.04
01_standalone_sdk/11_async.py	✅ PASS	34.8s	$0.04
01_standalone_sdk/12_custom_secrets.py	✅ PASS	15.7s	$0.02
01_standalone_sdk/13_get_llm_metrics.py	✅ PASS	22.6s	$0.03
01_standalone_sdk/14_context_condenser.py	❌ FAIL Timed out after 600 seconds	10m 0s	--
01_standalone_sdk/17_image_input.py	✅ PASS	12.7s	$0.02
01_standalone_sdk/18_send_message_while_processing.py	✅ PASS	17.1s	$0.01
01_standalone_sdk/19_llm_routing.py	✅ PASS	16.5s	$0.02
01_standalone_sdk/20_stuck_detector.py	✅ PASS	11.8s	$0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py	✅ PASS	10.7s	$0.00
01_standalone_sdk/22_anthropic_thinking.py	✅ PASS	24.6s	$0.02
01_standalone_sdk/23_responses_reasoning.py	✅ PASS	1m 18s	$0.02
01_standalone_sdk/24_planning_agent_workflow.py	✅ PASS	5m 31s	$0.45
01_standalone_sdk/25_agent_delegation.py	❌ FAIL Timed out after 600 seconds	10m 0s	$0.49
01_standalone_sdk/26_custom_visualizer.py	✅ PASS	19.9s	$0.02
01_standalone_sdk/28_ask_agent_example.py	✅ PASS	26.5s	$0.03
01_standalone_sdk/29_llm_streaming.py	✅ PASS	37.8s	$0.03
01_standalone_sdk/30_tom_agent.py	❌ FAIL Exit code 1	2.1s	--
01_standalone_sdk/31_iterative_refinement.py	✅ PASS	3m 46s	$0.29
01_standalone_sdk/32_configurable_security_policy.py	✅ PASS	17.4s	$0.02
01_standalone_sdk/34_critic_example.py	✅ PASS	8.1s	$0.00
02_remote_agent_server/01_convo_with_local_agent_server.py	✅ PASS	53.3s	$0.05
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	39.1s	--
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	12.7s	--
02_remote_agent_server/04_convo_with_api_sandboxed_server.py	❌ FAIL Exit code 1	1m 51s	--
02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py	❌ FAIL Exit code 1	16.3s	--
02_remote_agent_server/07_convo_with_cloud_workspace.py	✅ PASS	28.1s	$0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py	❌ FAIL Exit code 1	2m 42s	--
04_llm_specific_tools/01_gpt5_apply_patch_preset.py	✅ PASS	25.9s	$0.02
04_llm_specific_tools/02_gemini_file_tools.py	✅ PASS	55.9s	$0.06
05_skills_and_plugins/01_loading_agentskills/main.py	✅ PASS	14.6s	$0.02
05_skills_and_plugins/02_loading_plugins/main.py	✅ PASS	6.5s	$0.01

❌ Some tests failed

Total: 37 | Passed: 29 | Failed: 8 | Total Cost: $1.84

Failed examples:

examples/01_standalone_sdk/14_context_condenser.py: Timed out after 600 seconds
examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1


openhands-ai · 2026-02-04T15:28:04Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Agent Server

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1889 at branch `fix/tmux-session-race-condition`

Feel free to include any additional details that might help me get this PR into a better state.


xingyaoww · 2026-02-04T15:53:33Z

Investigation: Why `02_convo_with_docker_sandboxed_server.py` Failed

Root Cause Analysis

The failure is NOT related to the schema mismatch issue I fixed earlier. After my fix, the examples now correctly use the PR-specific agent-server image (ghcr.io/openhands/agent-server:aedb881-python-amd64).

The actual failure is the tmux race condition that this PR is supposed to fix:

[DOCKER] Tmux session creation failed (attempt 1/3), retrying in 0.5s: Could not find object
[DOCKER] Tmux session creation failed (attempt 2/3), retrying in 1.0s: Could not find object
[DOCKER] RuntimeError: Failed to create tmux session after 3 attempts

Key Observations

1. The file-based locking IS in the image - The PR's changes are included in the merge commit aedb881.
2. The race condition is happening within a single session creation - There's only ONE conversation being created, but all 3 retry attempts fail. This suggests the issue is not cross-process contention (which the file lock addresses), but rather an issue within libtmux itself or the Docker environment.
3. Local agent server works, Docker fails:
   - 01_convo_with_local_agent_server.py - ✅ PASSED
   - 02_convo_with_docker_sandboxed_server.py - ❌ FAILED
4. Timing analysis from logs:
   - 15:29:36,632 - TmuxTerminal detected
   - 15:29:36,642 - Attempt 1 fails (10ms later)
   - 15:29:37,153 - Attempt 2 fails (511ms later)
   - 15:29:38,164 - Attempt 3 fails (1011ms later)
5. All 3 attempts fail consistently, suggesting a fundamental issue rather than a timing race.
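
The gaps quoted in the timing analysis can be recomputed from the four timestamps. The sketch below is illustrative only: the `EVENTS` list and the `gaps_ms` helper are names made up for this example, and the timestamps are copied from the log excerpt above.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpt (format HH:MM:SS,mmm).
EVENTS = [
    ("TmuxTerminal detected", "15:29:36,632"),
    ("Attempt 1 fails", "15:29:36,642"),
    ("Attempt 2 fails", "15:29:37,153"),
    ("Attempt 3 fails", "15:29:38,164"),
]


def gaps_ms(events):
    """Return (label, whole milliseconds since the previous event)."""
    times = [datetime.strptime(ts, "%H:%M:%S,%f") for _, ts in events]
    one_ms = timedelta(milliseconds=1)
    return [
        (events[i][0], (times[i] - times[i - 1]) // one_ms)
        for i in range(1, len(events))
    ]


for label, gap in gaps_ms(EVENTS):
    print(f"{label}: +{gap} ms")
```

The second and third gaps are roughly the logged retry delays (0.5s and 1.0s) plus about 10 ms, which is consistent with each attempt failing almost immediately rather than losing a timing race.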

Hypotheses

Docker-specific tmux issue: The tmux server in the Docker container might not be fully initialized or might have different behavior than on the host.
libtmux internal race: The new_session() method creates a session then immediately queries for it. Even with external locking, this internal race can still occur.
Environment differences: Something in the Docker container's environment might be affecting tmux behavior.

Next Steps

I'm going to:

Pull the Docker image locally and investigate tmux behavior inside the container
Review git commits since the 1.10.0 release to identify any changes that could have introduced this issue

xingyaoww · 2026-02-04T15:59:09Z

🔍 Root Cause Found: Zombie Processes in Docker Container

I've reproduced and identified the root cause of the tmux session creation failures.

The Problem

When running the agent-server in Docker, tmux server processes become zombies and are not properly reaped:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
openhan+       1  0.9  0.0   2856  1920 ?        Ss   15:57   0:00 /usr/local/bin/openhands-agent-server --host 0.0.0.0 --port 8000
openhan+      90  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
openhan+     100  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
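
A rough way to spot the same symptom programmatically is to scan `ps aux` style output for rows whose STAT column starts with `Z`. This is a sketch, not part of the PR: `find_zombies` and `SAMPLE` are hypothetical names, and the sample text is the listing above.

```python
SAMPLE = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
openhan+       1  0.9  0.0   2856  1920 ?        Ss   15:57   0:00 /usr/local/bin/openhands-agent-server --host 0.0.0.0 --port 8000
openhan+      90  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
openhan+     100  0.0  0.0      0     0 ?        Zs   15:58   0:00 [tmux: server] <defunct>
"""


def find_zombies(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every row whose STAT starts with 'Z'."""
    zombies = []
    for row in ps_output.strip().splitlines()[1:]:  # skip the header row
        # `ps aux` prints 11 columns; COMMAND is last and may contain spaces.
        fields = row.split(None, 10)
        if len(fields) < 11:
            continue
        pid, stat, command = fields[1], fields[7], fields[10]
        if stat.startswith("Z"):
            zombies.append((int(pid), command))
    return zombies


print(find_zombies(SAMPLE))
```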

Why This Happens

PID 1 is the agent-server binary - In Docker, the main process runs as PID 1
PID 1 has special responsibilities - It must handle SIGCHLD signals to reap zombie child processes
The agent-server binary doesn't reap children - When tmux spawns server processes, they become zombies after exiting
libtmux race condition - When new_session() creates a session, the tmux server process may exit and become a zombie before libtmux can query it, causing TmuxObjectDoesNotExist

Evidence

The tmux session IS created (visible via tmux list-sessions)
But libtmux can't find it immediately after creation
Multiple zombie [tmux: server] <defunct> processes accumulate

Solution Options

Use --init flag when running Docker - This adds a proper init process (tini) as PID 1:
```
docker run --init -d ... ghcr.io/openhands/agent-server:...
```
Add tini to the Docker image - Include tini in the Dockerfile and use it as the entrypoint
Handle SIGCHLD in the agent-server - Add signal handling to reap zombie processes
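
For the third option, a minimal sketch of what reaping on SIGCHLD could look like in a Python process running as PID 1. This is an assumption about one possible approach, not the agent-server's actual code; `reap_children` and `install_reaper` are hypothetical names.

```python
import os
import signal


def reap_children(signum=None, frame=None) -> list[int]:
    """Reap every child that has already exited and return their PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_reaper() -> None:
    """Run reap_children whenever the kernel delivers SIGCHLD."""
    signal.signal(signal.SIGCHLD, reap_children)
```

One caveat with this design: `os.waitpid(-1, ...)` collects the exit status of any child, including ones started through `subprocess`, so those callers may no longer see the real return code. That is one reason an init process such as tini can be the simpler choice.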

Immediate Fix for Examples

The Docker-based examples should use --init flag. I'll update the DockerWorkspace to include this.
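
As an illustration of the intended change, the sketch below builds a `docker run` argument list that includes `--init`. The `docker_run_command` helper is hypothetical and is not DockerWorkspace's real API; the port and image tag are taken from the PR description.

```python
def docker_run_command(
    image: str, *, init: bool = True, port: int = 8000
) -> list[str]:
    """Build a `docker run` argv; `--init` makes Docker run an init as PID 1."""
    cmd = ["docker", "run", "-d", "--rm"]
    if init:
        cmd.append("--init")
    # Options must come before the image name.
    cmd += ["-p", f"{port}:{port}", image]
    return cmd


print(" ".join(docker_run_command("ghcr.io/openhands/agent-server:aedb881-python")))
```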

Note

This is NOT a bug in the tmux locking code in this PR. The file-based locking is working correctly. The issue is that the Docker container environment doesn't properly handle child process cleanup, which causes the libtmux race condition to manifest even for single session creation.

xingyaoww · 2026-02-04T16:01:15Z

Seems like this is not the root fix, but I think this is a net improvement, so merging it.

openhands-agent added 6 commits February 3, 2026 18:25

debug: add detailed logging for tmux session creation lock

cea93ca

style: fix formatting issues

e3039c2

ci: trigger rebuild with latest code

1abe3d7

neubig force-pushed the fix/tmux-session-race-condition branch from 2b32a22 to 8e576fb Compare February 3, 2026 20:50

neubig force-pushed the fix/tmux-session-race-condition branch from 8e576fb to eb1b0a1 Compare February 3, 2026 20:57

neubig requested a review from xingyaoww February 3, 2026 20:58

Merge branch 'main' into fix/tmux-session-race-condition

525215c

neubig marked this pull request as ready for review February 3, 2026 20:58

neubig changed the title ~~DRAFT: fix: serialize tmux session creation to prevent race conditions~~ fix: serialize tmux session creation to prevent race conditions Feb 3, 2026

all-hands-bot reviewed Feb 3, 2026


fix: address review comments for tmux lock file

949d6b8

- Move os.open() inside try block to prevent fd leak on exception
- Add mode=0o666 for multi-user compatibility


Merge branch 'main' into fix/tmux-session-race-condition

f81e204

xingyaoww reviewed Feb 4, 2026


refactor: simplify tmux session locking

db3928a

- Remove redundant thread lock (fcntl.flock provides both cross-process and intra-process synchronization)
- Namespace lock file by UID to prevent unnecessary cross-user serialization
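
A minimal sketch of the locking pattern these commits describe, under stated assumptions: `session_creation_lock` and `LOCK_PATH` are hypothetical names, the use of `os.getuid()` and mode `0o600` is a guess at one reasonable per-user layout, and the merged code may differ. The point it illustrates is that `flock` also serializes threads in the same process, provided each caller opens its own file descriptor.

```python
import fcntl
import os
from contextlib import contextmanager

# Assumption: one lock file per UID, so different users never block each other.
LOCK_PATH = f"/tmp/openhands-tmux-session-{os.getuid()}.lock"


@contextmanager
def session_creation_lock(path: str = LOCK_PATH):
    """Hold an exclusive flock on `path` for the duration of the with-block."""
    fd = None
    try:
        # Opening inside the try block means the fd is closed on any failure.
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield
    finally:
        if fd is not None:
            os.close(fd)  # closing the descriptor releases the lock
```

A caller would wrap only the session creation itself, for example `with session_creation_lock(): ...`, keeping the critical section as short as possible.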

xingyaoww approved these changes Feb 4, 2026


xingyaoww added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026

xingyaoww and others added 2 commits February 4, 2026 23:18

Merge branch 'main' into fix/tmux-session-race-condition

f77e99e

xingyaoww removed the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026

xingyaoww added the test-examples Run all applicable "examples/" files. Expensive operation. label Feb 4, 2026 — with OpenHands AI

xingyaoww merged commit 20291bd into main Feb 4, 2026
47 of 49 checks passed

xingyaoww deleted the fix/tmux-session-race-condition branch February 4, 2026 16:01

xingyaoww mentioned this pull request Feb 4, 2026

Docker/Apptainer sandboxed server example tests failing in CI #1886

Open

	_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"
	_TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getenv('USER', 'default')}.lock"

		lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR)
		try:
