fix: serialize tmux session creation to prevent race conditions#1889
fix: serialize tmux session creation to prevent race conditions#1889
Conversation
When many parallel agents start simultaneously, they all attempt to create tmux sessions at once. The libtmux library has race conditions where new_session() can create a session but then fail to find it immediately, throwing TmuxObjectDoesNotExist. This fix adds a global threading.Lock to serialize tmux session creation, eliminating the race condition. Co-authored-by: openhands <[email protected]>
…nization The threading.Lock() only works within a single process. When uvicorn spawns multiple workers or when there are multiple Python processes, each has its own lock. This change uses fcntl.flock() on a shared lock file, which works across processes since they all share the same filesystem.
…ialization Use fcntl.flock() for cross-process synchronization of tmux session creation. This prevents race conditions when multiple processes (e.g., uvicorn workers) try to create tmux sessions simultaneously. The file lock is combined with a threading lock to handle both: - Multi-process concurrency (file lock) - Multi-thread concurrency within same process (thread lock)
2b32a22 to
8e576fb
Compare
Serialize tmux session creation using fcntl.flock() to prevent race conditions when multiple agents share the same Docker workspace. The libtmux library has a race condition where new_session() can create a session but then fail to find it immediately. This fix ensures only one session is created at a time within the container. Co-authored-by: openhands <[email protected]>
8e576fb to
eb1b0a1
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
Good approach to fixing the race condition! The dual-lock strategy (thread + file) is sound. Found a few issues that should be addressed before merging.
|
|
||
| # Lock for serializing tmux session creation to prevent race conditions in libtmux | ||
| # Uses both file lock (cross-process) and thread lock (same-process) | ||
| _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock" |
There was a problem hiding this comment.
🟠 Important: Lock file permissions could cause issues in multi-user environments.
If user A creates this file first, user B may not have permission to access it. Consider using a user-specific lock file path:
| _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock" | |
| _TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getenv(\"USER\", \"default\")}.lock" |
Alternatively, ensure the lock file has permissive permissions (0o666) when created.
There was a problem hiding this comment.
^this could be real
Also if we use file-based lock, why do we need thread lock? Are both of them necessary at the same time?
| lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR) | ||
| try: |
There was a problem hiding this comment.
🟠 Important: Potential file descriptor leak if an exception occurs between os.open() and the try block.
Move the os.open() inside the try block to ensure the fd is always closed:
| lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR) | |
| try: | |
| try: | |
| lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR) |
| for attempt in range(max_retries): | ||
| # Serialize tmux session creation to prevent libtmux race conditions | ||
| with _TMUX_THREAD_LOCK: | ||
| lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR) |
There was a problem hiding this comment.
🟡 Suggestion: Consider setting explicit permissions on the lock file for better multi-user compatibility.
You could add mode=0o666 or use os.chmod() after creation to ensure all users can access the lock file.
| # Serialize tmux session creation to prevent libtmux race conditions | ||
| with _TMUX_THREAD_LOCK: | ||
| lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR) | ||
| try: | ||
| self.session = self.server.new_session( | ||
| session_name=session_name, | ||
| start_directory=self.work_dir, | ||
| kill_session=True, | ||
| x=1000, | ||
| y=1000, | ||
| ) | ||
| break | ||
| except TmuxObjectDoesNotExist as e: | ||
| last_error = e | ||
| if attempt < max_retries - 1: | ||
| logger.warning( | ||
| f"Tmux session creation failed (attempt {attempt + 1}/" | ||
| f"{max_retries}), retrying in {retry_delay}s: {e}" | ||
| ) | ||
| time.sleep(retry_delay) | ||
| retry_delay *= 2 # Exponential backoff | ||
| else: | ||
| raise RuntimeError( | ||
| f"Failed to create tmux session after {max_retries} attempts" | ||
| ) from last_error | ||
| fcntl.flock(lock_fd, fcntl.LOCK_EX) | ||
| max_retries = 3 | ||
| retry_delay = 0.5 | ||
| last_error = None | ||
| for attempt in range(max_retries): | ||
| try: | ||
| self.session = self.server.new_session( | ||
| session_name=session_name, | ||
| start_directory=self.work_dir, | ||
| kill_session=True, | ||
| x=1000, | ||
| y=1000, | ||
| ) | ||
| break | ||
| except TmuxObjectDoesNotExist as e: | ||
| last_error = e | ||
| if attempt < max_retries - 1: | ||
| logger.warning( | ||
| f"Tmux session creation failed (attempt {attempt + 1}/" | ||
| f"{max_retries}), retrying in {retry_delay}s: {e}" | ||
| ) | ||
| time.sleep(retry_delay) | ||
| retry_delay *= 2 | ||
| else: | ||
| raise RuntimeError( | ||
| f"Failed to create tmux session after {max_retries} attempts" | ||
| ) from last_error | ||
| finally: | ||
| fcntl.flock(lock_fd, fcntl.LOCK_UN) | ||
| os.close(lock_fd) |
There was a problem hiding this comment.
🟡 Suggestion: The retry logic might be redundant now with proper locking in place.
The retries were originally added to work around the race condition that the locks now prevent. Consider removing the retry logic once this is proven stable in production, as it adds complexity and delay for genuine errors.
That said, keeping it as a defensive measure is reasonable if you want extra safety.
| last_error = None | ||
| for attempt in range(max_retries): | ||
| # Serialize tmux session creation to prevent libtmux race conditions | ||
| with _TMUX_THREAD_LOCK: |
There was a problem hiding this comment.
🟢 Nit: Lock ordering is correct (thread lock before file lock). Good practice to avoid potential deadlocks.
|
@OpenHands reflect all the review comments, and try to keep the overall PR changes as concise as possible |
|
I'm on it! neubig can track my progress at all-hands.dev |
- Move os.open() inside try block to prevent fd leak on exception - Add mode=0o666 for multi-user compatibility
|
I've addressed the review comments on PR #1889 and pushed the changes. Here's a summary: Review Comments Addressed
Changes MadeThe commit adds only 5 lines and removes 3 lines, keeping the changes minimal and focused:
The changes have been pushed to the |
|
@OpenHands Do a /codereview-roasted on this PR. Do not exaggerate, investigate Post your feedback as a comment on the PR. Note that it will be rendered as markdown. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
/codereview-roasted Taste rating: 🟡 Acceptable — fixes a real race, but the locking is a bit heavy‑handed and could be safer. Linus‑style analysis [CRITICAL ISSUES]
[IMPROVEMENT OPPORTUNITIES]
[STYLE NOTES]
Repro note Verdict: ✅ Worth merging, but please consider tightening the lock scope and preventing FD inheritance to avoid hidden stalls. Key insight: This is the right fix direction, but the lock should be as small and as safe as possible — otherwise we trade a rare race for steady contention. |
This comment was marked as duplicate.
This comment was marked as duplicate.
|
@OpenHands do a /codereview-roasted, consider these comment i raised: #1889 (comment) |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
| # Lock for serializing tmux session creation to prevent race conditions in libtmux | ||
| # Uses both file lock (cross-process) and thread lock (same-process) | ||
| _TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock" | ||
| _TMUX_THREAD_LOCK = threading.Lock() |
There was a problem hiding this comment.
/codereview-roasted
Taste rating: 🟡 Acceptable — fixes a real race condition, but the locking mechanism is over-engineered.
Linus-Style Analysis
[CRITICAL ISSUES]
- None found. The core problem (libtmux race between
new-sessionandlist-sessions) is real and the fix is pragmatic.
[IMPROVEMENT OPPORTUNITIES]
1. [tmux_terminal.py, Lines 24, 66] The thread lock is redundant
xingyaoww's question is spot-on. According to flock(2) semantics:
"An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another file descriptor."
This means if thread A holds an exclusive flock() on fd1, thread B trying to acquire an exclusive flock() on fd2 (same file, different fd) will block. The file lock alone handles both inter-process AND intra-thread synchronization.
The thread lock adds:
- Extra complexity (nested locking)
- A module-level global that could cause issues with multiprocessing (fork inherits the lock state)
- No actual benefit
Suggested simplification:
# Remove _TMUX_THREAD_LOCK entirely
# Just use the file lock:
lock_fd = None
try:
lock_fd = os.open(_TMUX_LOCK_FILE, os.O_CREAT | os.O_RDWR, 0o666)
fcntl.flock(lock_fd, fcntl.LOCK_EX)
# ... session creation logic ...
finally:
if lock_fd is not None:
fcntl.flock(lock_fd, fcntl.LOCK_UN)
os.close(lock_fd)2. [tmux_terminal.py, Line 23] Lock file path could conflict across users
The previous comment about multi-user environments is valid. While mode=0o666 helps with permissions, a shared /tmp/openhands-tmux-session.lock means:
- User A's agent and User B's agent serialize against each other unnecessarily
- They're using different tmux servers (different UIDs), so there's no actual race between them
Consider namespacing by UID:
_TMUX_LOCK_FILE = f"/tmp/openhands-tmux-session-{os.getuid()}.lock"3. [tmux_terminal.py, Lines 71-96] Lock scope is wider than necessary
The lock is held during retry sleeps (0.5s, 1s, 2s). If the race condition is truly in libtmux's new_session() → from_session_id() sequence, you only need to serialize the new_session() call itself, not the retries.
However, I'd argue this is acceptable because:
- Holding the lock during retries prevents thundering herd
- If one process is failing, others waiting is probably the right behavior
- The retry logic might be unnecessary entirely if the lock works
[STYLE NOTES]
4. Consider whether retries are still needed
With proper serialization, the race condition should be eliminated. The retry logic might be dead code now. Consider:
- Keeping it as defense-in-depth (acceptable)
- Removing it to simplify (preferred if you trust the lock)
- Adding a comment explaining why it's kept
VERDICT
✅ Worth merging with minor changes: The fix addresses a real production issue. Suggested improvements:
- Remove the thread lock — it's redundant and adds complexity
- Namespace the lock file by UID — prevents unnecessary cross-user serialization
KEY INSIGHT
The thread lock is solving a problem that flock() already solves. When in doubt, trust the kernel's locking primitives over userspace reimplementations.
There was a problem hiding this comment.
@OpenHands please do these:
Remove the thread lock — it's redundant and adds complexity
Namespace the lock file by UID — prevents unnecessary cross-user serialization
There was a problem hiding this comment.
I'm on it! xingyaoww can track my progress at all-hands.dev
There was a problem hiding this comment.
I've addressed both requested changes and pushed them to the PR branch:
Changes Made
✅ Removed the thread lock - Removed threading.Lock() and the with _TMUX_THREAD_LOCK: wrapper since fcntl.flock() already provides both cross-process AND intra-process synchronization (per flock(2) semantics).
✅ Namespaced the lock file by UID - Changed the lock file path from /tmp/openhands-tmux-session.lock to /tmp/openhands-tmux-session-{os.getuid()}.lock to prevent unnecessary cross-user serialization.
Summary of Changes
- Removed
import threading - Removed
_TMUX_THREAD_LOCK = threading.Lock() - Changed
_TMUX_LOCK_FILEto include UID:f"/tmp/openhands-tmux-session-{os.getuid()}.lock" - Removed the
with _TMUX_THREAD_LOCK:wrapper, keeping only the file lock
The changes have been pushed to the fix/tmux-session-race-condition branch and will update PR #1889.
|
I've completed the /codereview-roasted for PR #1889, addressing xingyaoww's comment about whether both the thread lock and file lock are necessary. SummaryI posted a detailed code review comment on the PR at: #1889 (comment) Key FindingsTaste Rating: 🟡 Acceptable — The fix addresses a real race condition in libtmux, but the locking mechanism is over-engineered. Main Points:
Recommendations:
Verdict: ✅ Worth merging with the suggested minor simplifications. No code changes were made to the branch - this was a review-only task as requested. |
- Remove redundant thread lock (fcntl.flock provides both cross-process and intra-process synchronization) - Namespace lock file by UID to prevent unnecessary cross-user serialization
🔄 Running Examples with
|
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 24.2s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 21.7s | $0.03 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 9.4s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 30.6s | $0.03 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 12.1s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 24.8s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 34.6s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 13.9s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 21.2s | $0.02 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 4m 14s | $0.53 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 15.0s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 26.0s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 11.3s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 13.5s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 8.8s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 16.8s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 52.1s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 5m 15s | $0.38 |
| 01_standalone_sdk/25_agent_delegation.py | ❌ FAIL Timed out after 600 seconds |
10m 0s | $0.56 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 23.7s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 39.8s | $0.02 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 57.7s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ❌ FAIL Exit code 1 |
2.4s | -- |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 4m 10s | $0.31 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 15.7s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 9.7s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 40.4s | $0.03 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
40.8s | -- |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
27.5s | -- |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ❌ FAIL Exit code 1 |
5m 11s | -- |
| 02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
12.7s | -- |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 45.0s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ❌ FAIL Exit code 1 |
2m 39s | -- |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 26.0s | $0.02 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 1m 9s | $0.06 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 8.9s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 5.4s | $0.01 |
❌ Some tests failed
Total: 37 | Passed: 30 | Failed: 7 | Total Cost: $2.34
Failed examples:
- examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
- examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
- examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1
|
@OpenHands please investigate #1889 (comment) and tell me WHY the example failed for 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py? |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Why the example
|
When running in CI (GITHUB_SHA is set), use the PR-specific agent-server image tag to ensure the SDK client and server are in sync. This prevents schema mismatch errors like 'enable_truncation' field not being recognized by the server. The examples now use the same pattern as 04_convo_with_api_sandboxed_server.py. Co-authored-by: openhands <[email protected]>
🔄 Running Examples with
|
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 27.6s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 40.8s | $0.03 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 10.5s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 30.7s | $0.03 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 19.4s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 31.6s | $0.04 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 34.8s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 15.7s | $0.02 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 22.6s | $0.03 |
| 01_standalone_sdk/14_context_condenser.py | ❌ FAIL Timed out after 600 seconds |
10m 0s | -- |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 12.7s | $0.02 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 17.1s | $0.01 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 16.5s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 11.8s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 10.7s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 24.6s | $0.02 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 18s | $0.02 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 5m 31s | $0.45 |
| 01_standalone_sdk/25_agent_delegation.py | ❌ FAIL Timed out after 600 seconds |
10m 0s | $0.49 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 19.9s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 26.5s | $0.03 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 37.8s | $0.03 |
| 01_standalone_sdk/30_tom_agent.py | ❌ FAIL Exit code 1 |
2.1s | -- |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 3m 46s | $0.29 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 17.4s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 8.1s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 53.3s | $0.05 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
39.1s | -- |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
12.7s | -- |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ❌ FAIL Exit code 1 |
1m 51s | -- |
| 02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py | ❌ FAIL Exit code 1 |
16.3s | -- |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 28.1s | $0.02 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ❌ FAIL Exit code 1 |
2m 42s | -- |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 25.9s | $0.02 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 55.9s | $0.06 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 14.6s | $0.02 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 6.5s | $0.01 |
❌ Some tests failed
Total: 37 | Passed: 29 | Failed: 8 | Total Cost: $1.84
Failed examples:
- examples/01_standalone_sdk/14_context_condenser.py: Timed out after 600 seconds
- examples/01_standalone_sdk/25_agent_delegation.py: Timed out after 600 seconds
- examples/01_standalone_sdk/30_tom_agent.py: Exit code 1
- examples/02_remote_agent_server/02_convo_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/04_convo_with_api_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/05_vscode_with_docker_sandboxed_server.py: Exit code 1
- examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
Investigation: Why
|
🔍 Root Cause Found: Zombie Processes in Docker ContainerI've reproduced and identified the root cause of the tmux session creation failures. The ProblemWhen running the agent-server in Docker, tmux server processes become zombies and are not properly reaped: Why This Happens
Evidence
Solution Options
Immediate Fix for ExamplesThe Docker-based examples should use NoteThis is NOT a bug in the tmux locking code in this PR. The file-based locking is working correctly. The issue is that the Docker container environment doesn't properly handle child process cleanup, which causes the libtmux race condition to manifest even for single session creation. |
|
Seems like this is not the root fix, but i think this is a net improvement, so merging it.. |
Problem
When multiple parallel agents share the same Docker workspace (same container), they all attempt to create tmux sessions on the same tmux server simultaneously. The
libtmuxlibrary has a race condition wherenew_session()creates a session but then fails to find it immediately, throwingTmuxObjectDoesNotExist.This causes agents to fail with:
Solution
Use file-based locking (
fcntl.flock()) to serialize tmux session creation within the container. This ensures only one session is created at a time, eliminating the race condition.Changes
_TMUX_LOCK_FILE = "/tmp/openhands-tmux-session.lock"_TMUX_THREAD_LOCK = threading.Lock()for intra-process safetyTesting
Tested with 20 parallel conversations in a single Docker workspace - all tmux sessions created successfully (0 failures).
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:aedb881-pythonRun
All tags pushed for this build
About Multi-Architecture Support
aedb881-python) is a multi-arch manifest supporting both amd64 and arm64aedb881-python-amd64) are also available if needed