Skip to content

fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805

Open
GhimBoon wants to merge 1 commit intoChainlit:mainfrom
GhimBoon:fix/mcp-exit-stack-leak
Open

fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805
GhimBoon wants to merge 1 commit intoChainlit:mainfrom
GhimBoon:fix/mcp-exit-stack-leak

Conversation

@GhimBoon
Copy link
Contributor

@GhimBoon GhimBoon commented Feb 26, 2026

Summary

This change fixes intermittent MCP connection/disconnection failures that surfaced as:

  • 400 responses during MCP lifecycle operations
  • RuntimeError: Attempted to exit cancel scope in a different task
  • orphaned cleanup failures after reconnect attempts

Root cause

MCP AsyncExitStack cleanup was not consistently guaranteed in all failure paths, and some exceptions are wrapped in BaseExceptionGroup, which prevented targeted suppression/handling of the known cancel-scope mismatch case.

What changed

  • Added robust MCP exit stack close handling that supports both RuntimeError and BaseExceptionGroup forms of the cancel-scope mismatch.
  • Added a recursive detector for cancel-scope mismatch inside nested exception groups.
  • Ensured failed connect flows always close the temporary exit stack in finally when ownership is not transferred.
  • Updated disconnect/session teardown paths to use the safe close helper consistently.
  • Guarded reconnect-time on_mcp_disconnect callback so callback failures do not break cleanup flow.
  • Expanded unit tests to cover:
    • BaseExceptionGroup-wrapped cancel-scope mismatch
    • recursive mismatch detection helper behavior
    • existing safe-close behavior

Validation

  • pytest tests/test_session.py -v → 46 passed
  • pytest tests/test_mcp.py -v → 39 passed

Impact

  • No functional change to successful MCP connections.
  • Improves stability and cleanup correctness on failed/partial MCP lifecycle events.
  • Reduces noisy task/cleanup errors and prevents resource leaks.

Fixes #2182


Summary by cubic

Run each MCP connection in its own background task that owns and closes its AsyncExitStack to eliminate cross-task cancel-scope errors, cleanup leaks, and intermittent 400s across connect, disconnect, and reconnect. Successful connections behave the same; shutdown and reconnection are now reliable.

  • Bug Fixes

    • Added McpSession with stop_event, client, task, and close() so the exit stack is closed in the same task; handles BaseException/BaseExceptionGroup.
    • Rewrote connect_mcp to validate client type, run in a background task with ready/stop events, return JSONResponse on errors, and safely reconnect by popping, callback-guarding, and closing the old session.
    • Updated disconnect_mcp and WebsocketSession.delete to call McpSession.close() with a timeout and cancel fallback for hanging tasks.
  • Refactors

    • Replaced (ClientSession, AsyncExitStack) tuples with McpSession; kept backward-compatible tuple-like iteration.
    • Simplified logging and error messages.

Written for commit 9bcc81d. Summary will update on new commits.

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working unit-tests Has unit tests. labels Feb 26, 2026
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/chainlit/server.py">

<violation number="1" location="backend/chainlit/server.py:1422">
P2: exit_stack_stored is set to True before on_mcp_connect runs; if the callback raises, the finally block skips cleanup and the partially initialized session remains stored, leaking the exit stack/session.</violation>
</file>

<file name="backend/chainlit/session.py">

<violation number="1" location="backend/chainlit/session.py:34">
P2: BaseExceptionGroup is referenced without a compatibility import, but the project supports Python 3.10 where BaseExceptionGroup is undefined. This will raise NameError during exception matching on 3.10, breaking MCP exit stack cleanup.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/chainlit/session.py">

<violation number="1" location="backend/chainlit/session.py:47">
P2: BaseExceptionGroup is no longer caught; since it inherits from BaseException (not Exception), BaseExceptionGroup-wrapped cancel-scope errors will bypass `except Exception`, so `_is_cancel_scope_error` never runs and cleanup errors can propagate again.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@dokterbob
Copy link
Collaborator

Great contrib!

@upa, @nigiva, @senj, @bartozl and @Brid9e seem to have run into this (from convo on the issue) and probably know more about it then I do.

Quick review looks good but I'd prefer having a thumb's up from someone able and willing to replicate the problem and demonstrate the fix solves it!

@Proteusiq
Copy link

We run into this and currently patching it. The issue arises in Kubernetes deployment were CPU usage went to the roof

@bartozl
Copy link

bartozl commented Mar 2, 2026

We are also patching this, using the approach proposed here

I'm not sure the one proposed in this PR will really solve the issue, from what I can I see, the safe_mcp_exit_stack_close function prevents the Attempted to exit cancel scope in a different task than it was entered in error to be propagated, but how it ensures that the underling resources are actually cleaned?

@GhimBoon GhimBoon force-pushed the fix/mcp-exit-stack-leak branch 3 times, most recently from 2fc7646 to b04a321 Compare March 2, 2026 13:06
Each MCP connection now runs in its own asyncio.Task that owns the
AsyncExitStack. The task enters transports, initializes the
ClientSession, signals a ready event, then blocks on a stop event.
On shutdown, the exit stack is closed in the same task that opened it,
avoiding the cross-task cancel-scope corruption described in Chainlit#2182.

Based on the solution proposed by @nigiva:
Chainlit#2182 (comment)

Changes:
- Add McpSession dataclass (session.py) with stop_event, close(), and
  backward-compatible __iter__ for tuple unpacking
- Rewrite connect_mcp (server.py) to launch connections in background
  tasks with ready/stop event coordination
- Rewrite disconnect_mcp to use McpSession.close()
- Update WebsocketSession.delete() to iterate and close McpSession objects
- Catch BaseException (not just Exception) for streamablehttp transport
  errors (BaseExceptionGroup inherits from BaseException)
- Use nested try/finally to ensure exit stack cleanup in all code paths
- Return JSONResponse on errors instead of raising HTTPException
  (avoids issues with BaseHTTPMiddleware.call_next())
- Add comprehensive tests for McpSession lifecycle

Co-authored-by: nigiva <nigiva@users.noreply.github.com>
@GhimBoon GhimBoon force-pushed the fix/mcp-exit-stack-leak branch from b04a321 to 9bcc81d Compare March 2, 2026 14:54
@GhimBoon
Copy link
Contributor Author

GhimBoon commented Mar 2, 2026

Hi,

I've taken the feedback and re-architected the whole PR with goodness from #2182 (comment). I've also attributed @nigiva in the commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files. unit-tests Has unit tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disconnecting an MCP server causes a CPU core to get stuck at 100%

4 participants