fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805
fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805GhimBoon wants to merge 1 commit intoChainlit:mainfrom
Conversation
There was a problem hiding this comment.
2 issues found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/chainlit/server.py">
<violation number="1" location="backend/chainlit/server.py:1422">
P2: exit_stack_stored is set to True before on_mcp_connect runs; if the callback raises, the finally block skips cleanup and the partially initialized session remains stored, leaking the exit stack/session.</violation>
</file>
<file name="backend/chainlit/session.py">
<violation number="1" location="backend/chainlit/session.py:34">
P2: BaseExceptionGroup is referenced without a compatibility import, but the project supports Python 3.10 where BaseExceptionGroup is undefined. This will raise NameError during exception matching on 3.10, breaking MCP exit stack cleanup.</violation>
</file>
Since this is your first cubic review, here's how it works:
- cubic automatically reviews your code and comments on bugs and improvements
- Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
- Add one-off context when rerunning by tagging
@cubic-dev-aiwith guidance or docs links (includingllms.txt) - Ask questions if you need clarification on any suggestion
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/chainlit/session.py">
<violation number="1" location="backend/chainlit/session.py:47">
P2: BaseExceptionGroup is no longer caught; since it inherits from BaseException (not Exception), BaseExceptionGroup-wrapped cancel-scope errors will bypass `except Exception`, so `_is_cancel_scope_error` never runs and cleanup errors can propagate again.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
Great contrib! @upa, @nigiva, @senj, @bartozl and @Brid9e seem to have run into this (from convo on the issue) and probably know more about it then I do. Quick review looks good but I'd prefer having a thumb's up from someone able and willing to replicate the problem and demonstrate the fix solves it! |
|
We run into this and currently patching it. The issue arises in Kubernetes deployment were CPU usage went to the roof |
|
We are also patching this, using the approach proposed here I'm not sure the one proposed in this PR will really solve the issue, from what I can I see, the |
2fc7646 to
b04a321
Compare
Each MCP connection now runs in its own asyncio.Task that owns the AsyncExitStack. The task enters transports, initializes the ClientSession, signals a ready event, then blocks on a stop event. On shutdown, the exit stack is closed in the same task that opened it, avoiding the cross-task cancel-scope corruption described in Chainlit#2182. Based on the solution proposed by @nigiva: Chainlit#2182 (comment) Changes: - Add McpSession dataclass (session.py) with stop_event, close(), and backward-compatible __iter__ for tuple unpacking - Rewrite connect_mcp (server.py) to launch connections in background tasks with ready/stop event coordination - Rewrite disconnect_mcp to use McpSession.close() - Update WebsocketSession.delete() to iterate and close McpSession objects - Catch BaseException (not just Exception) for streamablehttp transport errors (BaseExceptionGroup inherits from BaseException) - Use nested try/finally to ensure exit stack cleanup in all code paths - Return JSONResponse on errors instead of raising HTTPException (avoids issues with BaseHTTPMiddleware.call_next()) - Add comprehensive tests for McpSession lifecycle Co-authored-by: nigiva <nigiva@users.noreply.github.com>
b04a321 to
9bcc81d
Compare
|
Hi, I've taken the feedback and re-architected the whole PR with goodness from #2182 (comment). I've also attributed @nigiva in the commit. |
Summary
This change fixes intermittent MCP connection/disconnection failures that surfaced as:
Root cause
MCP AsyncExitStack cleanup was not consistently guaranteed in all failure paths, and some exceptions are wrapped in BaseExceptionGroup, which prevented targeted suppression/handling of the known cancel-scope mismatch case.
What changed
Validation
Impact
Fixes #2182
Summary by cubic
Run each MCP connection in its own background task that owns and closes its AsyncExitStack to eliminate cross-task cancel-scope errors, cleanup leaks, and intermittent 400s across connect, disconnect, and reconnect. Successful connections behave the same; shutdown and reconnection are now reliable.
Bug Fixes
Refactors
Written for commit 9bcc81d. Summary will update on new commits.