@SongXiaoXi
Collaborator

Since #43 we have been catching remote exceptions instead of letting them crash
the process. However, only the exception message is propagated back and logged;
the original traceback is silently dropped, which makes issues that happen in
remote workers hard to debug.

This PR changes the error handling to also return the full remote traceback
alongside the exception object, and ensures the parameter server logs it.
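
As a rough illustration of the idea (not the code in this PR), a worker-side wrapper could capture the traceback as a string next to the exception object. The names `run_remote` and `failing_task` and the result dict layout are hypothetical and chosen only for this sketch; `traceback.format_exception` is the standard-library call.

```python
import traceback


def run_remote(fn, *args, **kwargs):
    """Hypothetical worker-side wrapper: never raises, always returns a dict."""
    try:
        return {"ok": True, "value": fn(*args, **kwargs)}
    except Exception as exc:
        # Format the traceback here, while the frames are still attached to
        # the exception. A string survives serialization unchanged.
        tb_text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        return {"ok": False, "error": exc, "traceback": tb_text}


def failing_task():
    # Stand-in for real remote work that fails.
    raise ValueError("bad shard index")


result = run_remote(failing_task)
```

The formatted string is captured on the worker because a traceback object generally cannot be pickled and sent across processes, so formatting it at the point of failure is the simplest way to keep the frame information.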

Contributor

Copilot AI left a comment

Pull request overview

This PR improves error handling in the distributed checkpoint engine by propagating full exception tracebacks from remote workers to the parameter server, rather than just exception messages. This enhancement addresses a debugging issue introduced in PR #43 where remote exceptions were being caught but their tracebacks were silently dropped.

Key changes:

  • Workers now format full exception tracebacks as strings using traceback.format_exception() before sending them to the parameter server
  • Parameter server logs the complete traceback information instead of just exception type and message
  • Error messages are now more informative for debugging remote worker failures
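
The receiving side of the changes listed above might look roughly like the following sketch. It is an assumption-laden illustration, not the contents of `checkpoint_engine/ps.py`: the reply format (`status`, `traceback`), the logger name, and the functions `worker_reply` and `handle_reply` are all hypothetical.

```python
import logging
import traceback

logger = logging.getLogger("ps_sketch")  # hypothetical logger name


def worker_reply():
    """Hypothetical worker: returns a reply dict with a traceback string on failure."""
    try:
        {}["missing_key"]  # stand-in for failing remote work
    except Exception as exc:
        tb_text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        return {"status": "error", "traceback": tb_text}
    return {"status": "ok"}


def handle_reply(rank, reply):
    """Hypothetical parameter-server handler: logs the full remote traceback."""
    if reply["status"] == "error":
        # Log the string as received; no exception object needs to be rebuilt.
        logger.error("worker %d failed:\n%s", rank, reply["traceback"])
        return False
    return True
```

Calling `handle_reply(0, worker_reply())` returns `False` and emits an error-level log record whose message contains the remote `KeyError` traceback.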

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| checkpoint_engine/worker.py | Added traceback import and modified exception handlers to format and send complete traceback strings instead of exception objects |
| checkpoint_engine/ps.py | Updated error response handling to receive and log traceback strings instead of parsing exception objects |

@specture724
Collaborator

LGTM

@weixiao-huang weixiao-huang merged commit 44d5670 into MoonshotAI:main Dec 5, 2025
2 checks passed