Skip to content

Conversation

@specture724
Copy link
Collaborator

resolve #38

send an Exception to PS when update_weights_from_ipc execute run failed. _update_per_bucket needs to ensure all processes not to get an Exception before continuing next updating iteration

@specture724 specture724 requested review from Copilot and weixiao-huang and removed request for Copilot October 29, 2025 10:01
@specture724 specture724 self-assigned this Oct 29, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements error handling and recovery mechanisms for distributed weight updates in the checkpoint engine. The changes add support for detecting and handling remote errors during weight updates, ensuring proper cleanup even when errors occur.

Key changes:

  • Added error handling in worker processes to send exceptions back through ZMQ sockets
  • Modified the parameter server to detect remote errors via non-empty responses and coordinate error recovery using all-reduce
  • Moved process group cleanup and cache clearing into a finally block to ensure proper resource cleanup on both success and failure paths

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
checkpoint_engine/worker.py Wrapped weight update execution in try-except to send exceptions back via ZMQ socket
checkpoint_engine/ps.py Added error detection via response checking, coordinated error handling with all-reduce, and moved cleanup to finally block
tests/test_update.py Added comprehensive test infrastructure for error scenarios with new test cases and subprocess-based execution
Comments suppressed due to low confidence (2)

tests/test_update.py:62

  • Variable weights is not used.
        weights = weights  # Do some fake processing

tests/test_update.py:62

  • This assignment assigns a variable to itself.
        weights = weights  # Do some fake processing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@specture724 specture724 force-pushed the fix/quit-when-error branch 2 times, most recently from 0fd4fe7 to 39807ce Compare November 10, 2025 06:37
@specture724 specture724 changed the title Fix/quit when error fix: force ps to quit when error occur during updating Nov 12, 2025
@blahgeek blahgeek self-requested a review November 17, 2025 08:56
@weixiao-huang
Copy link
Collaborator

LGTM

1 similar comment
@SongXiaoXi
Copy link
Collaborator

LGTM

Copy link
Collaborator

@blahgeek blahgeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@weixiao-huang weixiao-huang merged commit 67b0020 into MoonshotAI:main Nov 19, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Quit when error occur during updating

4 participants