fix: use tcp store_based_barrier to control p2p update synchronization #51
Conversation
Pull Request Overview
This PR refactors the process group management in the ParameterServer.update method by introducing a TCP store-based barrier and using PyTorch's subgroup functionality instead of custom process group initialization for rank subsets.
Key changes:
- Replaced custom init_process_group_for_ranks with dist.new_group() for creating rank subgroups
- Added a store_based_barrier method to synchronize all ranks using the TCP store
- Removed the _get_bcast_rank_map function, now using global ranks directly with subgroups
- All collective operations now use the ranks_group parameter
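As a rough illustration of what a store-based barrier can look like on top of torch.distributed's TCPStore, here is a minimal sketch; the function name, key, and timeout below are illustrative assumptions, not necessarily the PR's exact implementation:

```python
import time

import torch.distributed as dist


def store_based_barrier(store: dist.TCPStore, key: str, world_size: int,
                        timeout_s: float = 300.0) -> None:
    """Block until `world_size` ranks have checked in on `key`."""
    # Each rank atomically increments the shared counter to announce arrival.
    store.add(key, 1)
    start = time.time()
    # add(key, 0) returns the current counter value without modifying it.
    while store.add(key, 0) < world_size:
        if time.time() - start > timeout_s:
            raise RuntimeError(f"store_based_barrier timed out waiting on '{key}'")
        time.sleep(0.01)
```

Because the barrier only needs the TCP store, every rank can reach it regardless of which subgroup it belongs to, which is what makes it usable as a final synchronization point in update.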
```python
if ranks_group:
    dist.destroy_process_group(ranks_group)
if self._auto_pg and dist.is_initialized():
    dist.destroy_process_group()
```
Is this necessary? I think the GPU memory from NCCL may already be released after dist.destroy_process_group(ranks_group), so dist.destroy_process_group() may not be necessary. Please test and check whether my view is correct.
No. If only dist.destroy_process_group(ranks_group) is called, 1306 MB of GPU memory remains, versus 980 MB when both are called.
But we could call only dist.destroy_process_group(). When no argument is given, it destroys all process groups, including ranks_group.
For the case where auto_pg == False, I think we'd better not leave ranks_group undestroyed.
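For illustration, a minimal sketch of the teardown order this thread converges on, written as a hypothetical helper rather than the PR's verbatim code (auto_pg and ranks_group mirror the names in the diff):

```python
import torch.distributed as dist


def destroy_groups(ranks_group, auto_pg: bool) -> None:
    if auto_pg and dist.is_initialized():
        # With no argument, destroy_process_group() tears down every process
        # group, ranks_group included, and de-initializes the default group
        # that was created automatically.
        dist.destroy_process_group()
    elif ranks_group is not None:
        # The default group belongs to the user, so only release the subgroup
        # created inside update(); leaving it alive would leak NCCL resources.
        dist.destroy_process_group(ranks_group)
```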
```python
self.init_process_group()

# if both ranks is None or [], it will use fully broadcast to update to all ranks
ranks_group = dist.new_group(ranks if ranks else None)
```
This will cause a compatibility problem. If a user does not use auto pg and initializes the process group only on a subset of ranks (using the same logic as init_process_group_for_ranks), this will break.
But whether we should stay compatible with that situation may need discussion.
I would assume that if the user initializes the PG themselves, the ranks param should also correspond to that PG? In which case it should be OK?
Hmm, maybe I was wrong.
Is there any documentation about which ranks should form a PG, which ranks should call update, and the meaning of ranks in the non-_auto_pg case?
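One torch.distributed property relevant to this question (a general behavior, not something specific to this PR): dist.new_group is itself a collective over the default group, so every rank of that group must call it with the same ranks list, and the list is interpreted as global ranks of the default group. A small sketch of the assumption this code path makes:

```python
import torch.distributed as dist


def make_subgroup(ranks):
    # Collective over the default group: every rank that joined
    # init_process_group() must make this call with the same `ranks` list,
    # even ranks that will not be members of the resulting subgroup.
    # `ranks` are global ranks of the default group; None (or an empty
    # list, mapped to None here) yields a group spanning all ranks.
    return dist.new_group(ranks if ranks else None)
```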
Add a TCP Store based barrier in the ParameterServer.update method. Rewrite the logic of process management. Deprecate the logic if self._rank not in ranks: return. Do a _store_based_barrier among all ranks to make sure they synchronize before leaving the update method. All communication is also done in a subgroup, deprecating the _get_bcast_rank_map and init_process_group_for_ranks methods.
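Putting the description together, a hedged sketch of the resulting update flow; the helper names self._store, self._rank, _store_based_barrier, and the choice of broadcast source are assumptions for illustration, not lifted from the diff:

```python
import torch.distributed as dist


def update(self, named_tensors: dict, ranks=None):
    # 1. Every rank creates the subgroup; new_group() is a collective call.
    ranks_group = dist.new_group(ranks if ranks else None)
    member_ranks = ranks if ranks else list(range(dist.get_world_size()))

    # 2. Only subgroup members take part in broadcasting the updated weights.
    if self._rank in member_ranks:
        for tensor in named_tensors.values():
            dist.broadcast(tensor, src=member_ranks[0], group=ranks_group)

    # 3. All ranks, members or not, meet at the TCP-store barrier so nobody
    #    leaves update() before the broadcast has completed everywhere.
    self._store_based_barrier(self._store, "ps_update", dist.get_world_size())

    # 4. Release the subgroup.
    dist.destroy_process_group(ranks_group)
```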