Conversation

@specture724
Collaborator

Add a TCP store based barrier to the ParameterServer.update method and rewrite the process management logic. Remove the early return if self._rank not in ranks: return; instead, do a _store_based_barrier among all ranks so they synchronize before leaving update. All communication is now done in a subgroup, deprecating the _get_bcast_rank_map and init_process_group_for_ranks methods.
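
A store-based barrier of this kind is usually built on a shared key-value store: every rank atomically increments a counter under an agreed key, then polls until the counter reaches the number of participants. Below is a minimal sketch of the idea, assuming a torch.distributed.TCPStore with hypothetical host/port values; it is illustrative only, not the PR's actual _store_based_barrier.

import time
import torch.distributed as dist

def store_based_barrier(store, key, world_size, timeout_s=300.0):
    # Each rank registers itself by atomically incrementing the counter,
    # then polls until all world_size ranks have checked in.
    store.add(key, 1)
    deadline = time.monotonic() + timeout_s
    while store.add(key, 0) < world_size:  # add(key, 0) reads the current count
        if time.monotonic() > deadline:
            raise RuntimeError("store_based_barrier timed out on key " + key)
        time.sleep(0.01)

# Example wiring (hypothetical host/port; rank 0 hosts the store):
# store = dist.TCPStore("127.0.0.1", 29500, world_size, is_master=(rank == 0))
# store_based_barrier(store, "ps_update/step_0", world_size)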

@specture724 specture724 changed the title Fix/tcp store group Fix/tcp store based barrier Nov 11, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR refactors the process group management in the ParameterServer.update method by introducing a TCP store-based barrier and using PyTorch's subgroup functionality instead of custom process group initialization for rank subsets.

Key changes:

  • Replaced custom init_process_group_for_ranks with dist.new_group() for creating rank subgroups
  • Added store_based_barrier method to synchronize all ranks using TCP store
  • Removed _get_bcast_rank_map function, now using global ranks directly with subgroups
  • All collective operations now use the ranks_group parameter (see the flow sketch after this list)
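
Taken together, the changes above amount to roughly the flow below. This is an illustrative sketch with assumed names (update_weights, barrier_fn, ranks[0] as the broadcast source), not the PR's actual signature.

import torch.distributed as dist

def update_weights(weights, rank, ranks=None, barrier_fn=None):
    # new_group() must, by default, be entered by every rank in the default
    # group, even ranks that will not be members of the subgroup.
    ranks_group = dist.new_group(ranks if ranks else None)

    if not ranks or rank in ranks:
        src = ranks[0] if ranks else 0  # global rank of the broadcast source
        for tensor in weights:
            # Collectives are scoped via group=, so no rank remapping
            # (the old _get_bcast_rank_map) is needed.
            dist.broadcast(tensor, src=src, group=ranks_group)

    # Every rank, member or not, synchronizes here before returning,
    # replacing the old early return for ranks outside the subset.
    if barrier_fn is not None:
        barrier_fn()

    dist.destroy_process_group(ranks_group)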


@specture724 specture724 changed the title Fix/tcp store based barrier fix: use tcp store_based_barrier to control p2p update synchronization Nov 11, 2025
if ranks_group:
    dist.destroy_process_group(ranks_group)
if self._auto_pg and dist.is_initialized():
    dist.destroy_process_group()
Collaborator

Is this necessary? I think the NCCL GPU memory may already be released by dist.destroy_process_group(ranks_group), so the extra dist.destroy_process_group() may not be necessary. Please test and check whether my view is correct.

Collaborator Author

No. If only dist.destroy_process_group(ranks_group) is called, 1306 MB remains; when both are called, it drops to 980 MB.

Collaborator Author

But we could also call only dist.destroy_process_group(): when no argument is given, it destroys all process groups, including ranks_group.

Collaborator Author

For the case where auto_pg == False, though, I think we'd better not leave ranks_group undestroyed.
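
To summarize the outcome of this thread as code, a hypothetical standalone helper (auto_pg standing in for self._auto_pg from the diff) might look like this sketch.

import torch.distributed as dist

def cleanup_process_groups(ranks_group, auto_pg):
    if ranks_group is not None:
        # Always release the per-update subgroup.
        dist.destroy_process_group(ranks_group)
    if auto_pg and dist.is_initialized():
        # Only when this code created the default group itself: destroying it
        # too releases the remaining NCCL memory (the 1306 MB vs 980 MB gap
        # measured above). When the user owns the default group, leave it alone.
        dist.destroy_process_group()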

self.init_process_group()

# if ranks is None or [], it will use a full broadcast to update all ranks
ranks_group = dist.new_group(ranks if ranks else None)
Collaborator

This will cause a compatibility problem. If the user does not use auto pg and initializes a process group only on the ranks in ranks (using the same logic as init_process_group_for_ranks), this will break.
Whether we should stay compatible with that situation may need discussion, though.

Collaborator

I would assume that if the user initializes the PG themselves, the ranks param should also correspond to that PG? In which case it should be OK?

Collaborator

@blahgeek blahgeek Dec 10, 2025

Hmm, maybe I was wrong.

Is there any documentation about which ranks should form a PG, which ranks should call update, and what ranks means in the non-_auto_pg case?

@specture724 specture724 self-assigned this Nov 17, 2025
@blahgeek blahgeek merged commit baf6f61 into MoonshotAI:main Dec 11, 2025
2 checks passed