
Conversation


@hzhou hzhou commented Nov 14, 2025

Pull Request Description

Collecting misc. preparation commits for the CSEL redesign.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

@hzhou hzhou force-pushed the 2511_coll_prep branch 2 times, most recently from c5e6c36 to ad46f16, on November 14, 2025 22:11
Do not hide the script. Move it to maint/ as the rest of the autogen
scripts.

ipc_p2p.h references MPIDI_POSIX_am_eager_limit, which is defined in
shm_am.h.

Currently it is probably pulled in by `shm_coll.h`, which we will remove
in the near future.
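
A minimal sketch of the fix being described (the exact include line is an assumption, not the code in this PR):

```c
/* in ipc_p2p.h: make the dependency explicit instead of relying on
 * shm_coll.h to pull it in transitively */
#include "shm_am.h"     /* defines MPIDI_POSIX_am_eager_limit */
```
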
The fallback collectives (e.g. MPIR_Bcast_fallback) are manual "auto"
functions that may not be the best algorithms for the system, but
are sufficient for internal usage during init and object
construction.

Not all collective types have fallbacks defined, since internally we use
a limited set of types. We'll define new fallback routines when we need
them in the future.

This prepares for the revamp of CSEL.
They are manual "auto" algorithms that select from a small set of
algorithms suitable for internal usage, e.g. during init,
finalize, and communicator construction.
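
A hedged sketch of the pattern (the signature follows MPICH's internal collective style, but the body is an assumption; MPIR_Bcast_intra_binomial stands in as one representative known-good algorithm):

```c
/* Sketch only: a fallback is a hand-written "auto" function that is not
 * subject to CSEL runtime selection, so it is safe to call during init
 * and object construction before selection is set up. */
int MPIR_Bcast_fallback(void *buffer, MPI_Aint count, MPI_Datatype datatype,
                        int root, MPIR_Comm * comm_ptr, int coll_attr)
{
    /* a single, always-available algorithm suffices for internal usage */
    return MPIR_Bcast_intra_binomial(buffer, count, datatype, root,
                                     comm_ptr, coll_attr);
}
```
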

hzhou commented Nov 21, 2025

test:mpich/ch3/most
test:mpich/ch4/most

@hzhou hzhou requested a review from raffenet November 24, 2025 15:51

@raffenet raffenet left a comment

Just a couple comments.

}

/* allgather is needed to exchange all the IPC handles */
/* FIXME: call MPIR_Coll_auto */
Contributor

why do we want MPIR_Coll_auto in this instance?

Contributor Author

No, we don't. The comment slipped through from my early draft. I'll remove it.

MPID_THREAD_CS_EXIT(VCI, MPIDI_VCI_LOCK(vci));
need_unlock = 0;
- mpi_errno = MPIR_Barrier(win->comm_ptr, MPIR_COLL_ATTR_SYNC);
+ mpi_errno = MPIR_Barrier_fallback(win->comm_ptr, MPIR_COLL_ATTR_SYNC);
Contributor

@raffenet raffenet Dec 9, 2025

I contend that this barrier is perf critical for RMA apps. It should be the case that we'll never reach it before selection is set, but I could be missing something.

Contributor Author

The design is to call the fallback interfaces when we don't want the call to be subject to runtime selection. It is not part of the design that they are performance-insufficient. We can and should optimize the fallback routines to the point that their performance is sufficient. A fallback routine is designed to be a manual-selection "auto" function. For now, most fallback routines just call a single algorithm because that is sufficient for the current usage. We can improve them in the future as necessary.

I think in this case, the performance of the dissemination algorithm is sufficient for the fence synchronization.
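
For illustration, a minimal sketch of what such a fallback amounts to today (the body is an assumption; MPIR_Barrier_intra_dissemination is an existing MPICH routine, used here as the single algorithm the comment refers to):

```c
/* Sketch: today's fallback just calls one known-good algorithm.
 * Manual branching can be added later without involving CSEL. */
int MPIR_Barrier_fallback(MPIR_Comm * comm_ptr, int coll_attr)
{
    return MPIR_Barrier_intra_dissemination(comm_ptr, coll_attr);
}
```
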

Contributor

I think this instance is unlike the rest in this PR. Fence is essentially barrier synchronization. If MPI_Barrier is subject to runtime selection, I don't see why MPI_Win_fence should not be. A vendor may optimize their system for barrier, IIRC BlueGene did this, and we should not prevent that from being used in RMA apps.

Contributor Author

I think the design purpose of collective tuning (e.g. tuning a Barrier algorithm) is to affect the user's MPI_Barrier calls. If it also affects the behavior of MPI_Win_fence, I would say that is an unintended side effect. I don't disagree that runtime tuning of MPI_Win_fence is a nice option, but adopting side-effect tuning is not a good design.

A vendor may optimize their system for barrier, IIRC BlueGene did this, and we should not prevent that from being used in RMA apps.

Certainly. This is inside a ch4 mpidig routine. A vendor can modify it or implement their own fence algorithm.

Now, if one vendor or system uses their specific Barrier algorithm and thus potentially affects another vendor's Win_fence algorithm, I think that would be problematic.

Your main argument is that we are preventing something. What exactly do you think we are preventing? Then let's discuss how to enable that specific goal properly.

Contributor Author

Also, the semantics of MPI_Win_fence is a memory fence, not a barrier. A barrier is an implementation side effect.

Contributor

Your main argument is that we are preventing something. What exactly do you think we are preventing? Then let's discuss how to enable that specific goal properly.

I want to enable runtime algorithm selection of the collective synchronization explicitly required by MPI_Win_fence.
