Overflow in memheap_base_mkey.c mca_memheap_modex_recv_all #13621

@rdfriese

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.9

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed via git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

git submodule status
+d30c15e65f3e6f154807424dd428698fe9fbad3a 3rd-party/openpmix (v5.0.10rc1-2-gd30c15e6)
+5ad79eb285023d1dcca472ccba9de5987b51cc27 3rd-party/prrte (psrvr-v2.0.0rc1-5048-g5ad79eb285)
+3064f7bd191b49a5a5554170ef7be4762246b5ee config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: Rocky Linux 8.5
  • Computer hardware: 2x AMD EPYC 7543, 256 GB RAM
  • Network type: Mellanox HDR-100 ConnectX-6 InfiniBand

Details of the problem

There seems to be an overflow in the following loop of mca_memheap_modex_recv_all() (memheap_base_mkey.c) when the number of segments differs between PEs:

for (j = 0; j < memheap_map->n_segments; j++) {
    map_segment_t *s;
    s = &memheap_map->mem_segs[j];
    if (NULL != s->mkeys_cache[i]) {

Sample debug output showing the differing segment counts (2 nodes, 8 processes per node):

...
[j018:67717] ../../../../oshmem/mca/memheap/base/memheap_base_mkey.c:587 - mca_memheap_modex_recv_all() local keys packed into 1190 bytes, 35 segments 
[j018:67718] ../../../../oshmem/mca/memheap/base/memheap_base_mkey.c:587 - mca_memheap_modex_recv_all() local keys packed into 1224 bytes, 36 segments
[j019:122058] ../../../../oshmem/mca/memheap/base/memheap_base_mkey.c:587 - mca_memheap_modex_recv_all() local keys packed into 1088 bytes, 32 segments
...

Using the example above, the problem occurs when [j018:67717] unpacks the mkeys sent by [j019:122058]. [j018:67717] iterates over its own 35 segments, but the message from [j019:122058] contains only 32, so the loop overruns the received data and the process crashes with a segmentation fault.

I'm not familiar enough with the code to know why all the mkeys need to be exchanged, or why PEs end up with different numbers of mapped segments. As a naive workaround, I hacked the loop to send only the first key, which on my system I believe corresponds to the dynamically created symmetric heap, and with that change my runs complete without errors. This doesn't seem like a proper, sustainable fix, hence a bug report rather than a pull request.

Thanks!
