@ldorau ldorau commented Dec 6, 2022

This patch requires the original SHARP header in /usr/include/mellanox/sharp.h

Signed-off-by: Lukasz Dorau <[email protected]>

Requires:



grom72 and others added 10 commits December 5, 2022 14:01
coll_cq implementation can be reused by other collective providers.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…initialization

It is the rxm provider's responsibility to initialize the collective offload provider's fabric.
Otherwise, collective offload functionality will not be available.

Signed-off-by: Tomasz Gromadzki <[email protected]>
…IDER

The FI_OFFLOAD_PROVIDER environment variable shall be set to the offload provider name
to instruct libfabric to set up and use that particular provider.

Signed-off-by: Tomasz Gromadzki <[email protected]>
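
A minimal sketch of how the FI_OFFLOAD_PROVIDER lookup described above could work, assuming a plain getenv() read. The helper names get_offload_provider_name() and offload_provider_requested() are hypothetical and are not taken from this patch; libfabric may well read the variable through its own parameter mechanism instead.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): return the offload provider
 * name requested via FI_OFFLOAD_PROVIDER, or NULL if it is unset or empty. */
static const char *get_offload_provider_name(void)
{
	const char *name = getenv("FI_OFFLOAD_PROVIDER");

	if (!name || name[0] == '\0')
		return NULL;
	return name;
}

/* Hypothetical helper: nonzero if prov_name is the requested offload provider. */
static int offload_provider_requested(const char *prov_name)
{
	const char *name = get_offload_provider_name();

	return name && strcmp(name, prov_name) == 0;
}
```

With the variable set to some provider name (the exact name string is an assumption here), offload_provider_requested() returns nonzero for that name only; when the variable is unset or empty, no offload provider is selected.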
The peer provider must create a peer_eq for the offload provider, so that the offload provider
can report events to the peer provider.

Signed-off-by: Tomasz Gromadzki <[email protected]>
An offload provider may execute collective operations via the util_coll provider.
It must call fi_join() to obtain the struct mc required for collective operations.
It can only call fi_join() on its peer provider (e.g. rxm). The FI_PEER flag is used
to inform the peer provider to call fi_join() for util_coll_ep.

Signed-off-by: Tomasz Gromadzki <[email protected]>
The offload_coll_mask value is calculated from the actual offload capabilities
confirmed by fi_query_collective().

Signed-off-by: Tomasz Gromadzki <[email protected]>
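
The mask calculation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the enum coll_op values, the query_fn callback type, and stub_query() are hypothetical stand-ins so the example compiles without libfabric. The only behavior borrowed from the real API is that a query returning 0 means the operation is supported, as with fi_query_collective().

```c
#include <stdint.h>

/* Hypothetical stand-ins for collective operation ids. */
enum coll_op {
	COLL_BARRIER,
	COLL_BROADCAST,
	COLL_ALLREDUCE,
	COLL_ALLGATHER,
	COLL_OP_MAX
};

/* Hypothetical query callback: returns 0 if the operation can be offloaded,
 * a negative value otherwise (mirroring the fi_query_collective() convention). */
typedef int (*query_fn)(enum coll_op op);

/* Set one bit per collective operation that the offload provider confirms. */
static uint64_t build_offload_coll_mask(query_fn query)
{
	uint64_t mask = 0;
	int op;

	for (op = 0; op < COLL_OP_MAX; op++) {
		if (query((enum coll_op) op) == 0)
			mask |= (uint64_t) 1 << op;
	}
	return mask;
}

/* Example stub: pretend only barrier and allreduce can be offloaded. */
static int stub_query(enum coll_op op)
{
	return (op == COLL_BARRIER || op == COLL_ALLREDUCE) ? 0 : -1;
}
```

Deriving the mask from query results rather than hard-coding it means an operation is only routed to the offload provider when that provider has actually confirmed support for it.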
This patch requires the original SHARP header in
/usr/include/mellanox/sharp.h

Signed-off-by: Lukasz Dorau <[email protected]>
@ldorau ldorau marked this pull request as draft December 6, 2022 20:53
grom72 pushed a commit that referenced this pull request Mar 24, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
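
The fix described in this commit message can be modeled with a small sketch. The struct layout and the names rx_cntr_inc(), report_cntr_success(), and recv_saved() are simplified, hypothetical stand-ins and do not reproduce the actual xnet data structures; the sketch only shows the idea of copying the counter increment callback from the posted receive to the saved entry before completing it, so the assert on a NULL function pointer no longer fires.

```c
#include <assert.h>
#include <stddef.h>

struct xfer_entry;
typedef void (*cntr_inc_fn)(struct xfer_entry *entry);

/* Simplified, hypothetical model of a transfer entry. */
struct xfer_entry {
	cntr_inc_fn cntr_inc;	/* counter increment callback, may be NULL */
	unsigned long *cntr;	/* counter updated by the callback */
};

/* Increment the rx counter attached to the entry. */
static void rx_cntr_inc(struct xfer_entry *entry)
{
	(*entry->cntr)++;
}

/* Models the assert seen in the backtrace: the callback must be set. */
static void report_cntr_success(struct xfer_entry *entry)
{
	assert(entry->cntr_inc);
	entry->cntr_inc(entry);
}

/* When a posted receive matches a saved receive, take the callback (and
 * counter) from the posted receive before completing the saved entry. */
static void recv_saved(struct xfer_entry *saved, const struct xfer_entry *posted)
{
	saved->cntr_inc = posted->cntr_inc;
	saved->cntr = posted->cntr;
	report_cntr_success(saved);
}
```

Without the assignment in recv_saved(), a saved entry whose callback was never set would reach report_cntr_success() with a NULL pointer and abort, which matches the failure mode shown in the backtrace.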