-
Notifications
You must be signed in to change notification settings - Fork 1
prov/sharp scaffolding #8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 24 files at r1, all commit messages.
Reviewable status: 4 of 24 files reviewed, 2 unresolved discussions (waiting on @grom72)
Makefile.am line 6 at r1 (raw file):
# Copyright (c) 2018 Amazon.com, Inc. or its affiliates. All rights reserved. # (C) Copyright 2020 Hewlett Packard Enterprise Development LP # Copyright (c) 2022 Intel Corporation. All right reserved.
There is already Intel Copyright 3 lines above.
Code quote:
# Copyright (c) 2022 Intel Corporation. All right reserved.README.md line 219 at r1 (raw file):
The `off_sharp` provider is an utility provider that supports collective endpoints utilizing SHARP protocol for barier and allreduce operations.
... for the barier and the allreduce ...
Code quote:
for barier and allreduce
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 24 files at r1.
Reviewable status: 6 of 24 files reviewed, 3 unresolved discussions (waiting on @grom72)
include/ofi_sharp.h line 70 at r1 (raw file):
#endif #endif /* _OFI_SHM_H_ */
_OFI_SHARP_H_
Code quote:
_OFI_SHM_H_fa83a1d to
0bb30c8
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)
README.md line 219 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
... for the barier and the allreduce ...
Done.
include/ofi_sharp.h line 70 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
_OFI_SHARP_H_
Done.
0bb30c8 to
6f2c738
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 3 of 24 files reviewed, 3 unresolved discussions (waiting on @ldorau)
Makefile.am line 6 at r1 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
There is already Intel Copyright 3 lines above.
Done.
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 6 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)
Makefile.am line 3 at r3 (raw file):
# # Copyright (c) 2016 Cisco Systems, Inc. All rights reserved. # Copyright (c) 2017-2018, 2022 Intel Corporation, Inc. All right reserved.
I think it should be 2017-2022
Code quote:
2017-2018, 2022
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 24 files at r1.
Reviewable status: 10 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)
man/fi_sharp.7.md line 15 at r3 (raw file):
The SHARP provider is a collectives offload provider that can be used on Linux systems supporting SHARP protocol.
the SHARP protocol
Code quote:
SHARP protocolman/fi_sharp.7.md line 19 at r3 (raw file):
# SUPPORTED FEATURES This release contains an initial implementation of the SHM provider that
SHARP
Code quote:
SHMman/fi_sharp.7.md line 23 at r3 (raw file):
*Endpoint types* : The provider supports only endpoint type *FI_EP_COLLECTIVE*.
only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.
Code quote:
only endpoint type *FI_EP_COLLECTIVE*.man/fi_sharp.7.md line 26 at r3 (raw file):
*Endpoint capabilities* : Endpoints cna support only fi_barrier and fi_allreduce operations.
can
Code quote:
cnaman/fi_sharp.7.md line 41 at r3 (raw file):
*MR registration mode* The provider implements FI_MR_VIRT_ADDR memory mode.
the FI_MR_VIRT_ADDR memory mode
Code quote:
FI_MR_VIRT_ADDR memory modeman/fi_sharp.7.md line 50 at r3 (raw file):
The SHARP provider has hard-coded maximums for supported queue sizes and data transfers. These values are reflected in the related fabric attribute structures
structures. - dot at the end
Code quote:
structuresman/fi_sharp.7.md line 56 at r3 (raw file):
# RUNTIME PARAMETERS The *SHARP* provider checks for the following environment variables:
Please be consistent: The *SHARP* provider or The SHARP provider - the same in all places
Code quote:
The *SHARP* provider
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 24 files at r1.
Reviewable status: 11 of 24 files reviewed, 8 unresolved discussions (waiting on @grom72)
67f4224 to
be4aeab
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 9 of 24 files reviewed, 8 unresolved discussions (waiting on @ldorau)
man/fi_sharp.7.md line 15 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
the SHARP protocol
Done.
Code quote:
SHARP protocol.man/fi_sharp.7.md line 19 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
SHARP
Done.
man/fi_sharp.7.md line 23 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
only one endpoint type: *FI_EP_COLLECTIVE*.
or
only the *FI_EP_COLLECTIVE* endpoint type.
Done.
man/fi_sharp.7.md line 26 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
can
Done.
man/fi_sharp.7.md line 41 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
the FI_MR_VIRT_ADDR memory mode
Done.
man/fi_sharp.7.md line 50 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
structures.- dot at the end
Done.
man/fi_sharp.7.md line 56 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
Please be consistent:
The *SHARP* providerorThe SHARP provider- the same in all places
Done.
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 2 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: 15 of 24 files reviewed, 1 unresolved discussion (waiting on @grom72)
be4aeab to
c9ba8f8
Compare
d168f84 to
ac48ad7
Compare
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 34 files reviewed, 2 unresolved discussions (waiting on @grom72)
prov/sharp/Makefile.include line 11 at r8 (raw file):
prov/sharp/src/sharp_eq.c \ prov/sharp/src/sharp_ep.c \ prov/sharp/src/sharp_cq.c \
Fix alignment, please (screenshot from vim):
Code quote:
_sharp_files = \
include/ofi_sharp.h \
prov/sharp/src/sharp.h \
prov/sharp/src/sharp_attr.c \
prov/sharp/src/sharp_init.c \
prov/sharp/src/sharp_fabric.c \
prov/sharp/src/sharp_domain.c \
prov/sharp/src/sharp_eq.c \
prov/sharp/src/sharp_ep.c \
prov/sharp/src/sharp_cq.c \
prov/sharp/src/sharp_coll.c \
prov/sharp/src/sharp_progress.c
ac48ad7 to
2368690
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 12 of 34 files reviewed, 2 unresolved discussions (waiting on @ldorau)
Makefile.am line 3 at r3 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
I think it should be
2017-2022
Done.
prov/sharp/Makefile.include line 11 at r8 (raw file):
Done.
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r8, 2 of 3 files at r9.
Reviewable status: 14 of 34 files reviewed, all discussions resolved
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 13 of 35 files reviewed, 8 unresolved discussions (waiting on @grom72)
prov/sharp/src/sharp_cq.c line 87 at r10 (raw file):
if (!attr || !(attr->flags & FI_PEER)) { FI_WARN(provider, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
FI_EINVAL
Code quote:
EINVALprov/sharp/src/sharp_cq.c line 92 at r10 (raw file):
if (!peer_context || peer_context->size < sizeof(*peer_context)) { FI_WARN(provider, FI_LOG_CORE, "invalid peer CQ context\n"); return -EINVAL;
.
prov/sharp/src/sharp_domain.c line 169 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
.
prov/sharp/src/sharp_domain.c line 175 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "Invalid peer domain context\n"); return -EINVAL;
.
prov/sharp/src/sharp_ep.c line 188 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER_TRANSFER mode required\n"); return -EINVAL;
.
prov/sharp/src/sharp_ep.c line 194 at r10 (raw file):
FI_WARN(&sharp_prov, FI_LOG_CORE, "Invalid peer transfer context\n"); return -EINVAL;
.
prov/sharp/src/sharp_eq.c line 76 at r10 (raw file):
if (!attr || !(attr->flags & FI_PEER)) { FI_WARN(&sharp_prov, FI_LOG_CORE, "FI_PEER flag required\n"); return -EINVAL;
.
prov/sharp/src/sharp_eq.c line 81 at r10 (raw file):
if (!peer_context || peer_context->size < sizeof(*peer_context)) { FI_WARN(&sharp_prov, FI_LOG_CORE, "invalid peer EQ context\n"); return -EINVAL;
.
a18df51 to
eca015c
Compare
6756031 to
9e6a2ca
Compare
9e6a2ca to
b86b6cb
Compare
5af5f80 to
429572a
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 2 of 2 files at r4, 1 of 3 files at r9, 1 of 5 files at r12.
Reviewable status: 12 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)
prov/sharp/src/sharp_cq.c line 87 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
FI_EINVAL
Done.
prov/sharp/src/sharp_cq.c line 92 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_domain.c line 169 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_domain.c line 175 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_ep.c line 188 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_ep.c line 194 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_eq.c line 76 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
prov/sharp/src/sharp_eq.c line 81 at r10 (raw file):
Previously, ldorau (Lukasz Dorau) wrote…
.
Done.
0df5f87 to
5d885a3
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 14 files at r11.
Reviewable status: 11 of 37 files reviewed, 8 unresolved discussions (waiting on @ldorau)
ldorau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 14 files at r11.
Reviewable status: 14 of 37 files reviewed, all discussions resolved (waiting on @grom72)
5d885a3 to
f9fd745
Compare
f9fd745 to
cd0a9b4
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 11 files at r7, 1 of 14 files at r11, 2 of 7 files at r13, 2 of 7 files at r15, 3 of 4 files at r16.
Reviewable status: 22 of 37 files reviewed, all discussions resolved (waiting on @ldorau)
TODO.md files defines next development steps. The list is ordered by priority. Signed-off-by: Tomasz Gromadzki <[email protected]>
…r is available * FOR TEST PURPOSE ONLY - TEMPORARY SOLUTION * To let all collective tests pass let's force rxm to provide only offload collective domain capabilities. THIS IS ONLY FOR TEST PPURPOSE. In the future, tests for offload provider shall be executed with OFI_OFFLOAD_PROV_ONLY flag set in fi_query_collective() call. Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Both collective operation implemented ba colling peer collective operation and transparently passing completion back to peer CQ Signed-off-by: Tomasz Gromadzki <[email protected]>
cd0a9b4 to
27fd202
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 24 files at r1, 1 of 3 files at r2, 1 of 2 files at r3, 1 of 8 files at r6, 6 of 11 files at r7, 4 of 14 files at r11, 1 of 5 files at r12, 1 of 7 files at r15, 5 of 11 files at r17, 6 of 6 files at r18, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @grom72)
fi_join_collective implementation with usage of coll_op's state machine Signed-off-by: Tomasz Gromadzki <[email protected]>
Signed-off-by: Tomasz Gromadzki <[email protected]>
Add mocks for sharp_coll_init and sharp_coll_finalize. Signed-off-by: Lukasz Dorau <[email protected]>
In order to mock SHARP functions run: $ ./configure --enable-sharp=mocked or $ ./configure --enable-sharp=auto Running: $ ./configure --enable-sharp=yes make use of SHARP libraries instead of mocks. Signed-off-by: Lukasz Dorau <[email protected]>
27fd202 to
7770931
Compare
grom72
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r19, 5 of 5 files at r20, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @grom72)
If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive. This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer. Program received signal SIGABRT, Aborted. 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #0 0x0000155552d4d37f in raise () from /lib64/libc.so.6 #1 0x0000155552d37db5 in abort () from /lib64/libc.so.6 #2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6 #3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6 #4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347 #5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354 #6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153 #7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188 #8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445 #9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558 ofiwg#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91 ofiwg#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212 Signed-off-by: Sean Hefty <[email protected]>

Prov sharp scaffolding
This change is