@alpha-baby

Reproduction

Machine information

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     NODE    48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE    PIX     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE    NODE    48-95,144-191   1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
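
For readers who want the GPU-to-NIC locality at a glance, the matrix above can be summarized with a small script. This is only a sketch: the `GPU_TO_NIC` table is copied by hand from the `nvidia-smi topo -m` output above, the nearest-to-farthest ordering in `LINK_RANK` follows the legend, and the helper names (`nearest_nics`, `summarize`) are hypothetical.

```python
# Link types from the legend, ordered nearest (0) to farthest (4).
LINK_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}

# GPU-to-NIC columns (NIC0..NIC3), copied from the topology matrix above.
GPU_TO_NIC = {
    "GPU0": ["NODE", "NODE", "SYS", "SYS"],
    "GPU1": ["PIX", "NODE", "SYS", "SYS"],
    "GPU2": ["NODE", "NODE", "SYS", "SYS"],
    "GPU3": ["NODE", "PIX", "SYS", "SYS"],
    "GPU4": ["SYS", "SYS", "PIX", "NODE"],
    "GPU5": ["SYS", "SYS", "NODE", "NODE"],
    "GPU6": ["SYS", "SYS", "NODE", "PIX"],
    "GPU7": ["SYS", "SYS", "NODE", "NODE"],
}


def nearest_nics(links):
    """Return the NIC names that share the closest link type for one GPU."""
    best = min(LINK_RANK[link] for link in links)
    return [f"NIC{i}" for i, link in enumerate(links) if LINK_RANK[link] == best]


def summarize(table):
    """Map every GPU to its list of nearest NICs."""
    return {gpu: nearest_nics(links) for gpu, links in table.items()}


if __name__ == "__main__":
    for gpu, nics in summarize(GPU_TO_NIC).items():
        print(gpu, nics)
```

On this machine only GPU1, GPU3, GPU4 and GPU6 have a PIX path to a NIC; the other four GPUs reach both NICs on their NUMA node through a NODE path.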

NIC info

hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0024:7630
        sys_image_guid:                 e09d:7303:0024:7630
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_1
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0024:79a0
        sys_image_guid:                 e09d:7303:0024:79a0
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_2
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0027:1b08
        sys_image_guid:                 e09d:7303:0027:1b08
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_3
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3920
        node_guid:                      e09d:7303:0024:6fbe
        sys_image_guid:                 e09d:7303:0024:6fbe
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

env config:

NVSHMEM_ENABLE_NIC_PE_MAPPING=1
NVSHMEM_DEBUG_SUBSYS=INIT
NVSHMEM_IB_GID_INDEX=3
NVSHMEM_IB_SL=5
NVSHMEM_DEBUG=INFO
NVSHMEM_HCA_PE_MAPPING=mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2
NVSHMEM_IB_TRAFFIC_CLASS=16
NCCL_SOCKET_IFNAME=bond0
NCCL_NET_PLUGIN=
NCCL_IB_TIMEOUT=22
NCCL_IB_GID_INDEX=3
NCCL_SET_THREAD_NAME=1
NCCL_DEBUG_SUBSYS=INIT,TUNING,GRAPH
NCCL_IB_SL=5
NCCL_IB_TC=136
NCCL_IB_HCA=mlx5_bond
NCCL_IB_RETRY_CNT=7
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_DEBUG=INFO
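
As a sanity check on the `NVSHMEM_HCA_PE_MAPPING` value above, the sketch below parses it under the assumption that each entry has the form `hca:port:count`, where `count` is the number of PEs served by that HCA port. The in-order assignment of local PEs in `assign_pes` is an illustrative assumption, not NVSHMEM's confirmed behavior, and both function names are hypothetical.

```python
MAPPING = "mlx5_bond_0:1:2,mlx5_bond_1:1:2,mlx5_bond_2:1:2,mlx5_bond_3:1:2"


def parse_hca_pe_mapping(value):
    """Split the mapping string into (hca, port, count) tuples.

    Assumes the `hca:port:count` entry format, comma separated.
    """
    entries = []
    for item in value.split(","):
        hca, port, count = item.strip().split(":")
        entries.append((hca, int(port), int(count)))
    return entries


def assign_pes(entries):
    """Hypothetical assignment: hand out local PE indices in list order."""
    assignment = {}
    pe = 0
    for hca, port, count in entries:
        for _ in range(count):
            assignment[pe] = (hca, port)
            pe += 1
    return assignment


if __name__ == "__main__":
    entries = parse_hca_pe_mapping(MAPPING)
    print("total PEs covered:", sum(count for _, _, count in entries))
    for pe, (hca, port) in assign_pes(entries).items():
        print(f"local PE {pe} -> {hca} port {port}")
```

With this reading, the four entries cover 8 PEs in total, which matches the 8 GPUs on the node.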

DeepEP run log file:
deep_ep_test.log
