@paklui paklui commented Dec 4, 2025

Motivation

Add oob_tcp_if_include for OMPI and define NCCL_SOCKET_IFNAME for RCCL so that multi-node runs use the intended network interface.

Technical Details

I tried to run multi-node RCCL on a couple of MI350 systems earlier and ran into an issue: the default socket interface for RCCL and OMPI does not work.
Since oob_port is already used for btl_tcp_if_include, could we also use it for oob_tcp_if_include (OMPI) and NCCL_SOCKET_IFNAME (RCCL)? That lets me work around this issue.
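As a rough illustration of the idea (not the actual diff in this PR), the sketch below shows how one interface name could be reused for all three settings. The helper name `build_mpirun_network_args` and the example interface `eth0` are hypothetical; `--mca btl_tcp_if_include`, `--mca oob_tcp_if_include`, `-x`, and `NCCL_SOCKET_IFNAME` are the existing Open MPI and NCCL/RCCL knobs named above.

```python
from __future__ import annotations

import shlex


def build_mpirun_network_args(oob_port: str) -> list[str]:
    """Return mpirun arguments that pin OMPI and RCCL sockets to one interface.

    `oob_port` is assumed to hold a network interface name (e.g. "eth0"),
    matching how it is already used for btl_tcp_if_include.
    """
    if not oob_port:
        raise ValueError("oob_port must name a network interface")
    return [
        # OMPI TCP byte-transfer layer (already set before this change).
        "--mca", "btl_tcp_if_include", oob_port,
        # OMPI out-of-band (runtime wire-up) channel.
        "--mca", "oob_tcp_if_include", oob_port,
        # Export the RCCL/NCCL bootstrap socket interface to every rank.
        "-x", f"NCCL_SOCKET_IFNAME={oob_port}",
    ]


if __name__ == "__main__":
    # Hypothetical interface name, for illustration only.
    print(shlex.join(["mpirun", *build_mpirun_network_args("eth0")]))
```

Keeping the three settings in one helper means the interface is chosen in a single place, so the OMPI and RCCL socket selections cannot drift apart.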

Without the socket interface set, I get the error below.
The test hangs for a while and then times out with the following:

node-h30-08:395097:395195 [6] NCCL INFO RAS client listening socket at ::1<28028>
[2025-12-04 19:41:31] node-h28-08:1921104:1921160 [4] /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/out/rhel-9.5/9.5/build/rccl/hipify/src/misc/socket.cc:589 NCCL WARN socketPollConnect poll() returned 1, no POLLOUT events
node-h28-08:1921104:1921160 [4] NCCL INFO /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/out/rhel-9.5/9.5/build/rccl/hipify/src/misc/socket.cc:641 -> 2
node-h28-08:1921104:1921160 [4] NCCL INFO /longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/out/rhel-9

Test Plan

With this change, I am able to run all combinations of the RCCL multinode tests using:

source ./test_venv/bin/activate
pytest -vvv --log-file=/tmp/test.log -s ./tests/rccl/rccl_multinode_cvs.py --cluster_file input/cluster_file/cluster.json  --config_file input/config_file/rccl/rccl_config.json --html=./rccl.html --capture=tee-sys --self-contained-html

Test Result

The RCCL multinode tests pass: 47 of 48 pass using the latest git.
The one failure, test_disable_firewall, is a separate issue: ufw does not work on CentOS 9 (the default OS for GT 1.5 systems).

---------------------------------------- Generated html report: file:///apps/paklui/cv350/cvs/rccl.html ----------------------------------------
=============================================================== short test summary info ================================================================
FAILED tests/rccl/rccl_multinode_cvs.py::test_disable_firewall - Failed: Following FAILURES seen - ['Service ufw not disabled properly on node node-h30-08', 'Service ufw not disabled properly on node node-h28-08']
====================================================== 1 failed, 47 passed in 1219.50s (0:20:19) =======================================================
