rdma/ib: 1 (the same resource is used for both InfiniBand and RoCE Nodes). Set the NCCL and UCX environment variables that CoreWeave documents: NCCL_SOCKET_IFNAME=eth0, NCCL_IB_HCA=ibp, UCX_TLS=tcp, and UCX_NET_DEVICES=eth0. NCCL_IB_HCA points NCCL at the InfiniBand HCAs so that it does not fall back to TCP, and NCCL_SOCKET_IFNAME keeps NCCL's socket traffic on eth0. The two UCX variables restrict UCX to TCP over eth0. Schedule the Pods onto Nodes with InfiniBand support, then run ibstat inside the Pod and confirm that each interface reports State: Active and Physical state: LinkUp.
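The checks above can be scripted. The following is a minimal sketch, not a CoreWeave tool: it assumes Python and ibstat are available in the container image, and the helper names (check_env, parse_ibstat, unhealthy_ports) are hypothetical. It compares the environment against the four values listed above and scans ibstat output for ports that are not Active and LinkUp.

```python
import os
import re
import shutil
import subprocess

# Values quoted in the text above.
EXPECTED_ENV = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_HCA": "ibp",
    "UCX_TLS": "tcp",
    "UCX_NET_DEVICES": "eth0",
}


def check_env(environ=None):
    """Return {name: (expected, actual)} for each variable that differs."""
    environ = os.environ if environ is None else environ
    return {
        name: (expected, environ.get(name))
        for name, expected in EXPECTED_ENV.items()
        if environ.get(name) != expected
    }


def parse_ibstat(text):
    """Return (ca, port, state, physical_state) tuples from ibstat output."""
    ports = []
    ca = port = state = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:  # start of a new adapter section
            ca, port, state = m.group(1), None, None
            continue
        m = re.match(r"Port (\d+):", line)
        if m:  # start of a new port section
            port, state = int(m.group(1)), None
            continue
        if line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:") and port is not None:
            ports.append((ca, port, state, line.split(":", 1)[1].strip()))
    return ports


def unhealthy_ports(ports):
    """Keep only ports that are not both Active and LinkUp."""
    return [p for p in ports if p[2] != "Active" or p[3] != "LinkUp"]


if __name__ == "__main__":
    for name, (expected, actual) in check_env().items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
    if shutil.which("ibstat") is None:
        print("ibstat not found in this container image")
    else:
        result = subprocess.run(["ibstat"], capture_output=True, text=True)
        for ca, port, state, phys in unhealthy_ports(parse_ibstat(result.stdout)):
            print(f"{ca} port {port}: State={state}, Physical state={phys}")
```

Run inside the Pod, the script prints nothing when every variable matches and every port is healthy. The parser relies only on the CA, Port, State, and Physical state lines of ibstat output, so treat it as a starting point if your ibstat version formats its output differently.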
For full details and a complete Pod YAML, see Use GPUDirect RDMA with InfiniBand. If performance is poor after configuration, see Why is my multi-node NCCL training slow?.