Why is my multi-node NCCL training slow?

When the job runs but throughput is far below expectations, that is usually NCCL falling back from InfiniBand to TCP at startup. Job failures from mid-run port drops (QP errors and NCCL timeouts) are a separate problem. On rack-scale, NVLink-connected systems, Pod placement relative to the NVLink domain is also a common cause, covered in NVLink domain placement on rack-scale systems below.

Quantify the slowdown. “Far below expectations” depends on your GPUs and interconnect, so compare against a measured baseline rather than a feeling. Run nccl-tests all_reduce on a small, known-good group of Nodes and record the bus bandwidth (busbw) it reports. That is your healthy baseline for the same message sizes: a job running correctly over InfiniBand lands close to it, while a job that has fallen back to TCP usually reports busbw an order of magnitude lower. If your cluster has GPU straggler detection enabled, the Slurm Job Metrics dashboard reports algorithmic bandwidth (AlgoBW) and bus bandwidth (BusBW) continuously from live jobs, so you can compare against the baseline without stopping the job to run nccl-tests.

When InfiniBand is the cause, the slowness is usually one of these: InfiniBand is not actually being used, the InfiniBand interfaces are down, or NCCL is missing the environment variables that point it at the IB HCA.

Quick diagnostic. From inside your training Pod, run ibstat. Each port should show State: Active and Physical state: LinkUp. If any interfaces show Down or Disabled, it is an infrastructure issue. Contact support with the Node names and ibstat output.

Common causes:

Pod spec is missing the rdma/ib: 1 resource request. Without it, the Pod can be scheduled onto a Node without InfiniBand access, and NCCL falls back to TCP.
NCCL environment variables are missing or incorrect. CoreWeave’s documented values are NCCL_SOCKET_IFNAME=eth0, NCCL_IB_HCA=ibp, UCX_TLS=tcp, UCX_NET_DEVICES=eth0.
Node Pool does not have InfiniBand. See InfiniBand and RoCE labels for the labels to look for.
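
The quick diagnostic and the environment-variable cause above can be checked together from inside the Pod. The following is a minimal Python sketch, not an official tool. The function names are hypothetical, the expected values are the ones listed above, and the parser assumes the usual ibstat text layout (a CA '<name>' line followed by Port N: blocks that contain State: and Physical state: lines). Adjust the expected values if your cluster intentionally uses different settings.

```python
import re

# Values documented in the list above.
EXPECTED_NCCL_ENV = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_HCA": "ibp",
    "UCX_TLS": "tcp",
    "UCX_NET_DEVICES": "eth0",
}


def nccl_env_problems(env):
    """Return messages for variables that are unset or differ from EXPECTED_NCCL_ENV."""
    problems = []
    for name, expected in EXPECTED_NCCL_ENV.items():
        actual = env.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected})")
        elif actual != expected:
            problems.append(f"{name}={actual} (expected {expected})")
    return problems


def ib_port_problems(ibstat_text):
    """Return (ca, port, state, physical_state) for ports that are not Active/LinkUp."""
    ports = []
    ca = None
    current = None
    for line in ibstat_text.splitlines():
        stripped = line.strip()
        m = re.match(r"CA '(.+)'", stripped)
        if m:
            # A new adapter starts; following lines belong to it, not the last port.
            ca = m.group(1)
            current = None
            continue
        m = re.match(r"Port (\d+):", stripped)
        if m:
            current = {"ca": ca, "port": int(m.group(1)), "state": None, "phys": None}
            ports.append(current)
            continue
        if current is None:
            continue
        if stripped.startswith("State:"):
            current["state"] = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Physical state:"):
            current["phys"] = stripped.split(":", 1)[1].strip()
    return [
        (p["ca"], p["port"], p["state"], p["phys"])
        for p in ports
        if p["state"] != "Active" or p["phys"] != "LinkUp"
    ]
```

To use it in a Pod, pass os.environ to nccl_env_problems and the captured output of ibstat (for example from subprocess.run(["ibstat"], capture_output=True, text=True).stdout) to ib_port_problems. An empty list from both functions means neither check found a problem; it does not prove that NCCL selected InfiniBand.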

For the complete Pod YAML and NCCL configuration, see Use GPUDirect RDMA with InfiniBand. To inspect which transport NCCL chose, set NCCL_DEBUG=INFO and look for NET/IB (good) compared to NET/Socket (TCP fallback).
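
On a large job the NCCL_DEBUG=INFO output is long, so it can help to summarize it rather than read it by eye. The sketch below is a hedged example, assuming the log has been saved to text: the function name is hypothetical, and the exact wording of NCCL log lines varies between NCCL versions. It relies only on the NET/IB and NET/Socket markers named above, plus one assumption: a NET/IB line that reports "No device found" is treated as evidence that InfiniBand is not in use.

```python
def summarize_nccl_transport(log_text):
    """Count NCCL_DEBUG=INFO lines per network transport and guess which one is in use."""
    ib_lines = []
    socket_lines = []
    for line in log_text.splitlines():
        if "NET/IB" in line:
            ib_lines.append(line)
        elif "NET/Socket" in line:
            socket_lines.append(line)

    # Assumption: an IB line that reports no device does not show IB being used.
    ib_in_use = [line for line in ib_lines if "No device found" not in line]

    if ib_in_use:
        verdict = "ib"
    elif socket_lines:
        verdict = "socket"
    else:
        verdict = "unknown"
    return {
        "verdict": verdict,
        "ib_lines": len(ib_lines),
        "socket_lines": len(socket_lines),
    }
```

A verdict of "socket" points at the TCP fallback described above. A verdict of "ib" on a combined log only shows that at least one rank reported InfiniBand, so check per-rank logs if some ranks may be misconfigured.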

NVLink domain placement on rack-scale systems

On rack-scale, NVLink-connected instances such as NVL72, GPUs communicate over NVLink within an NVLink domain and over the scaleout fabric (InfiniBand or RoCE) beyond it. NCCL is fastest when collectives stay inside the NVLink domain, so a job whose Pods are spread across domains can route traffic that should run over NVLink onto the slower scaleout fabric instead. When this happens, throughput drops even though InfiniBand is healthy and every cluster-wide check passes. Align scheduling with physical connectivity so related Pods land in the same NVLink domain. Use the nvidia.com/gpu.clique label as a Pod affinity topologyKey, and see IMEX overview for how NVLink domains, partitions, and placement work on CoreWeave.
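
As an illustration of the affinity rule described above, the following Python sketch builds the affinity section of a Pod spec as a plain dictionary and prints it as JSON, which Kubernetes accepts for manifests. It is a sketch under assumptions, not CoreWeave's reference manifest: the function name and the job label (training-job: my-job) are hypothetical, and using a required rule rather than a preferred one is a choice made here for illustration. Only the nvidia.com/gpu.clique topology key comes from the text above.

```python
import json


def clique_pod_affinity(label_key, label_value):
    """Build a Pod affinity block that co-locates matching Pods in one NVLink domain."""
    return {
        "podAffinity": {
            # Assumption: a hard requirement. A preferred rule would let the
            # scheduler spread Pods across domains when one domain is full.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Hypothetical label shared by every Pod of the same job.
                    "labelSelector": {"matchLabels": {label_key: label_value}},
                    "topologyKey": "nvidia.com/gpu.clique",
                }
            ]
        }
    }


affinity = clique_pod_affinity("training-job", "my-job")
print(json.dumps({"affinity": affinity}, indent=2))
```

The printed affinity object belongs under spec in the Pod template. With a required rule, Pods stay Pending if no single NVLink domain can hold the whole job, which is usually easier to notice than a silent slowdown.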

Isolating a bad Node on large jobs

If you are training across hundreds or thousands of GPUs and the cluster-wide checks above pass, the cause may be a small group of Nodes with degraded HCAs rather than a configuration issue. Split your nodelist in half, run nccl-tests on each half, and keep bisecting the half that performs worse until you identify the rack or Nodes contributing to the slowdown. Then contact support with the Node names.
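
The bisection above can be sketched in Python as follows. This is a minimal illustration, not a supported tool: bisect_slow_nodes and the measure_busbw callback are hypothetical names, and the callback is assumed to run nccl-tests on the given Nodes and return the measured bus bandwidth as a number. The loop stops once a further split would leave fewer than min_group Nodes in a half, since a multi-node test needs at least two Nodes.

```python
def bisect_slow_nodes(nodes, measure_busbw, min_group=2):
    """Repeatedly halve `nodes`, keeping the half with the lower bus bandwidth.

    Returns the final suspect group and the history of measurements.
    """
    group = list(nodes)
    history = []
    # Only split while both halves can still hold at least `min_group` Nodes.
    while len(group) >= 2 * min_group:
        mid = len(group) // 2
        left, right = group[:mid], group[mid:]
        left_bw = measure_busbw(left)
        right_bw = measure_busbw(right)
        history.append((left, left_bw, right, right_bw))
        # Keep the worse half; on a tie the right half is kept.
        group = left if left_bw < right_bw else right
    return group, history
```

Review the returned history before contacting support. If both halves report similar bandwidth at some step, the degradation may span both halves, or may not be caused by individual Nodes at all, and the final group is not a reliable suspect list.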
