Use GPUDirect RDMA with InfiniBand

In this guide, learn how to use GPUDirect Remote Direct Memory Access (RDMA) with InfiniBand at CoreWeave, and how to test it with the NVIDIA Collective Communications Library (NCCL). GPUDirect RDMA enables direct memory access between GPUs across Nodes over InfiniBand, which reduces latency and frees CPU cycles for distributed training and other multi-Node GPU workloads. This guide is for users running multi-Node GPU workloads on CKS or Slurm who need high-bandwidth, low-latency communication between GPUs.

Prerequisites

CoreWeave supports GPUDirect RDMA over InfiniBand for some GPU instance types. To use this feature, you must:

- Select a Node Pool with InfiniBand support.
- Install NCCL and the OpenFabrics Enterprise Distribution (OFED) user-space libraries in the Pod image.
- Configure the Pods to use GPUDirect RDMA.

Select a Node Pool with InfiniBand support

To use GPUDirect RDMA, make sure the Node Pool has Nodes with InfiniBand, as shown in our list of GPU instance types. All Nodes with InfiniBand have the required kernel drivers pre-installed, and CKS manages all the required driver and operator dependencies. To avoid Node instability, don’t install other driver management tools. Once you have a Node Pool that supports InfiniBand, continue to the next section, which covers the Pod-level configuration that opts workloads into GPUDirect RDMA.

Configure the Pods

Configure your Pods to use GPUDirect RDMA over InfiniBand by following these steps:

1. Set the value of spec.containers.resources.requests.rdma/ib to 1. This value doesn’t indicate the number of InfiniBand devices requested. It works as a boolean that schedules Pods onto servers with InfiniBand support. Kubernetes schedules resources through requests and limits, and when you specify only limits, it sets requests to the same amount as the limit. For more information, see Resource management for Pods and containers in the Kubernetes documentation. For a full YAML example that sets rdma/ib in the Pod spec for both requests and limits, see Kubernetes example.
2. Set these environment variables:
   - NCCL_SOCKET_IFNAME: The IP network interface that NCCL uses for socket-based communication. The examples in this guide set it to eth0.
   - NCCL_IB_HCA: The InfiniBand host channel adapter (HCA) to use for NCCL communication. The examples in this guide set it to ibp, which NCCL matches as a device-name prefix.
   - UCX_NET_DEVICES: The network devices to use for Unified Communication X (UCX) communication. The examples in this guide set it to eth0, together with UCX_TLS set to tcp.

   For Kubernetes and Slurm examples, see Kubernetes example and Slurm example.
3. Optional: Enable extended logging. Set the NCCL_DEBUG environment variable to INFO to make NCCL log extra debug information. This helps diagnose issues with RDMA support, but it increases the log file size, so disable it when testing is complete. See NCCL_DEBUG in the NCCL documentation for more logging options.
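As a quick sanity check before launching a job, the variables above can be inspected from a shell inside the Pod or Slurm job. This is a minimal sketch, not a CoreWeave tool: the function name check_rdma_env is hypothetical, and it only confirms that the three variables are set, not that their values are right for your Nodes.

```shell
#!/bin/bash
# Hypothetical helper: report which of the variables named in this guide are
# unset or empty. Returns 0 when all three are set, 1 otherwise.
check_rdma_env() {
  missing=0
  for name in NCCL_SOCKET_IFNAME NCCL_IB_HCA UCX_NET_DEVICES; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset: $name"
      missing=1
    else
      echo "$name=$value"
    fi
  done
  return "$missing"
}
```

Calling check_rdma_env at the top of a job script prints one line per variable, and its exit status can gate the rest of the script.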

Kubernetes example

When you deploy a Kubernetes Pod in the cluster, set rdma/ib in the Pod spec for both requests and limits and set the required environment variables, as the rdma/ib and env entries in this example show.

Kubernetes example with debug logging

# [...]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
    env:
      - name: NCCL_SOCKET_IFNAME
        value: eth0
      - name: NCCL_IB_HCA
        value: ibp
      - name: UCX_TLS
        value: tcp
      - name: UCX_NET_DEVICES
        value: eth0
      - name: NCCL_DEBUG
        value: INFO
# [...]

Setting NCCL_DEBUG to INFO enables extended logging. Remove it if you don’t need extended logging.
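One way to catch a manifest that sets rdma/ib in only one place is a rough text check before you apply it. The sketch below is a heuristic lint, not a YAML parser. The function name has_rdma_requests_and_limits is hypothetical, and it assumes a single-container manifest laid out like the example above, with requests: and limits: each on their own line.

```shell
#!/bin/bash
# Hypothetical lint: succeed only if an "rdma/ib: 1" entry appears after both
# a "requests:" key and a "limits:" key in the manifest passed as $1.
# It tracks the most recent of those two keys; it does not parse YAML.
has_rdma_requests_and_limits() {
  awk '
    /^[ \t]*requests:[ \t]*$/ { section = "requests" }
    /^[ \t]*limits:[ \t]*$/   { section = "limits" }
    /^[ \t]*rdma\/ib:[ \t]*"?1"?[ \t]*$/ { seen[section] = 1 }
    END { exit !(seen["requests"] && seen["limits"]) }
  ' "$1"
}
```

For manifests with several containers or documents, a real YAML parser is the safer choice, because this check can't tell which container an entry belongs to.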

Slurm example

When you deploy a Slurm job, set the required environment variables, as the export lines in this example show. Remove NCCL_DEBUG unless you need extended logging.

Example Slurm sbatch script

#!/bin/bash

#SBATCH --partition h100
#SBATCH --nodes 16
#SBATCH --ntasks-per-node 8
#SBATCH --gpus-per-node 8
# [...] other SBATCH options, as needed

export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=ibp
export UCX_TLS=tcp
export UCX_NET_DEVICES=eth0
export NCCL_DEBUG=INFO # Remove this line unless you need debug logging

Test with NCCL

After you configure your Pods or Slurm jobs, verify that GPUDirect RDMA works as expected by running NCCL tests across multiple Nodes. CoreWeave provides several sample NCCL test jobs designed for use with Message Passing Interface (MPI) Operator or Slurm. To test GPUDirect RDMA support with InfiniBand, see the nccl-tests repository for the test jobs and instructions for running them.
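When comparing test runs, it can help to pull a single number out of each log. The sketch below assumes the test binary ends its output with a summary line that starts with "# Avg bus bandwidth" followed by a colon and a value. That format is an assumption about the tool's output rather than something this guide specifies, so check it against your own logs. The function name avg_bus_bandwidth is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value from the "Avg bus bandwidth" summary
# line of the log file passed as $1. Assumes the line has the form
# "# Avg bus bandwidth    : <value>", optionally followed by other text.
avg_bus_bandwidth() {
  awk -F':' '/Avg bus bandwidth/ { split($2, parts, " "); print parts[1] }' "$1"
}
```

Recording this value for a known-good run gives a baseline to compare against when throughput later looks low.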

Troubleshoot RDMA scheduling and NCCL startup

This section covers the two most common problems when bringing up GPUDirect RDMA: a Node that doesn’t advertise the rdma/ib resource, and a job that hangs during NCCL initialization.

What rdma/ib: 0 means

rdma/ib: 0 on a Node means the Node isn’t advertising the RDMA resource, so the scheduler treats it as having no RDMA capacity. A Pod that requests rdma/ib doesn’t schedule on that Node and stays Pending with a FailedScheduling event such as Insufficient rdma/ib. This isn’t a problem with your Pod spec. The RDMA device plugin on that Node isn’t advertising the resource, usually because the Node hasn’t finished initializing its RDMA devices, the device plugin isn’t registered with the kubelet, or the Node has no RDMA hardware.

Detect rdma/ib: 0

# Show the rdma/ib resource reported by a specific Node.
# Safe to run anytime; read-only.
kubectl describe node [NODE-NAME] | grep -A2 "rdma/ib"

# List allocatable rdma/ib across all Nodes to spot Nodes reporting 0 or none.
# Safe to run anytime; read-only.
kubectl get nodes -o custom-columns='NODE:.metadata.name,RDMA:.status.allocatable.rdma/ib'

A Node that should have RDMA but reports 0 isn’t in a state you can fix from the workload side. Avoid scheduling on it and open a ticket. Restarting the device plugin requires platform access that customers don’t have.
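On a large cluster, the table from the second command can be filtered down to just the Nodes to avoid. This is a minimal sketch that assumes the two-column output of the custom-columns command above, where a missing resource prints as <none>. The function name nodes_without_rdma is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read the NODE/RDMA table on stdin, skip the header row,
# and print the names of Nodes whose RDMA column is 0 or <none>.
nodes_without_rdma() {
  awk 'NR > 1 && ($2 == "0" || $2 == "<none>") { print $1 }'
}

# Usage (read-only), piping in the command shown above:
#   kubectl get nodes -o custom-columns='NODE:.metadata.name,RDMA:.status.allocatable.rdma/ib' | nodes_without_rdma
```

The output is one Node name per line, which is convenient to paste into a support ticket.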

First checks when NCCL hangs at init

When a multi-Node job hangs during NCCL initialization, the most common cause is that traffic silently fell back to TCP because RDMA wasn’t in use. Check these in order:

1. Confirm every Pod requested rdma/ib: 1 in both requests and limits. A Pod that omits the request schedules without RDMA and falls back to TCP.
2. Confirm the Nodes advertise rdma/ib using the preceding detection commands. If any Node reports 0, the job may have landed partly on non-RDMA capacity.
3. Confirm the container image includes the OFED user-space libraries. A mismatch between the image’s RDMA libraries and the Node kernel modules causes a fallback or failure.
4. Confirm interface selection. If NCCL_SOCKET_IFNAME or NCCL_IB_HCA points at the wrong interface, NCCL can’t find the RDMA path. Set NCCL_DEBUG=INFO and look for NET/IB (RDMA in use) versus NET/Socket (TCP fallback) in the logs.
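For the log check in the last item, a short helper can count how often each marker appears in a saved job log. This is a sketch only: the function name nccl_net_summary is hypothetical, and the counts are a starting point rather than a verdict, because a line that mentions a transport isn't necessarily a success message.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in the log file passed as $1 that
# mention each NCCL network transport marker named in this guide.
nccl_net_summary() {
  # grep -c exits non-zero when the count is 0, so ignore its status.
  ib_lines="$(grep -c 'NET/IB' "$1" || true)"
  socket_lines="$(grep -c 'NET/Socket' "$1" || true)"
  echo "NET/IB lines: $ib_lines"
  echo "NET/Socket lines: $socket_lines"
}
```

After looking at the counts, read the matching lines themselves, for example with grep 'NET/' on the same file, to see what NCCL actually reported.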

From inside the Pod, you can also run ibstat and confirm each port shows State: Active and Physical state: LinkUp. A port that isn’t active can’t carry RDMA traffic. If a multi-Node job runs but throughput is far below expectations rather than hanging, NCCL has usually fallen back to TCP for the same reasons. For a full diagnostic walkthrough, including baselining with nccl-tests and isolating a degraded Node, see Why is my multi-node NCCL training slow?
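The ibstat check described above can be scripted so that a job fails fast when a port is down. This is a minimal sketch, assuming ibstat prints one "State:" line and one "Physical state:" line per port. The function name check_ib_ports is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read ibstat output on stdin. Print any port-state line
# that is not "State: Active" or "Physical state: LinkUp". Returns 0 only if
# at least one port was seen and none were printed.
check_ib_ports() {
  awk '
    /^[ \t]*State:/ {
      ports++
      if ($2 != "Active") { print; bad = 1 }
    }
    /^[ \t]*Physical state:/ {
      if ($3 != "LinkUp") { print; bad = 1 }
    }
    END { exit (bad || ports == 0) ? 1 : 0 }
  '
}

# Usage, from inside the Pod:
#   ibstat | check_ib_ports || echo "one or more InfiniBand ports are not active"
```

Treating "no ports found" as a failure is a deliberate choice here, since empty output usually means the tool couldn't see any device at all.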
