Use GPUDirect RDMA with InfiniBand
Learn how to use GPUDirect RDMA with InfiniBand
In this guide, learn how to use GPUDirect RDMA with InfiniBand at CoreWeave, and how to test it with NCCL.
Prerequisites
CoreWeave supports GPUDirect RDMA over InfiniBand for some instance types.
To use this feature, you must:
- select instance types that support InfiniBand,
- install the required user-space components in workload Pods, and then
- schedule workload Pods to run on Nodes equipped with InfiniBand.
Procedure
Select instance types that support InfiniBand
To use GPUDirect RDMA, first ensure that you have deployed a Node Pool that selects Nodes with InfiniBand support.
To learn which instance types support InfiniBand, see the list of supported instance types. All servers with InfiniBand support already have the required Mellanox OFED kernel drivers pre-installed.
Verify that the Node Pool is healthy and ready to run workloads.
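As a quick sanity check, you can confirm that Nodes in the pool advertise the rdma/ib resource requested by workload Pods later in this guide. This is a minimal sketch; the exact Allocatable entries depend on the instance type:
# Inspect a Node from the Node Pool and confirm it advertises the
# rdma/ib resource alongside its GPUs.
$ kubectl describe node <node-name> | grep -E 'rdma/ib|nvidia.com/gpu'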
Install the required user-space components to workload Pods
Once the Node Pool is in a healthy state, the necessary user-space components, such as NCCL and the OFED user-space libraries, must be installed in the Pod image.
CoreWeave publishes a repository of Dockerfiles with NCCL and the required OFED drivers pre-installed, which you can use for testing or as templates for your own distributed training workloads.
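Assuming an image built from those Dockerfiles, or your own image with the OFED user-space libraries and NCCL installed, the following sketch shows one way to verify the components from inside a running Pod (the exact tools available depend on the base image):
# Run inside the workload Pod to verify the user-space stack.
$ ofed_info -s                          # installed OFED user-space version
$ ibv_devinfo | grep -E 'hca_id|state'  # InfiniBand devices and port state
$ ldconfig -p | grep libnccl            # confirms the NCCL library is present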
Enable GPUDirect RDMA support for NCCL
To enable GPUDirect RDMA support for NCCL, set the following environment variables:
# NCCL environment variables are documented at:
# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
$ export NCCL_SOCKET_IFNAME=eth0
$ export NCCL_IB_HCA=ibp
$ export UCX_NET_DEVICES=ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1
Environment variables can also be set statically in /etc/nccl.conf, or in ~/.nccl.conf.
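The NCCL configuration file uses one KEY=VALUE pair per line, without export. As a minimal sketch, the NCCL settings above could be written as:
# /etc/nccl.conf (or ~/.nccl.conf): one KEY=VALUE pair per line, no "export"
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=ibp
Note that UCX_NET_DEVICES is read by UCX rather than NCCL, so it still needs to be set as an environment variable.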
Force RDMA over InfiniBand
To force NCCL to use GPUDirect RDMA for communication, even when other channels are available, set NCCL_NET_GDR_LEVEL to SYS:
$ export NCCL_NET_GDR_LEVEL=SYS
NCCL testing could still succeed over a different channel if this variable isn't set. See the NCCL documentation to learn more about the options available.
Extended logging
During testing, you can set the NCCL_DEBUG environment variable to INFO to print debug information from NCCL. The increased verbosity can help you diagnose issues with RDMA support; however, it also increases the log size, so disable this option after testing is complete.
$ export NCCL_DEBUG=INFO
For more information, see the official NCCL documentation.
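With NCCL_DEBUG=INFO set, you can filter the output of an NCCL workload for InfiniBand and GPUDirect activity. The exact log wording varies between NCCL versions, so treat this as a rough sanity check rather than a definitive test:
# Filter NCCL INFO output for InfiniBand / GPUDirect-related lines.
# (Log formats differ between NCCL versions.)
$ grep -E 'NET/IB|GDRDMA' /path/to/test-output.log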
Schedule workload Pods to run on Nodes with InfiniBand
Now that the Node Pool has been configured to select instance types with InfiniBand support, Pods must be scheduled onto these servers by setting the rdma/ib value to 1:
[...]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
[...]
This value is a boolean used to schedule Pods onto servers featuring InfiniBand support. It does not indicate the number of InfiniBand devices requested.
Kubernetes schedules resources through requests and limits. When only limits are specified, the requests are set to the same values as the limits. To learn more about container resource management on Kubernetes, see the official Kubernetes documentation.
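For example, a container spec that sets only limits is equivalent to the example above. This is a sketch of the resources stanza only, not a complete Pod manifest:
[...]
    resources:
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
      # No requests are specified, so Kubernetes defaults each request
      # to the corresponding limit above.
[...]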
Testing with NCCL
CoreWeave's nccl-tests repository provides several sample NCCL test jobs designed for use with MPI Operator or Slurm, which you can use to test GPUDirect RDMA support with InfiniBand. For more information, refer to the testing instructions in the repository.
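For multi-node runs, use the MPI Operator or Slurm job templates from the repository. As an illustration only (the exact build path and flags depend on how nccl-tests is built in your image), a single-node all_reduce_perf run looks roughly like this:
# Run the all_reduce_perf benchmark from nccl-tests on 8 local GPUs,
# sweeping message sizes from 8 bytes to 4 GiB.
$ export NCCL_DEBUG=INFO
$ export NCCL_NET_GDR_LEVEL=SYS
$ ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8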