Use GPUDirect RDMA with RoCE

In this guide, learn how to use GPUDirect RDMA with RoCE-based backend fabrics at CoreWeave, and how to test it with NCCL.

Prerequisites

CoreWeave supports GPUDirect RDMA over RoCE (RDMA over Converged Ethernet) for some GPU instance types built on Spectrum-X Ethernet fabrics. To use this feature, you must:

select a Node Pool with RoCE support,
install NCCL (and UCX / OFED userspace as needed) in the Pod image, then
configure the Pods to use GPUDirect RDMA over RoCE.

For background on CoreWeave’s high-performance fabrics, see About CoreWeave HPC interconnects.

Select a Node Pool with RoCE support

To use GPUDirect RDMA over RoCE, make sure the Node Pool has Nodes connected to a RoCE-capable fabric, as indicated in our list of GPU instance types and in your contract. If you’re unsure whether a given cluster or Node Pool has RoCE enabled, contact your CoreWeave representative.

All Nodes on these clusters have the required RoCE kernel drivers and firmware pre-installed. CKS manages the RoCE driver, NIC configuration, and fabric integration. To avoid Node instability, do not install additional low-level driver management tools inside your Pods.

Configure the Pods

The Pods must be configured to use GPUDirect RDMA over RoCE. Follow these steps:

1. Request the RoCE RDMA resource so Pods land on Nodes with RoCE.
2. Attach RoCE interfaces into the Pod using Multus.
3. Configure NCCL (and optionally UCX) to use those interfaces.

On some platforms (for example, SUNK NodeSets such as gb300-4x-e), steps (1) and (2) are handled at the NodeSet level; for direct Kubernetes workloads, you configure them explicitly in the Pod spec.

1. Request the RoCE RDMA resource

Set rdma/ib to 1 under spec.containers[].resources.requests. This value does not indicate the number of RoCE devices requested; it’s used as a boolean to schedule Pods onto servers that expose RoCE RDMA resources.

Kubernetes schedules resources through requests and limits. When only limits are specified, the requests are set to the same amount as the limits. To learn more about container resource management on Kubernetes, see the official Kubernetes documentation. The Kubernetes example below shows how to set the rdma/ib value in the Pod spec for both requests and limits.
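Before relying on the rdma/ib request, it can help to confirm that a Node actually advertises the resource. The following is a minimal sketch, not a CoreWeave-provided tool. It assumes that rdma/ib is listed as an extended resource in the Capacity section of `kubectl describe node` output; rdma_ib_capacity is a hypothetical helper name and NODE is a placeholder.

```shell
#!/bin/sh
# rdma_ib_capacity (hypothetical helper): read `kubectl describe node` text
# on stdin and print the first "rdma/ib:" value found, which is normally the
# Capacity entry. Exits non-zero if the Node does not list the resource.
rdma_ib_capacity() {
  awk '$1 == "rdma/ib:" { print $2; found = 1; exit }
       END { if (!found) exit 1 }'
}

# Example (requires cluster access; NODE is a placeholder):
#   kubectl describe node "$NODE" | rdma_ib_capacity
```

A non-zero exit status suggests the Node does not expose rdma/ib, in which case a Pod that requests it would not be scheduled there.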

2. Attach RoCE interfaces with Multus

RoCE backend interfaces are exposed into Pods using Multus CNI and NetworkAttachmentDefinition (NAD) objects. On RoCE-enabled clusters, CoreWeave defines NADs that map host RoCE devices (for example, Spectrum-X ibs#p# ports) into Pod network interfaces via MACVLAN and VRF configuration. On some clusters, you may be given a set of per-port NADs (for example, ibs0p0-macvlan, ibs0p1-macvlan, …). In that case, the k8s.v1.cni.cncf.io/networks annotation contains one entry per backend interface, for example:

metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]

Key points:

name and namespace must match the RoCE NADs configured in your cluster.
The attached interfaces appear inside the Pod as additional network devices (for example, net1, net2, …) and are used by NCCL/UCX for GPUDirect RDMA traffic.
Some clusters may also provide a single “backend” NAD name; your CoreWeave representative can provide the correct annotation for your cluster.
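Because the per-port NAD names follow a regular pattern, the annotation value can be generated rather than typed by hand. This is a sketch under the assumption that your cluster uses the same ibs<N>p<M>-macvlan names and cw-multus namespace as the example above; multus_networks and its three parameters are hypothetical, so confirm the real NAD names before using the output.

```shell
#!/bin/sh
# multus_networks (hypothetical helper): print a JSON list with one entry
# per backend port, suitable as the value of k8s.v1.cni.cncf.io/networks.
#   $1 = NAD namespace            (default: cw-multus)
#   $2 = NIC indices, space-sep.  (default: 0 1 2 3)
#   $3 = port indices, space-sep. (default: 0 1 2 3)
# The body runs in a subshell so its variables do not leak to the caller.
multus_networks() (
  namespace="${1:-cw-multus}"
  nics="${2:-0 1 2 3}"
  ports="${3:-0 1 2 3}"
  sep=""
  printf '['
  for nic in $nics; do
    for port in $ports; do
      printf '%s{"name": "ibs%sp%s-macvlan", "namespace": "%s"}' \
        "$sep" "$nic" "$port" "$namespace"
      sep=","
    done
  done
  printf ']\n'
)

# With the defaults this prints 16 entries, ibs0p0 through ibs3p3:
#   multus_networks
```

The output is compact single-line JSON, which is equivalent to the multi-line form shown above.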

3. Configure NCCL (and optional UCX) for RoCE

Configure the Pod to use GPUDirect RDMA over RoCE by setting these environment variables:

NCCL_SOCKET_IFNAME: The front-end interface name used for NCCL’s TCP-based control and out-of-band communication. Commonly set to the primary Pod interface, for example: eth0.
NCCL_IB_HCA: The RDMA interface(s) used by NCCL for GPUDirect RDMA collectives. On GB300 RoCE clusters, this is typically the ibp interface. On other clusters, it should match the RoCE device naming or the Pod interfaces created by Multus (for example, a specific interface name or a comma-separated list such as net1,net2), depending on how your cluster is configured.
UCX_NET_DEVICES (optional, for UCX-based stacks): The network devices to use for UCX communication. This is usually set to the same interfaces used for RoCE-based RDMA traffic.
UCX_TLS (optional, for UCX-based stacks): The UCX transports to enable. For RoCE, you might use tcp (and other transports as needed), depending on your UCX configuration and application requirements.
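To see which names are available for these variables, you can list the devices visible inside the Pod. The sketch below assumes the standard Linux sysfs layout, where network interfaces appear under /sys/class/net and RDMA devices under /sys/class/infiniband; list_devices is a hypothetical helper, and the net and ibp prefixes are taken from the examples in this guide, so they may differ on your cluster.

```shell
#!/bin/sh
# list_devices (hypothetical helper): print a comma-separated list of the
# entries in directory $1 whose names start with prefix $2.
# Prints an empty line if nothing matches.
list_devices() (
  dir=$1
  prefix=$2
  sep=""
  for path in "$dir"/"$prefix"*; do
    [ -e "$path" ] || continue      # skip the unexpanded glob when nothing matches
    printf '%s%s' "$sep" "${path##*/}"
    sep=","
  done
  printf '\n'
)

# Examples (run inside the Pod; prefixes are assumptions):
#   list_devices /sys/class/net net          # Multus-attached interfaces
#   list_devices /sys/class/infiniband ibp   # RDMA devices
#   export UCX_NET_DEVICES="$(list_devices /sys/class/net net)"
```

Inspect the output before exporting it, since which list belongs in which variable depends on how your cluster names its RoCE devices.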

Examples for Kubernetes and Slurm are in the sections below.

4. (Optional) Enable extended logging with the NCCL_DEBUG environment variable

To increase the verbosity of NCCL’s logging, set the NCCL_DEBUG environment variable to INFO for extra debug information. This can help diagnose issues with RDMA support, but it increases the log file size, so it should be disabled when testing is complete. See NCCL_DEBUG in the NCCL documentation for more logging options.
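Once NCCL_DEBUG=INFO output has been captured to a file, a quick count of transport tags can indicate whether NCCL picked the RDMA path or fell back to sockets. This is a rough sketch: it assumes the log uses the NET/IB, GDRDMA, and NET/Socket tags that NCCL's INFO output commonly contains, the exact wording can vary between NCCL versions, and nccl_transport_summary is a hypothetical helper.

```shell
#!/bin/sh
# nccl_transport_summary (hypothetical helper): count the lines in an NCCL
# log file ($1) that mention each transport tag.
#   NET/IB      verbs transport (used for both InfiniBand and RoCE)
#   GDRDMA      channels using GPUDirect RDMA
#   NET/Socket  TCP socket transport (fallback)
# grep -c exits non-zero on zero matches, so "|| true" keeps the helper
# usable in scripts that run with `set -e`.
nccl_transport_summary() (
  log=$1
  ib=$(grep -c 'NET/IB' "$log" || true)
  gdr=$(grep -c 'GDRDMA' "$log" || true)
  sock=$(grep -c 'NET/Socket' "$log" || true)
  printf 'NET/IB=%s GDRDMA=%s NET/Socket=%s\n' "$ib" "$gdr" "$sock"
)

# Example (the log path is a placeholder):
#   nccl_transport_summary /path/to/nccl.log
```

Zero NET/IB and GDRDMA counts alongside a non-zero NET/Socket count would suggest that the RoCE interfaces are not being used and the settings from step 3 are worth rechecking.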

Kubernetes example

When deploying a Kubernetes Pod in the cluster, use the lines below to:

set the rdma/ib value in the Pod spec for both requests and limits,
attach RoCE backend interfaces via k8s.v1.cni.cncf.io/networks, and
set the required environment variables.

# [...]
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
    env:
      # Control / OOB traffic over primary Pod interface
      - name: NCCL_SOCKET_IFNAME
        value: eth0

      # RoCE devices used for GPUDirect RDMA collectives
      # Adjust to match your Pod's RoCE interfaces (for example, net1,net2)
      - name: NCCL_IB_HCA
        value: ibp

      # Optional: enable UCX on the RoCE interfaces
      - name: UCX_NET_DEVICES
        value: net1
      - name: UCX_TLS
        value: tcp

      # Optional: extended NCCL logging for debugging
      - name: NCCL_DEBUG
        value: INFO
# [...]

Setting NCCL_DEBUG to INFO enables extended logging. Remove the variable if extended logging is not required.

Slurm example

When deploying a Slurm job on a cluster configured for RoCE (for example, a SUNK cluster using gb300-4x-e NodeSets), use the lines below to set the required environment variables. Remove NCCL_DEBUG unless extended logging is needed.

#!/bin/bash

#SBATCH --partition gb300
#SBATCH --nodes 16
#SBATCH --ntasks-per-node 4
#SBATCH --gpus-per-node 4
# [...] other SBATCH options, as needed

# Control / OOB traffic over front-end Ethernet
export NCCL_SOCKET_IFNAME=eth0

# RoCE RDMA devices inside the compute Pods
# Adjust this to match the RoCE interfaces provided in your cluster
export NCCL_IB_HCA=net1

# Optional: UCX configuration if your stack uses UCX
export UCX_TLS=tcp
export UCX_NET_DEVICES=net1

# Optional: extended NCCL logging while validating RoCE behavior
export NCCL_DEBUG=INFO  # Remove NCCL_DEBUG unless debug logging is needed

Testing with NCCL

CoreWeave has several sample NCCL test jobs designed for use with MPI Operator or Slurm. These are in the nccl-tests repository, which can be used to test GPUDirect RDMA support with RoCE. For more information, refer to the testing instructions in the repository.
