In this guide, learn how to use GPUDirect RDMA with RoCE-based backend fabrics at CoreWeave, and how to test it with NCCL. GPUDirect RDMA over RoCE lets multi-Node GPU workloads move data directly between GPU memory and the network fabric, which is essential for high-throughput, low-latency collective operations in distributed training and HPC jobs. This guide targets platform engineers and ML practitioners who deploy Pods on RoCE-enabled CoreWeave clusters.

Prerequisites

CoreWeave supports GPUDirect RDMA over RoCE (RDMA over Converged Ethernet) for some GPU instance types built on Spectrum-X Ethernet fabrics. To use this feature, you must:
  • Select a Node Pool with RoCE support.
  • Install NCCL, and UCX or OFED userspace as needed, in the Pod image.
  • Configure the Pods to use GPUDirect RDMA over RoCE.
For background on CoreWeave’s high-performance fabrics, see About CoreWeave HPC interconnects.

Select a Node Pool with RoCE support

To use GPUDirect RDMA over RoCE, make sure the Node Pool has Nodes connected to a RoCE-capable fabric, as shown in the GPU instance types list and your contract. All Nodes on these clusters have the required RoCE kernel drivers and firmware pre-installed. CKS manages the RoCE driver, NIC configuration, and fabric integration. To avoid Node instability, don’t install additional low-level driver management tools inside your Pods. If you’re unsure whether a given cluster or Node Pool has RoCE enabled, contact your CoreWeave representative.

Configure the Pods

Configure the Pods to use GPUDirect RDMA over RoCE. The following sections describe the three required configuration tasks, with an optional fourth task for debug logging:
  1. Request the RoCE RDMA resource so Pods land on Nodes with RoCE.
  2. Attach RoCE interfaces into the Pod using Multus.
  3. Configure NCCL (and optionally UCX) to use those interfaces.
On some platforms (for example, SUNK NodeSets such as gb300-4x-e), the NodeSet handles steps 1 and 2. For direct Kubernetes workloads, you configure them explicitly in the Pod spec. After you complete these steps, your Pods run on RoCE-capable Nodes, have the RoCE backend interfaces attached, and use NCCL configured for GPUDirect RDMA traffic.

Request the RoCE RDMA resource

In the container's resource requests (spec.containers[].resources.requests), set rdma/ib to 1. This value doesn’t indicate the number of RoCE devices requested; Kubernetes uses it as a boolean to schedule Pods onto servers that expose RoCE RDMA resources. Kubernetes schedules resources through requests and limits, and when you specify only limits, it sets requests to the same amount as the limit. For more information, see the Kubernetes documentation on container resource management. For a full YAML example that sets rdma/ib in the Pod spec for both requests and limits, see Kubernetes example.

Attach RoCE interfaces with Multus

CoreWeave exposes RoCE backend interfaces into Pods using Multus CNI and NetworkAttachmentDefinition (NAD) objects. On RoCE-enabled clusters, CoreWeave defines NADs that map host RoCE devices (for example, Spectrum-X ibs#p# ports) into Pod network interfaces through MACVLAN and VRF configuration. On some clusters, CoreWeave provides a set of per-port NADs (for example, ibs0p0-macvlan and ibs0p1-macvlan). In that case, the k8s.v1.cni.cncf.io/networks annotation contains one entry per backend interface, for example:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]
Key points:
  • name and namespace must match the RoCE NADs configured in your cluster.
  • The attached interfaces appear inside the Pod as additional network devices (for example, net1 and net2), and NCCL and UCX use them for GPUDirect RDMA traffic.
  • Some clusters also provide a single “backend” NAD name. Your CoreWeave representative can provide the correct annotation for your cluster.
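Because the per-port annotation is repetitive, it can be generated instead of typed by hand. The following is a minimal sketch, not a CoreWeave tool: the function name multus_networks is hypothetical, and it assumes the ibs<device>p<port>-macvlan naming and the cw-multus namespace from the example above. Confirm the actual NAD names for your cluster before using the output.

```shell
#!/usr/bin/env bash
# Sketch: build the JSON value for the k8s.v1.cni.cncf.io/networks annotation.
# Assumptions: NADs are named ibs<device>p<port>-macvlan and live in the
# cw-multus namespace, as in the example above. multus_networks is a
# hypothetical helper, not part of any CoreWeave or Multus tooling.

multus_networks() {
  local devices=$1 ports=$2 namespace=${3:-cw-multus}
  local d p sep=""
  printf '['
  for ((d = 0; d < devices; d++)); do
    for ((p = 0; p < ports; p++)); do
      # Emit one {"name", "namespace"} entry per backend interface.
      printf '%s{"name": "ibs%dp%d-macvlan", "namespace": "%s"}' \
        "$sep" "$d" "$p" "$namespace"
      sep=","
    done
  done
  printf ']\n'
}

# Four devices with four ports each yields the 16 entries shown above.
multus_networks 4 4
```

The output is a single-line JSON array that can be pasted as the annotation value. Pass a third argument to use a namespace other than cw-multus.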

Configure NCCL and UCX for RoCE

To configure the Pod to use GPUDirect RDMA over RoCE, set these environment variables:
  • NCCL_SOCKET_IFNAME: The front-end interface name for NCCL’s TCP-based control and out-of-band communication. Commonly set to the primary Pod interface, for example eth0.
  • NCCL_IB_HCA: The RDMA interfaces that NCCL uses for GPUDirect RDMA collectives. On GB300 RoCE clusters, this is the ibp interface. On other clusters, it must match the RoCE device naming or the Pod interfaces that Multus creates (for example, a specific interface name or a comma-separated list such as net1,net2), depending on how you configured your cluster.
  • UCX_NET_DEVICES (optional, for UCX-based stacks): The network devices to use for UCX communication. Set this to the same interfaces that handle RoCE-based RDMA traffic.
  • UCX_TLS (optional, for UCX-based stacks): The UCX transports to enable. For RoCE, you might use tcp (and other transports as needed), depending on your UCX configuration and application requirements.
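On clusters where the variables should list the Pod interfaces that Multus creates, the list can be derived at container start instead of being hard-coded. The sketch below assumes the secondary interfaces are named net1, net2, and so on, as in the examples in this guide; list_net_ifaces is a hypothetical helper. It doesn't apply to clusters that expect a device name such as ibp.

```shell
#!/usr/bin/env bash
# Sketch: collect Multus-attached interfaces into a comma-separated list.
# Assumption: secondary Pod interfaces are named net1, net2, ... as in the
# examples in this guide. list_net_ifaces is a hypothetical helper.

list_net_ifaces() {
  local sysfs_dir=${1:-/sys/class/net}
  local path names=()
  for path in "$sysfs_dir"/net[0-9]*; do
    [ -e "$path" ] || continue   # the glob matched nothing
    names+=("${path##*/}")       # keep only the interface name
  done
  local IFS=,
  printf '%s\n' "${names[*]}"    # join the names with commas
}

ifaces=$(list_net_ifaces)
if [ -n "$ifaces" ]; then
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_HCA="$ifaces"
  export UCX_NET_DEVICES="$ifaces"
fi
echo "NCCL_IB_HCA=${NCCL_IB_HCA:-<unset>}"
```

The names are returned in glob (lexical) order, so net10 sorts before net2. If no matching interface exists, the script leaves the variables untouched.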
The following sections provide complete Kubernetes and Slurm examples that combine these configuration steps.

Optional: Enable extended NCCL logging

To increase the verbosity of NCCL’s logging, set the NCCL_DEBUG environment variable to INFO for extra debug information. This can help diagnose issues with RDMA support, but it increases the log file size, so disable it when testing is complete. For more logging options, see NCCL_DEBUG in the NCCL documentation.
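Once a job has run with NCCL_DEBUG=INFO, a quick scan of the log can indicate whether traffic used GPUDirect RDMA. The following sketch assumes that NCCL channel lines contain "via NET/IB/<n>/GDRDMA" when GPUDirect RDMA is active and "via NET/Socket" when NCCL falls back to TCP. The exact wording can differ between NCCL versions, so treat the patterns as assumptions to verify against your own logs. The function name summarize_nccl_log is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count transport markers in an NCCL_DEBUG=INFO log.
# Assumption: channel lines include "via NET/IB/<n>/GDRDMA" for GPUDirect RDMA
# and "via NET/Socket" for TCP fallback; wording may vary by NCCL version.

summarize_nccl_log() {
  local log_file=$1
  local gdr socket
  gdr=$(grep -c 'via NET/IB/[0-9]*/GDRDMA' "$log_file")
  socket=$(grep -c 'via NET/Socket' "$log_file")
  echo "gdrdma_lines=$gdr socket_lines=$socket"
  # Succeed only if GPUDirect RDMA channels were seen and no socket fallback.
  [ "$gdr" -gt 0 ] && [ "$socket" -eq 0 ]
}

# Usage (job.log is a placeholder path):
#   summarize_nccl_log job.log || echo "GPUDirect RDMA not confirmed" >&2
```

A non-zero exit status means the log did not confirm GPUDirect RDMA under these assumptions; it does not by itself prove a misconfiguration.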

Kubernetes example

When deploying a Kubernetes Pod in the cluster, adapt the following example to:
  • Set the rdma/ib value in the Pod spec for both requests and limits.
  • Attach RoCE backend interfaces through the k8s.v1.cni.cncf.io/networks annotation.
  • Set the required environment variables.
# [...]
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
    env:
      # Control / OOB traffic over primary Pod interface
      - name: NCCL_SOCKET_IFNAME
        value: eth0

      # RoCE devices used for GPUDirect RDMA collectives
      # Adjust to match your Pod's RoCE interfaces (for example, net1,net2)
      - name: NCCL_IB_HCA
        value: ibp

      # Optional: enable UCX on the RoCE interfaces
      - name: UCX_NET_DEVICES
        value: net1
      - name: UCX_TLS
        value: tcp

      # Optional: extended NCCL logging for debugging
      - name: NCCL_DEBUG
        value: INFO
# [...]
Setting NCCL_DEBUG to INFO enables extended logging. Remove this variable when you don’t need extended logging.

Slurm example

When deploying a Slurm job on a cluster configured for RoCE (for example, a SUNK cluster using gb300-4x-e NodeSets), set the required environment variables in the batch script as in the following example. Remove NCCL_DEBUG when you don’t need extended logging.
#!/bin/bash

#SBATCH --partition gb300
#SBATCH --nodes 16
#SBATCH --ntasks-per-node 4
#SBATCH --gpus-per-node 4
# [...] other SBATCH options, as needed

# Control / OOB traffic over front-end Ethernet
export NCCL_SOCKET_IFNAME=eth0

# RoCE RDMA devices inside the compute Pods
# Adjust this to match the RoCE interfaces provided in your cluster
export NCCL_IB_HCA=net1

# Optional: UCX configuration if your stack uses UCX
export UCX_TLS=tcp
export UCX_NET_DEVICES=net1

# Optional: extended NCCL logging while validating RoCE behavior
export NCCL_DEBUG=INFO  # Remove NCCL_DEBUG when you don't need debug logging
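A batch script can also fail fast when a required variable is missing, rather than letting NCCL silently fall back to another transport. The following is a minimal sketch; require_env is a hypothetical helper, and which variables count as required is an assumption based on the list in this guide.

```shell
#!/usr/bin/env bash
# Sketch: verify that the named environment variables are set and non-empty.
# require_env is a hypothetical helper, not part of Slurm or NCCL.

require_env() {
  local missing=0 name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after the export lines and before launching the job step:
#   require_env NCCL_SOCKET_IFNAME NCCL_IB_HCA || exit 1
```

The function reports every missing variable before returning, so a single run shows the full list.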

Test with NCCL

CoreWeave provides several sample NCCL test jobs for use with MPI Operator or Slurm. These jobs live in the nccl-tests repository, and you can use them to test GPUDirect RDMA support with RoCE.
Last modified on May 29, 2026