# Use GPUDirect RDMA with RoCE

> Configure and use GPUDirect RDMA over RoCE on CoreWeave, including Pod configuration and NCCL setup

In this guide, learn how to use GPUDirect RDMA with RoCE-based backend fabrics at CoreWeave, and how to test it with NCCL. GPUDirect RDMA over RoCE lets multi-Node GPU workloads move data directly between GPU memory and the network fabric, which is essential for high-throughput, low-latency collective operations in distributed training and HPC jobs. This guide targets platform engineers and ML practitioners who deploy Pods on RoCE-enabled CoreWeave clusters.

## Prerequisites

CoreWeave supports GPUDirect RDMA over RoCE (RDMA over Converged Ethernet) for some [GPU instance types](/platform/instances/gpu-instances) built on Spectrum-X Ethernet fabrics.

To use this feature, you must:

* Select a Node Pool with RoCE support.
* Install NCCL, and UCX or OFED userspace as needed, in the Pod image.
* Configure the Pods to use GPUDirect RDMA over RoCE.

For background on CoreWeave's high-performance fabrics, see [About CoreWeave HPC interconnects](/products/networking/hpc-interconnect/about-hpc-interconnect).

## Select a Node Pool with RoCE support

To use GPUDirect RDMA over RoCE, make sure the [Node Pool](/products/cks/nodes/nodes-and-node-pools) has Nodes connected to a RoCE-capable fabric, as shown in the GPU instance types list and your contract.

All Nodes on these clusters have the required RoCE kernel drivers and firmware pre-installed. CKS manages the RoCE driver, NIC configuration, and fabric integration. To avoid Node instability, don't install additional low-level driver management tools inside your Pods.

If you're unsure whether a given cluster or Node Pool has RoCE enabled, contact your CoreWeave representative.

## Configure the Pods

Configure the Pods to use GPUDirect RDMA over RoCE. The following sections describe the three required configuration tasks, with an optional fourth task for debug logging:

1. Request the RoCE RDMA resource so Pods land on Nodes with RoCE.
2. Attach RoCE interfaces into the Pod using Multus.
3. Configure NCCL (and optionally UCX) to use those interfaces.

On some platforms (for example, SUNK NodeSets such as `gb300-4x-e`), the NodeSet handles steps 1 and 2. For direct Kubernetes workloads, you configure them explicitly in the Pod spec.

After you complete these steps, your Pods run on RoCE-capable Nodes, have the RoCE backend interfaces attached, and use NCCL configured for GPUDirect RDMA traffic.

### Request the RoCE RDMA resource

In each container that needs RDMA access, set the `rdma/ib` resource under `spec.containers[].resources.requests` to `1`.

This value doesn't indicate the number of RoCE devices requested. Kubernetes uses it as a boolean to schedule Pods onto servers that expose RoCE RDMA resources.

Kubernetes schedules resources through `requests` and `limits`. When you specify only `limits`, Kubernetes sets `requests` to the same amount as the limit. For more information, see the Kubernetes documentation on [container resource management](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/).

For a full YAML example showing how to set the `rdma/ib` value in the Pod spec for both `requests` and `limits`, see [Kubernetes example](#kubernetes-example).
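To see which Nodes advertise the resource before you schedule a Pod, you can list each Node's allocatable `rdma/ib` count. The following is a minimal sketch, not a CoreWeave-provided tool: the `rdma_nodes` helper name is hypothetical, and it assumes that RoCE-capable Nodes report `rdma/ib` under `.status.allocatable`, which is where Kubernetes lists extended resources.

```bash
#!/bin/bash
# rdma_nodes (hypothetical helper): read a two-column "NAME RDMA" table on
# stdin and print the names of Nodes whose RDMA column is a positive integer.
# Nodes without the resource show "<none>" and are skipped.
rdma_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Example usage, which requires kubectl access to the cluster:
#   kubectl get nodes \
#     -o custom-columns='NAME:.metadata.name,RDMA:.status.allocatable.rdma/ib' \
#     | rdma_nodes
```

An empty result suggests that no Node in the cluster currently exposes `rdma/ib`, in which case a Pod that requests it would stay in `Pending`.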

### Attach RoCE interfaces with Multus

CoreWeave exposes RoCE backend interfaces into Pods using [Multus CNI](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/#multus-cni) and NetworkAttachmentDefinition (NAD) objects.

On RoCE-enabled clusters, CoreWeave defines NADs that map host RoCE devices (for example, Spectrum-X `ibs#p#` ports) into Pod network interfaces through MACVLAN and VRF configuration.

On some clusters, CoreWeave provides a set of per-port NADs (for example, `ibs0p0-macvlan` and `ibs0p1-macvlan`). In that case, the `k8s.v1.cni.cncf.io/networks` annotation contains one entry per backend interface, for example:

```yaml
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]
```

Key points:

* `name` and `namespace` must match the RoCE NADs configured in your cluster.
* The attached interfaces appear inside the Pod as additional network devices (for example, `net1` and `net2`), and NCCL and UCX use them for GPUDirect RDMA traffic.
* Some clusters also provide a single "backend" NAD name. Your CoreWeave representative can provide the correct annotation for your cluster.
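Writing one entry per port by hand is repetitive, so the list can also be generated. The following sketch prints a JSON list equivalent to the example annotation above. The function name `roce_networks_annotation` is hypothetical, and the `ibs<nic>p<port>-macvlan` naming pattern and `cw-multus` namespace are assumptions taken from that example, so confirm the NAD names in your cluster before using the output.

```bash
#!/bin/bash
# roce_networks_annotation (hypothetical helper): print a Multus networks
# JSON list with one entry per backend port, named ibs<nic>p<port>-macvlan.
# Arguments: NIC count, ports per NIC, NAD namespace. The defaults mirror
# the example annotation and may not match your cluster.
roce_networks_annotation() (
  nics="${1:-4}"
  ports="${2:-4}"
  namespace="${3:-cw-multus}"
  sep=""
  printf '['
  nic=0
  while [ "$nic" -lt "$nics" ]; do
    port=0
    while [ "$port" -lt "$ports" ]; do
      printf '%s{"name": "ibs%dp%d-macvlan", "namespace": "%s"}' \
        "$sep" "$nic" "$port" "$namespace"
      sep=","
      port=$((port + 1))
    done
    nic=$((nic + 1))
  done
  printf ']\n'
)

# Print the 16-entry list for 4 NICs with 4 ports each.
roce_networks_annotation 4 4 cw-multus
```

The output is a single-line JSON list that you could paste as the value of the `k8s.v1.cni.cncf.io/networks` annotation.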

### Configure NCCL and UCX for RoCE

To configure the Pod to use GPUDirect RDMA over RoCE, set these environment variables:

* `NCCL_SOCKET_IFNAME`: The front-end interface name for NCCL's TCP-based control and out-of-band communication. Commonly set to the primary Pod interface, for example `eth0`.
* `NCCL_IB_HCA`: The RDMA interfaces that NCCL uses for GPUDirect RDMA collectives. On GB300 RoCE clusters, this is the `ibp` interface. On other clusters, it must match the RoCE device naming or the Pod interfaces that Multus creates (for example, a specific interface name or a comma-separated list such as `net1,net2`), depending on how you configured your cluster.
* `UCX_NET_DEVICES` (optional, for UCX-based stacks): The network devices to use for UCX communication. Set this to the same interfaces that handle RoCE-based RDMA traffic.
* `UCX_TLS` (optional, for UCX-based stacks): The UCX transports to enable. For RoCE, you might use `tcp` (and other transports as needed), depending on your UCX configuration and application requirements.
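Before you choose values for these variables, it can help to check which interfaces exist inside a running Pod. The following sketch lists interfaces by name prefix from `/sys/class/net` and prints the current NCCL settings next to them for comparison. The `list_pod_ifaces` name is hypothetical, and the default `net` prefix is an assumption based on the `net1,net2` examples in this guide.

```bash
#!/bin/bash
# list_pod_ifaces (hypothetical helper): print a comma-separated list of
# network interfaces whose names start with a prefix (default "net").
# The second argument overrides the sysfs directory, mainly for testing.
list_pod_ifaces() (
  prefix="${1:-net}"
  sysfs_dir="${2:-/sys/class/net}"
  list=""
  for path in "$sysfs_dir"/"$prefix"*; do
    [ -e "$path" ] || continue   # skip the literal pattern when nothing matches
    name="${path##*/}"
    list="${list:+$list,}$name"
  done
  printf '%s\n' "$list"
)

# Show the candidate interfaces next to the values that are currently set.
echo "Interfaces with prefix 'net': $(list_pod_ifaces)"
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-<unset>}"
echo "NCCL_IB_HCA=${NCCL_IB_HCA:-<unset>}"
```

If the first line prints an empty list, the Pod might not have any Multus-attached interfaces, or your cluster might use a different naming scheme.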

The Kubernetes and Slurm examples later in this guide combine these configuration steps.

### Optional: Enable extended NCCL logging

To increase the verbosity of NCCL's logging, set the `NCCL_DEBUG` environment variable to `INFO` for extra debug information. This can help diagnose issues with RDMA support, but it increases the log file size, so disable it when testing is complete. For more logging options, see [`NCCL_DEBUG` in the NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html?#nccl-debug).
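With `NCCL_DEBUG=INFO` set, a quick way to check which transport NCCL selected is to count the log lines that mention each one. The following sketch is illustrative only. The substrings `NET/IB`, `GDRDMA`, and `NET/Socket` are patterns that NCCL commonly prints at the `INFO` level, but the exact wording varies by NCCL version, so treat them as assumptions. The helper name and the log file name are hypothetical.

```bash
#!/bin/bash
# summarize_nccl_log (hypothetical helper): count the log lines on stdin
# that mention each transport-related substring.
summarize_nccl_log() {
  awk '
    /NET\/IB/     { ib++ }
    /GDRDMA/      { gdr++ }
    /NET\/Socket/ { sock++ }
    END {
      printf "NET/IB lines: %d\n", ib
      printf "GDRDMA lines: %d\n", gdr
      printf "NET/Socket lines: %d\n", sock
    }'
}

# Example usage with a hypothetical log file name:
#   summarize_nccl_log < nccl-job.log
```

A log with no `NET/IB` or `GDRDMA` lines can suggest that NCCL fell back to TCP sockets, which is worth investigating before you run larger jobs.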

## Kubernetes example

When deploying a Kubernetes Pod in the cluster, the following example shows how to:

* Set the `rdma/ib` value in the Pod spec for both `requests` and `limits`.
* Attach RoCE backend interfaces through the `k8s.v1.cni.cncf.io/networks` annotation.
* Set the required environment variables.

```yaml
# [...]
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [{
        "name": "ibs0p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs0p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs1p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs2p3-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p0-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p1-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p2-macvlan",
        "namespace": "cw-multus"
      },{
        "name": "ibs3p3-macvlan",
        "namespace": "cw-multus"
      }]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 4
    env:
      # Control / OOB traffic over primary Pod interface
      - name: NCCL_SOCKET_IFNAME
        value: eth0

      # RoCE devices used for GPUDirect RDMA collectives
      # Adjust to match your Pod's RoCE interfaces (for example, net1,net2)
      - name: NCCL_IB_HCA
        value: ibp

      # Optional: enable UCX on the RoCE interfaces
      - name: UCX_NET_DEVICES
        value: net1
      - name: UCX_TLS
        value: tcp

      # Optional: extended NCCL logging for debugging
      - name: NCCL_DEBUG
        value: INFO
# [...]
```

Setting `NCCL_DEBUG` to `INFO` enables extended logging. Remove this variable when you don't need extended logging.

## Slurm example

When deploying a Slurm job on a cluster configured for RoCE (for example, a SUNK cluster using `gb300-4x-e` NodeSets), set the required environment variables as shown in the following example. Remove `NCCL_DEBUG` when you don't need extended logging.

```bash
#!/bin/bash

#SBATCH --partition gb300
#SBATCH --nodes 16
#SBATCH --ntasks-per-node 4
#SBATCH --gpus-per-node 4
# [...] other SBATCH options, as needed

# Control / OOB traffic over front-end Ethernet
export NCCL_SOCKET_IFNAME=eth0

# RoCE RDMA devices inside the compute Pods
# Adjust this to match the RoCE interfaces provided in your cluster
export NCCL_IB_HCA=net1

# Optional: UCX configuration if your stack uses UCX
export UCX_TLS=tcp
export UCX_NET_DEVICES=net1

# Optional: extended NCCL logging while validating RoCE behavior
export NCCL_DEBUG=INFO  # Remove NCCL_DEBUG when you don't need debug logging
```

## Test with NCCL

CoreWeave provides several sample NCCL test jobs for use with MPI Operator or Slurm. These jobs live in the [`nccl-tests` repository](https://github.com/coreweave/nccl-tests/blob/master/README.md#running-nccl-tests), and you can use them to test GPUDirect RDMA support with RoCE.
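The `nccl-tests` binaries end their output with an `# Avg bus bandwidth` summary line, which makes a simple pass/fail check possible. The following sketch extracts that value and compares it with a minimum. The helper names, the output file name, and the threshold are hypothetical; choose a threshold that fits your instance type and message sizes.

```bash
#!/bin/bash
# avg_busbw (hypothetical helper): print the number from the first
# "# Avg bus bandwidth : <value>" line found on stdin.
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# busbw_at_least (hypothetical helper): succeed when the reported average
# bus bandwidth on stdin is at least the minimum given as the first argument.
busbw_at_least() (
  min="$1"
  value="$(avg_busbw)"
  [ -n "$value" ] || { echo "no Avg bus bandwidth line found" >&2; exit 2; }
  awk -v v="$value" -v m="$min" 'BEGIN { exit !(v + 0 >= m + 0) }'
)

# Example usage with a hypothetical output file and threshold (GB/s):
#   busbw_at_least 100 < all_reduce_perf.out && echo "bandwidth OK"
```

A result far below what you expect for the fabric can indicate that traffic isn't using GPUDirect RDMA, so compare it with the NCCL debug log before tuning further.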
