Prerequisites
CoreWeave supports GPUDirect RDMA over RoCE (RDMA over Converged Ethernet) for some GPU instance types built on Spectrum-X Ethernet fabrics. To use this feature, you must:
- Select a Node Pool with RoCE support.
- Install NCCL, and UCX or OFED userspace as needed, in the Pod image.
- Configure the Pods to use GPUDirect RDMA over RoCE.
Select a Node Pool with RoCE support
To use GPUDirect RDMA over RoCE, make sure the Node Pool has Nodes connected to a RoCE-capable fabric, as shown in the GPU instance types list and your contract. All Nodes on these clusters have the required RoCE kernel drivers and firmware pre-installed. CKS manages the RoCE driver, NIC configuration, and fabric integration. To avoid Node instability, don’t install additional low-level driver management tools inside your Pods. If you’re unsure whether a given cluster or Node Pool has RoCE enabled, contact your CoreWeave representative.
Configure the Pods
Configure the Pods to use GPUDirect RDMA over RoCE. The following sections describe the three required configuration tasks, with an optional fourth task for debug logging:
- Request the RoCE RDMA resource so Pods land on Nodes with RoCE.
- Attach RoCE interfaces into the Pod using Multus.
- Configure NCCL (and optionally UCX) to use those interfaces.
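For Slurm workloads on SUNK clusters that use RoCE NodeSets (for example, gb300-4x-e), the NodeSet handles the first two tasks: requesting the RoCE RDMA resource and attaching the RoCE interfaces. For direct Kubernetes workloads, you configure them explicitly in the Pod spec.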
After you complete these steps, your Pods run on RoCE-capable Nodes, have the RoCE backend interfaces attached, and use NCCL configured for GPUDirect RDMA traffic.
Request the RoCE RDMA resource
Set the value of spec.containers.resources.requests.rdma/ib to 1.
This value doesn’t indicate the number of RoCE devices requested. Kubernetes uses it as a boolean to schedule Pods onto servers that expose RoCE RDMA resources.
Kubernetes schedules resources through requests and limits. When you specify only limits, Kubernetes sets requests to the same amount as the limit. For more information, see the Kubernetes documentation on container resource management.
For a full YAML example showing how to set the rdma/ib value in the Pod spec for both requests and limits, see Kubernetes example.
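As a minimal sketch of the request described above, the following writes only the container-level resources stanza to a scratch file so you can merge it into a Pod spec. The file name rdma-resources.yaml is an arbitrary choice for illustration, not something CoreWeave defines.

```shell
# Write the container-level resources stanza to a scratch file.
# The file name is arbitrary; merge the stanza into your own Pod spec.
cat > rdma-resources.yaml <<'EOF'
resources:
  requests:
    rdma/ib: 1   # acts as a scheduling flag, not a device count
  limits:
    rdma/ib: 1   # if both are set, requests and limits must be equal
EOF
```

Setting both fields is slightly redundant, because Kubernetes copies limits into requests when only limits are given, but it makes the intent explicit.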
Attach RoCE interfaces with Multus
CoreWeave exposes RoCE backend interfaces into Pods using Multus CNI and NetworkAttachmentDefinition (NAD) objects. On RoCE-enabled clusters, CoreWeave defines NADs that map host RoCE devices (for example, Spectrum-X ibs#p# ports) into Pod network interfaces through MACVLAN and VRF configuration.
On some clusters, CoreWeave provides a set of per-port NADs (for example, ibs0p0-macvlan and ibs0p1-macvlan). In that case, the k8s.v1.cni.cncf.io/networks annotation contains one entry per backend interface. Keep the following in mind:
- In each entry, name and namespace must match the RoCE NADs configured in your cluster.
- The attached interfaces appear inside the Pod as additional network devices (for example, net1 and net2), and NCCL and UCX use them for GPUDirect RDMA traffic.
- Some clusters also provide a single “backend” NAD name. Your CoreWeave representative can provide the correct annotation for your cluster.
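The per-port annotation described above might look like the following sketch, which uses the JSON list form of the Multus annotation. The two NAD names come from the example in the text. The namespace value and the output file name are placeholders, so replace them with the values for your cluster.

```shell
# Assumption: replace with the namespace that holds your cluster's RoCE NADs.
NAD_NAMESPACE="default"

# One annotation entry per backend interface. Multus names the attached
# interfaces net1, net2, ... in the order the entries are listed.
cat > roce-networks-annotation.yaml <<EOF
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "ibs0p0-macvlan", "namespace": "${NAD_NAMESPACE}"},
        {"name": "ibs0p1-macvlan", "namespace": "${NAD_NAMESPACE}"}
      ]
EOF
```

If your cluster provides a single backend NAD instead, the list would contain one entry with that name.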
Configure NCCL and UCX for RoCE
To configure the Pod to use GPUDirect RDMA over RoCE, set these environment variables:
- NCCL_SOCKET_IFNAME: The front-end interface name for NCCL’s TCP-based control and out-of-band communication. Commonly set to the primary Pod interface, for example eth0.
- NCCL_IB_HCA: The RDMA interfaces that NCCL uses for GPUDirect RDMA collectives. On GB300 RoCE clusters, this is the ibp interface. On other clusters, it must match the RoCE device naming or the Pod interfaces that Multus creates (for example, a specific interface name or a comma-separated list such as net1,net2), depending on how you configured your cluster.
- UCX_NET_DEVICES (optional, for UCX-based stacks): The network devices to use for UCX communication. Set this to the same interfaces that handle RoCE-based RDMA traffic.
- UCX_TLS (optional, for UCX-based stacks): The UCX transports to enable. For RoCE, you might use tcp (and other transports as needed), depending on your UCX configuration and application requirements.
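As an illustration of the variables above, the following sketch exports them in a container entrypoint or job script. It assumes a cluster where Multus attaches the RoCE interfaces as net1 and net2. The ROCE_IFACES helper variable is hypothetical and exists only to keep the NCCL and UCX settings in sync.

```shell
# Assumption: Multus attached the RoCE interfaces as net1 and net2.
# On GB300 RoCE clusters the text gives "ibp" as the NCCL_IB_HCA value instead.
ROCE_IFACES="net1,net2"   # hypothetical helper variable

# Front-end interface for NCCL's TCP control and out-of-band traffic.
export NCCL_SOCKET_IFNAME=eth0

# Interfaces NCCL uses for GPUDirect RDMA collectives.
export NCCL_IB_HCA="${ROCE_IFACES}"

# Optional, UCX-based stacks only: point UCX at the same interfaces.
export UCX_NET_DEVICES="${ROCE_IFACES}"

# Optional, UCX-based stacks only: tcp is the value named in the text;
# add other transports as your UCX configuration requires.
export UCX_TLS=tcp
```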
Optional: Enable extended NCCL logging
To increase the verbosity of NCCL’s logging, set the NCCL_DEBUG environment variable to INFO for extra debug information. This can help diagnose issues with RDMA support, but it increases the log file size, so disable it when testing is complete. For more logging options, see NCCL_DEBUG in the NCCL documentation.
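A short sketch of turning extended logging on and off again follows. NCCL_DEBUG=INFO is the setting described above. NCCL_DEBUG_SUBSYS and NCCL_DEBUG_FILE are additional NCCL variables not mentioned in this page, included here as one possible way to keep the log volume manageable. The log path is an arbitrary example.

```shell
# Enable extended NCCL logging for a test run.
export NCCL_DEBUG=INFO

# Optional addition: restrict output to the init and network subsystems.
export NCCL_DEBUG_SUBSYS=INIT,NET

# Optional addition: write one log file per process instead of stdout.
# NCCL expands %h to the hostname and %p to the process ID.
export NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p.log

# When testing is complete, disable extended logging again:
# unset NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_DEBUG_FILE
```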
Kubernetes example
When deploying a Kubernetes Pod in the cluster, make sure the Pod manifest does the following:
- Sets the rdma/ib value in the Pod spec for both requests and limits.
- Attaches RoCE backend interfaces through the k8s.v1.cni.cncf.io/networks annotation.
- Sets the required environment variables.
Setting NCCL_DEBUG to INFO enables extended logging. Remove this variable when you don’t need extended logging.
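The three settings above can be combined as in the following sketch, which writes a hypothetical Pod manifest to a file for review. The Pod name, container name, image, command, and NAD namespace are placeholders, not values from CoreWeave. The NAD names and interface names reuse the per-port examples from earlier in this page. GPU resource requests are omitted for brevity, so add them for a real workload.

```shell
# Write a sketch manifest; review and adjust it before applying.
cat > roce-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: roce-example            # hypothetical Pod name
  annotations:
    # Attach the RoCE backend interfaces (placeholder namespace).
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "ibs0p0-macvlan", "namespace": "default"},
        {"name": "ibs0p1-macvlan", "namespace": "default"}
      ]
spec:
  restartPolicy: Never
  containers:
    - name: nccl                # hypothetical container name
      image: registry.example.com/your-nccl-image:latest  # placeholder image
      command: ["sleep", "infinity"]
      env:
        - name: NCCL_SOCKET_IFNAME
          value: eth0
        - name: NCCL_IB_HCA
          value: net1,net2      # assumption: interfaces created by Multus
        - name: NCCL_DEBUG      # remove when you don't need extended logging
          value: INFO
      resources:
        requests:
          rdma/ib: 1
        limits:
          rdma/ib: 1
EOF

# Apply once the placeholders are replaced:
# kubectl apply -f roce-pod.yaml
```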
Slurm example
When deploying a Slurm job on a cluster configured for RoCE (for example, a SUNK cluster using gb300-4x-e NodeSets), set the required environment variables in the job script. Remove NCCL_DEBUG when you don’t need extended logging.
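A minimal sketch of such a job script follows, written to a file so you can inspect it before submitting. The job name, node and task counts, and the path to the all_reduce_perf binary from nccl-tests are assumptions for illustration. The NCCL_IB_HCA value is the one this page gives for GB300 RoCE clusters.

```shell
# Write a sketch batch script; adjust the placeholders before submitting.
cat > roce-nccl-test.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=roce-nccl-test   # hypothetical job name
#SBATCH --nodes=2                   # assumption: two-node test
#SBATCH --ntasks-per-node=4         # assumption: one task per GPU
#SBATCH --gpus-per-node=4           # assumption: four GPUs per Node

# Front-end interface for NCCL control traffic.
export NCCL_SOCKET_IFNAME=eth0
# RDMA interface prefix for GB300 RoCE clusters, per this page.
export NCCL_IB_HCA=ibp
# Extended logging; remove when you don't need it.
export NCCL_DEBUG=INFO

# Hypothetical path to an MPI-enabled nccl-tests binary. Your cluster
# may need extra srun flags depending on its MPI configuration.
srun ./all_reduce_perf -b 8 -e 1G -f 2 -g 1
EOF

# Submit once the placeholders are replaced:
# sbatch roce-nccl-test.sbatch
```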
Test with NCCL
CoreWeave provides several sample NCCL test jobs for use with MPI Operator or Slurm. These jobs live in the nccl-tests repository, and you can use them to test GPUDirect RDMA support with RoCE.
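After a multi-node test run with NCCL_DEBUG set to INFO, one rough way to see which transport NCCL chose is to search the log for its channel setup lines. The helper below is a hypothetical sketch, not part of the nccl-tests repository. It assumes the "via NET/IB/.../GDRDMA" and "via NET/Socket" markers that NCCL prints in INFO output, and the exact wording can differ between NCCL versions.

```shell
# check_nccl_transport LOGFILE
# Prints a rough classification of the network transport seen in an NCCL log:
#   gdrdma       RDMA transport with GPUDirect RDMA
#   rdma-no-gdr  RDMA transport without GPUDirect RDMA
#   socket       fallback to TCP sockets
#   unknown      no network channel lines found
check_nccl_transport() {
  local log="$1"
  if grep -q 'via NET/IB/.*GDRDMA' "$log"; then
    echo "gdrdma"
  elif grep -q 'via NET/IB' "$log"; then
    echo "rdma-no-gdr"
  elif grep -q 'via NET/Socket' "$log"; then
    echo "socket"
  else
    echo "unknown"
  fi
}
```

For example, `check_nccl_transport nccl-test.log` (with a log file name of your choosing) prints one of the four labels. A result of unknown is expected when all ranks communicate without the network transport, such as in a single-Node run.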