In this guide, learn how to use GPUDirect RDMA with RoCE-based backend fabrics at CoreWeave, and how to test it with NCCL.
Prerequisites
CoreWeave supports GPUDirect RDMA over RoCE (RDMA over Converged Ethernet) for some GPU instance types built on Spectrum-X Ethernet fabrics. To use this feature, you must:
- select a Node Pool with RoCE support,
- install NCCL (and UCX / OFED userspace as needed) in the Pod image, then
- configure the Pods to use GPUDirect RDMA over RoCE.
Select a Node Pool with RoCE support
To use GPUDirect RDMA over RoCE, make sure the Node Pool has Nodes connected to a RoCE-capable fabric, as shown in our list of GPU instance types and your contract. All Nodes on these clusters have the required RoCE kernel drivers and firmware pre-installed. CKS manages the RoCE driver, NIC configuration, and fabric integration. To avoid Node instability, you should not install additional low-level driver management tools inside your Pods. If you’re unsure whether a given cluster or Node Pool has RoCE enabled, contact your CoreWeave representative.
Configure the Pods
The Pods must be configured to use GPUDirect RDMA over RoCE. Follow these steps:
- Request the RoCE RDMA resource so Pods land on Nodes with RoCE.
- Attach RoCE interfaces into the Pod using Multus.
- Configure NCCL (and optionally UCX) to use those interfaces.
On SUNK clusters (for example, those using gb300-4x-e NodeSets), steps (1) and (2) are handled at the NodeSet level; for direct Kubernetes workloads, you configure them explicitly in the Pod spec.
1. Request the RoCE RDMA resource
Set the value of spec.containers.resources.requests.rdma/ib to 1.
This value does not indicate the number of RoCE devices requested; it’s used as a boolean to schedule Pods onto servers that expose RoCE RDMA resources.
Kubernetes schedules resources through requests and limits. When only limits are specified, the requests are set to the same amount as the limit. To learn more about container resource management on Kubernetes, see the official Kubernetes documentation.
See the full YAML example below for reference showing how to set the rdma/ib value in the Pod spec for both requests and limits.
2. Attach RoCE interfaces with Multus
RoCE backend interfaces are exposed into Pods using Multus CNI and NetworkAttachmentDefinition (NAD) objects. On RoCE-enabled clusters, CoreWeave defines NADs that map host RoCE devices (for example, Spectrum-X ibs#p# ports) into Pod network interfaces via MACVLAN and VRF configuration.
On some clusters, you may be given a set of per-port NADs (for example, ibs0p0-macvlan, ibs0p1-macvlan, …). In that case, the k8s.v1.cni.cncf.io/networks annotation contains one entry per backend interface. Keep the following in mind:
- The name and namespace of each entry must match the RoCE NADs configured in your cluster.
- The attached interfaces appear inside the Pod as additional network devices (for example, net1, net2, …) and are used by NCCL/UCX for GPUDirect RDMA traffic.
- Some clusters may also provide a single “backend” NAD name; your CoreWeave representative can provide the correct annotation for your cluster.
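As an illustration of the per-port case described above, the annotation might look like the following sketch. It uses the JSON list form that Multus accepts for this annotation. The NAD names follow the per-port pattern mentioned in this section, and the namespace is a placeholder; both are assumptions that must be replaced with the values configured in your cluster.

```yaml
# Fragment of Pod metadata (sketch only).
metadata:
  annotations:
    # One JSON entry per backend interface.
    # "ibs0p0-macvlan" / "ibs0p1-macvlan" are example NAD names;
    # "<nad-namespace>" is a placeholder, not a real namespace.
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "ibs0p0-macvlan", "namespace": "<nad-namespace>"},
        {"name": "ibs0p1-macvlan", "namespace": "<nad-namespace>"}
      ]
```

With two entries like these, Multus would typically attach the interfaces inside the Pod as net1 and net2, in the order listed.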
3. Configure NCCL (and optional UCX) for RoCE
Configure the Pod to use GPUDirect RDMA over RoCE by setting these environment variables:
- NCCL_SOCKET_IFNAME: The front-end interface name used for NCCL’s TCP-based control and out-of-band communication. Commonly set to the primary Pod interface, for example: eth0.
- NCCL_IB_HCA: The RDMA interface(s) used by NCCL for GPUDirect RDMA collectives. On GB300 RoCE clusters, this is typically the ibp interface. On other clusters, it should match the RoCE device naming or the Pod interfaces created by Multus (for example, a specific interface name or a comma-separated list such as net1,net2), depending on how your cluster is configured.
- UCX_NET_DEVICES (optional, for UCX-based stacks): The network devices to use for UCX communication. This is usually set to the same interfaces used for RoCE-based RDMA traffic.
- UCX_TLS (optional, for UCX-based stacks): The UCX transports to enable. For RoCE, you might use tcp (and other transports as needed), depending on your UCX configuration and application requirements.
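The variables above could be set in a container spec roughly as follows. This is a minimal sketch: the values are taken from the examples in this section (eth0, ibp, tcp), while net1,net2 for UCX_NET_DEVICES is an assumption based on the Multus interface names mentioned earlier. The correct values depend on how your cluster is configured.

```yaml
# Fragment of a container spec (sketch only).
env:
  - name: NCCL_SOCKET_IFNAME
    value: "eth0"          # primary Pod interface for NCCL control traffic
  - name: NCCL_IB_HCA
    value: "ibp"           # typical on GB300 RoCE clusters; may differ elsewhere
  # Optional, only for UCX-based stacks:
  - name: UCX_NET_DEVICES
    value: "net1,net2"     # assumed Multus-attached interface names
  - name: UCX_TLS
    value: "tcp"           # add other transports as your application requires
```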
4. (Optional) Enable extended logging with the NCCL_DEBUG environment variable
To increase the verbosity of NCCL’s logging, set the NCCL_DEBUG environment variable to INFO. This can help diagnose issues with RDMA support, but it increases the log file size, so it should be disabled when testing is complete. See NCCL_DEBUG in the NCCL documentation for more logging options.
Kubernetes example
When deploying a Kubernetes Pod in the cluster, the Pod spec must:
- set the rdma/ib value for both requests and limits,
- attach RoCE backend interfaces via the k8s.v1.cni.cncf.io/networks annotation, and
- set the required environment variables.
Setting NCCL_DEBUG to INFO enables extended logging and can be removed if extended logging is not required.
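The three requirements above can be combined in a single Pod manifest along the lines of this sketch. The Pod name, container name, image, NAD names, and namespace are all hypothetical placeholders, and GPU resource requests are omitted for brevity; treat this as a starting point under those assumptions rather than a ready-to-apply manifest.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: roce-nccl-example              # hypothetical Pod name
  annotations:
    # Attach RoCE backend interfaces; NAD names and namespace are placeholders.
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "ibs0p0-macvlan", "namespace": "<nad-namespace>"},
        {"name": "ibs0p1-macvlan", "namespace": "<nad-namespace>"}
      ]
spec:
  containers:
    - name: nccl                       # hypothetical container name
      image: <your-nccl-image>         # placeholder; must include NCCL
      resources:
        # rdma/ib acts as a boolean for scheduling onto RoCE-capable Nodes.
        # GPU requests are omitted here for brevity.
        requests:
          rdma/ib: 1
        limits:
          rdma/ib: 1
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        - name: NCCL_IB_HCA
          value: "ibp"                 # typical on GB300 RoCE clusters
        - name: NCCL_DEBUG
          value: "INFO"                # optional; remove after testing
```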
Slurm example
When deploying a Slurm job on a cluster configured for RoCE (for example, a SUNK cluster using gb300-4x-e NodeSets), set the required environment variables from step 3 in the job script. Remove NCCL_DEBUG unless extended logging is needed.
Testing with NCCL
CoreWeave has several sample NCCL test jobs designed for use with MPI Operator or Slurm. These are in the nccl-tests repository, which can be used to test GPUDirect RDMA support with RoCE. For more information, refer to instructions for testing in the repository.