Use GPUDirect RDMA with InfiniBand

Learn how to use GPUDirect with InfiniBand

In this guide, learn how to use GPUDirect RDMA with InfiniBand at CoreWeave, and how to test it with NCCL.

Prerequisites

CoreWeave supports GPUDirect RDMA over InfiniBand for some instance types.

To use this feature, you must:

  • select instance types that support InfiniBand,
  • install the required user-space components to workload Pods, and then
  • schedule workload Pods to run on Nodes equipped with InfiniBand.

Procedure

Select instance types that support InfiniBand

To use GPUDirect RDMA, first ensure that you have deployed a Node Pool that selects Nodes with InfiniBand support.

Info

To learn which instance types support InfiniBand, see the list of supported instance types. All servers with InfiniBand support already have the required Mellanox OFED kernel drivers pre-installed.

Verify that the Node Pool is healthy and ready to run workloads.
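
One way to confirm this is to check that the Nodes report Ready and advertise the rdma/ib resource used later in this guide. The commands below are a sketch; the Node name is a placeholder for one of your own Nodes.

Example
# List Nodes and confirm they report Ready
$ kubectl get nodes
# Confirm a Node advertises the rdma/ib resource (Node name is a placeholder)
$ kubectl describe node <node-name> | grep rdma/ib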

Install the required user-space components to workload Pods

Once the Node Pool is in a healthy state, the necessary user-space components must be installed in the Pod image, namely the Mellanox OFED user-space drivers and NCCL.

Tip

CoreWeave publishes a repository of Dockerfiles with NCCL and the required OFED drivers pre-installed, which you can use for testing or as templates for your own distributed training workloads.
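
If you prefer to build your own image, the sketch below shows the general shape of such a Dockerfile, assuming an Ubuntu-based NVIDIA CUDA base image. The base image tag and package list are assumptions that depend on your CUDA version; treat CoreWeave's published Dockerfiles as the authoritative reference.

Example
# Sketch only: the base image tag and package list are assumptions.
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

# InfiniBand user-space libraries (RDMA verbs) plus NCCL
RUN apt-get update && apt-get install -y --no-install-recommends \
        libibverbs1 librdmacm1 ibverbs-providers ibverbs-utils rdma-core \
        infiniband-diags perftest \
        libnccl2 libnccl-dev \
    && rm -rf /var/lib/apt/lists/*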

Enable GPUDirect RDMA support for NCCL

To enable GPUDirect RDMA support for NCCL, set the following environment variables:

Example
# NCCL environment variables are documented at:
# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
$ export NCCL_SOCKET_IFNAME=eth0
$ export NCCL_IB_HCA=ibp
$ export UCX_NET_DEVICES=ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1
Note

Environment variables can also be set statically in /etc/nccl.conf, or in ~/.nccl.conf.
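
For reference, here is what such a file could look like with the NCCL settings from the example above. Note that UCX_NET_DEVICES is read by UCX rather than NCCL, so it should still be set as an environment variable as shown earlier.

Example
# /etc/nccl.conf (or ~/.nccl.conf): one variable per line
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=ibp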

Force RDMA over InfiniBand

To force NCCL to use GPUDirect RDMA for communication, even when other channels are available, set NCCL_NET_GDR_LEVEL to SYS:

Example
$ export NCCL_NET_GDR_LEVEL=SYS

If this variable isn't set, NCCL tests may still succeed over a different channel, which can mask problems with RDMA. See the NCCL documentation to learn more about the available options.
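
In a Kubernetes workload, these variables are typically set on the container rather than exported in an interactive shell. The snippet below is a sketch of a container env block using the values from this guide; the container name is a placeholder.

Example
[...]
containers:
  - name: example
    env:
      - name: NCCL_SOCKET_IFNAME
        value: "eth0"
      - name: NCCL_IB_HCA
        value: "ibp"
      - name: UCX_NET_DEVICES
        value: "ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1"
      - name: NCCL_NET_GDR_LEVEL
        value: "SYS"
[...]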

Extended logging

During testing, you can set the NCCL_DEBUG environment variable to INFO to print debug information from NCCL. This increases the verbosity of the output, which can help you diagnose issues with RDMA support. However, it also increases the log file size, so disable it after testing is complete.

Example
$ export NCCL_DEBUG=INFO
Info

For more information, see the official NCCL documentation.
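
If the INFO output is too verbose to sift through, NCCL also provides NCCL_DEBUG_SUBSYS to restrict logging to specific subsystems. For example, limiting output to initialization and network activity surfaces the InfiniBand and GPUDirect details most relevant here:

Example
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=INIT,NET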

Schedule workload Pods to run on Nodes with InfiniBand

Now that the Node Pool has been configured to select instance types with InfiniBand support, schedule Pods onto those Nodes by setting the rdma/ib resource to 1 in the Pod spec:

Example
[...]
spec:
  containers:
    - name: example
      resources:
        requests:
          cpu: 10
          memory: 10Gi
          rdma/ib: 1
          nvidia.com/gpu: 8
        limits:
          cpu: 10
          memory: 10Gi
          rdma/ib: 1
          nvidia.com/gpu: 8
[...]

This value is a boolean used to schedule Pods onto servers featuring InfiniBand support. It does not indicate the number of InfiniBand devices requested.

Note

Kubernetes schedules resources through requests and limits. When only limits are specified, the requests are set to the same amount as the limit. To learn more about container resource management on Kubernetes, see the official Kubernetes documentation.
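
After the Pod starts, you can confirm where it was scheduled and review its resource requests. The Pod name below is a placeholder.

Example
# Show which Node the Pod was scheduled onto (Pod name is a placeholder)
$ kubectl get pod <pod-name> -o wide
# Inspect the Pod's resource requests and limits, including rdma/ib
$ kubectl describe pod <pod-name>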

Testing with NCCL

CoreWeave's nccl-tests repository provides several sample NCCL test jobs designed for use with the MPI Operator or Slurm, which you can use to test GPUDirect RDMA support with InfiniBand. For more information, refer to the testing instructions in the repository.
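
For context, the tests themselves come from NVIDIA's nccl-tests suite, whose binaries such as all_reduce_perf are typically launched through MPI. The command below is only a sketch of such a launch, not the job definition from CoreWeave's repository; the process count, hostfile, and binary path are placeholders.

Example
# Sketch of a 16-process all_reduce_perf run (hostfile and paths are placeholders)
$ mpirun -np 16 --hostfile /path/to/hostfile \
    -x NCCL_SOCKET_IFNAME -x NCCL_IB_HCA -x NCCL_NET_GDR_LEVEL -x NCCL_DEBUG \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1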