Use GPUDirect RDMA with InfiniBand
Learn how to use GPUDirect RDMA with InfiniBand
In this guide, learn how to use GPUDirect RDMA with InfiniBand at CoreWeave, and how to test it with NCCL.
Prerequisites
CoreWeave supports GPUDirect RDMA over InfiniBand for some instance types.
To use this feature, you must:
- select instance types that support InfiniBand,
- install the required user-space components in workload Pods, and then
- schedule workload Pods to run on Nodes equipped with InfiniBand.
Procedure
Select instance types that support InfiniBand
To use GPUDirect RDMA, first ensure that you have deployed a Node Pool that selects Nodes with InfiniBand support.
To learn which instance types support InfiniBand, see the list of supported instance types. All servers with InfiniBand support already have the required Mellanox OFED kernel drivers pre-installed.
Verify that the Node Pool is healthy and ready to run workloads.
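As a quick sanity check, you can confirm that Nodes in the pool advertise the rdma/ib resource requested by workload Pods later in this guide. This is a minimal sketch; the exact Allocatable entries depend on the instance type:
# Inspect a Node from the Node Pool and confirm it advertises the
# rdma/ib resource alongside its GPUs.
$ kubectl describe node <node-name> | grep -E 'rdma/ib|nvidia.com/gpu'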
Install the required user-space components to workload Pods
Once the Node Pool is in a healthy state, the necessary user-space components, such as NCCL and the OFED user-space libraries, must be installed in the Pod image.
CoreWeave publishes a repository of Dockerfiles with NCCL and the required OFED drivers pre-installed, which you can use for testing or as templates for your own distributed training workloads.
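Assuming an image built from those Dockerfiles, or your own image with the OFED user-space libraries and NCCL installed, the following sketch shows one way to verify the components from inside a running Pod (the exact tools available depend on the base image):
# Run inside the workload Pod to verify the user-space stack.
$ ofed_info -s                          # installed OFED user-space version
$ ibv_devinfo | grep -E 'hca_id|state'  # InfiniBand devices and port state
$ ldconfig -p | grep libnccl            # confirms the NCCL library is present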
Enable GPUDirect RDMA support for NCCL
To enable GPUDirect RDMA support for NCCL, set the following environment variables:
# NCCL environment variables are documented at:
# https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
$ export NCCL_SOCKET_IFNAME=eth0
$ export NCCL_IB_HCA=ibp
$ export UCX_NET_DEVICES=ibp0:1,ibp1:1,ibp2:1,ibp3:1,ibp4:1,ibp5:1,ibp6:1,ibp7:1
Environment variables can also be set statically in /etc/nccl.conf, or in ~/.nccl.conf.
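The NCCL configuration file uses one KEY=VALUE pair per line, without export. As a minimal sketch, the NCCL settings above could be written as:
# /etc/nccl.conf (or ~/.nccl.conf): one KEY=VALUE pair per line, no "export"
NCCL_SOCKET_IFNAME=eth0
NCCL_IB_HCA=ibp
Note that UCX_NET_DEVICES is read by UCX rather than NCCL, so it still needs to be set as an environment variable.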
Force RDMA over InfiniBand
To force NCCL to use GPUDirect RDMA for communication, even when other channels are available, set NCCL_NET_GDR_LEVEL to SYS:
$ export NCCL_NET_GDR_LEVEL=SYS
NCCL testing could still succeed over a different channel if this variable isn't set. See the NCCL documentation to learn more about the options available.
Extended logging
During testing, you can set the NCCL_DEBUG environment variable to INFO to print debug information from NCCL. The increased verbosity can help you diagnose issues with RDMA support; however, it also increases the log size, so disable this option after testing is complete.
$ export NCCL_DEBUG=INFO
For more information, see the official NCCL documentation.
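With NCCL_DEBUG=INFO set, you can filter the output of an NCCL workload for InfiniBand and GPUDirect activity. The exact log wording varies between NCCL versions, so treat this as a rough sanity check rather than a definitive test:
# Filter NCCL INFO output for InfiniBand / GPUDirect-related lines.
# (Log formats differ between NCCL versions.)
$ grep -E 'NET/IB|GDRDMA' /path/to/test-output.log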
Schedule workload Pods to run on Nodes with InfiniBand
Now that the Node Pool has been configured to select instance types with InfiniBand support, Pods must be scheduled onto these servers by setting the rdma/ib value to 1:
[...]
spec:
  containers:
  - name: example
    resources:
      requests:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
[...]
This value is a boolean used to schedule Pods onto servers featuring InfiniBand support. It does not indicate the number of InfiniBand devices requested.
Kubernetes schedules resources through requests and limits. When only limits are specified, the requests are set to the same values as the limits. To learn more about container resource management on Kubernetes, see the official Kubernetes documentation.
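For example, a container spec that sets only limits is equivalent to the example above. This is a sketch of the resources stanza only, not a complete Pod manifest:
[...]
    resources:
      limits:
        cpu: 10
        memory: 10Gi
        rdma/ib: 1
        nvidia.com/gpu: 8
      # No requests are specified, so Kubernetes defaults each request
      # to the corresponding limit above.
[...]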
Testing with NCCL
CoreWeave's nccl-tests repository provides several sample NCCL test jobs designed for use with MPI Operator or Slurm, which you can use to test GPUDirect RDMA support with InfiniBand. For more information, refer to the testing instructions in the repository.
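For multi-node runs, use the MPI Operator or Slurm job templates from the repository. As an illustration only (the exact build path and flags depend on how nccl-tests is built in your image), a single-node all_reduce_perf run looks roughly like this:
# Run the all_reduce_perf benchmark from nccl-tests on 8 local GPUs,
# sweeping message sizes from 8 bytes to 4 GiB.
$ export NCCL_DEBUG=INFO
$ export NCCL_NET_GDR_LEVEL=SYS
$ ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 8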