With ncore-image v.2.10.1, you can use NVSHMEM and GDRCopy in your container image for high-performance GPU-to-GPU communication. This page is for cluster operators and workload engineers who need to enable direct GPU-to-GPU memory access on SUNK or CKS nodes.
Overview
NVSHMEM (NVIDIA SHMEM) and GDRCopy (GPU Direct RDMA Copy) enable direct memory access between GPUs without involving the CPU, reducing latency and increasing throughput for certain workloads. The following sections describe how to obtain the supported image, what modifications it includes, and how to configure your containers to use NVSHMEM and GDRCopy.
Access the image
To gain access to ncore-image v.2.10.1, contact CoreWeave Support.
Image modifications
ncore-image v.2.10.1 contains the following modifications to support NVSHMEM usage with ibgda. These modifications are applied at the node level and do not require changes from workload authors.
The image includes the following NVIDIA driver options:
nvidia.NVreg_EnableStreamMemOPs=1
nvidia.NVreg_RegistryDwords="PeerMappingOverride=1;"
The image also includes the GDRCopy kernel driver package gdrdrv-dkms_2.5-1.
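To confirm that a node actually booted with these driver options, you can compare them against what the loaded driver reports. The following is a minimal sketch, not an official CoreWeave tool. It assumes the NVIDIA driver exposes its settings as `Key: value` lines in `/proc/driver/nvidia/params`; the helper names `parse_params` and `check_params` are hypothetical.

```python
from pathlib import Path

# Assumption: the loaded NVIDIA driver lists its registry settings in this procfs file.
PARAMS_PATH = Path("/proc/driver/nvidia/params")

# Expected values, taken from the driver options listed above.
EXPECTED = {
    "EnableStreamMemOPs": "1",
    "RegistryDwords": "PeerMappingOverride=1;",
}


def parse_params(text):
    """Parse 'Key: value' lines into a dict, stripping surrounding quotes."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip().strip('"')
    return params


def check_params(text, expected=EXPECTED):
    """Return {key: (expected, actual)} for every setting that does not match.

    The comparison is exact, so a RegistryDwords string that carries extra
    entries is reported as a mismatch.
    """
    actual = parse_params(text)
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        mismatches = check_params(PARAMS_PATH.read_text())
        print("driver options OK" if not mismatches else f"mismatches: {mismatches}")
    else:
        print(f"{PARAMS_PATH} not found; is the NVIDIA driver loaded?")
```

Run the script on the node itself, or in a container that can read the host's `/proc/driver/nvidia` directory. An empty mismatch dict means both options are in effect.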
Use the image
When you use this image, you must complete the following steps so that your containers can access GDRCopy and NVSHMEM correctly.
Enable the GDRCopy environment variable
Set the environment variable in the container to enable GDRCopy. This lets the container access gdrdrv:
If you’re using Slurm, this environment variable is already set.
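Once the variable is set, a quick check from inside the container can show whether gdrdrv is visible. This is a minimal sketch under one assumption: that the GDRCopy driver appears as a character device at `/dev/gdrdrv`. The function name `gdrdrv_status` is hypothetical.

```python
import os
import stat

# Assumption: the GDRCopy kernel driver exposes a character device at this path.
GDRDRV_DEVICE = "/dev/gdrdrv"


def gdrdrv_status(path=GDRDRV_DEVICE):
    """Return a short status string describing whether the device looks usable."""
    if not os.path.exists(path):
        return "missing"
    if not stat.S_ISCHR(os.stat(path).st_mode):
        return "not a character device"
    if not os.access(path, os.R_OK | os.W_OK):
        return "no read/write permission"
    return "ok"


if __name__ == "__main__":
    print(f"{GDRDRV_DEVICE}: {gdrdrv_status()}")
```

A result of `missing` inside the container, on a node where the driver is loaded, suggests the environment variable was not applied to that container.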
Patch NVSHMEM ibgda
In NVSHMEM version 3.2.5, you must patch ibgda in one or more of your containers so that NVSHMEM recognizes the InfiniBand devices presented on CoreWeave nodes.
Download NVSHMEM version 3.2.5.
In src/modules/transport/ibgda/ibgda.cpp, change mlx5 to ibp on line 3659 so that NVSHMEM works on SUNK and CKS nodes.
Original code: