NVSHMEM and GDRCopy Support
Use NVSHMEM and GDRCopy for high-performance GPU-to-GPU communication
With the release of ncore-image v.2.10.1, you can use NVSHMEM and GDRCopy in your container image for high-performance GPU-to-GPU communication.
Overview
NVSHMEM (NVIDIA SHMEM) and GDRCopy (GPUDirect RDMA Copy) enable direct memory access between GPUs without involving the CPU, which reduces latency and increases throughput for communication-heavy workloads.
Accessing the image
To gain access to ncore-image v.2.10.1, contact CoreWeave Support.
Image modifications
ncore-image v.2.10.1 contains the following modifications to support NVSHMEM usage with ibgda:
NVIDIA driver options
nvidia.NVreg_EnableStreamMemOPs=1
nvidia.NVreg_RegistryDwords="PeerMappingOverride=1;"
GDRCopy driver
gdrdrv-dkms_2.5-1
Using the image
When using this image:
1. Enable GDRCopy environment variable
Make sure the environment variable in the container is set to enable GDRCopy. This allows you to access gdrdrv:
env:
  - name: NVIDIA_GDRCOPY
    value: enabled
If you're using SLURM, this environment variable is already set.
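As a quick sanity check from inside the container, the following minimal sketch tests whether the variable is set and whether the GDRCopy device node is reachable. The function names are hypothetical, and the device path `/dev/gdrdrv` is an assumption about where the gdrdrv module exposes its node; adjust it if your image differs.

```cpp
#include <cstdlib>
#include <cstring>
#include <unistd.h>

// Returns true when the value is exactly "enabled", the value used in the
// container spec. A null pointer means the variable is unset.
bool gdrcopy_value_enabled(const char *value) {
    return value != nullptr && std::strcmp(value, "enabled") == 0;
}

// Reads NVIDIA_GDRCOPY from the current process environment.
bool gdrcopy_env_enabled() {
    return gdrcopy_value_enabled(std::getenv("NVIDIA_GDRCOPY"));
}

// Assumption: the gdrdrv kernel module exposes its device node at
// /dev/gdrdrv. Returns true when the node is readable and writable.
bool gdrdrv_device_accessible() {
    return access("/dev/gdrdrv", R_OK | W_OK) == 0;
}
```

Calling `gdrcopy_env_enabled()` and `gdrdrv_device_accessible()` at the start of a job can surface a misconfigured container before NVSHMEM initialization fails later on.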
2. Patch NVSHMEM ibgda
NVSHMEM version 3.2.5 requires a patch to ibgda in your container(s). Download the NVSHMEM 3.2.5 source, then change line 3659 of src/modules/transport/ibgda/ibgda.cpp so that the device-name check matches "mlx5" instead of "ibp".
Original code:
if (!strstr(name, "ibp")) {
    ftable.close_device(device->context);
    device->context = NULL;
    NVSHMEMI_WARN_PRINT("device %s is not enumerated as an mlx5 device. Skipping...\n",
                        name);
    continue;
}
Modified code:
if (!strstr(name, "mlx5")) {
    ftable.close_device(device->context);
    device->context = NULL;
    NVSHMEMI_WARN_PRINT("device %s is not enumerated as an mlx5 device. Skipping...\n",
                        name);
    continue;
}
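To see what the substring change does, the sketch below isolates the name check in a small helper. The helper `passes_filter` and the sample device names are hypothetical and chosen only for illustration; the sketch assumes the devices in your container are named in the `mlx5_` style rather than the `ibp` style.

```cpp
#include <cstring>

// Hypothetical helper mirroring the check in ibgda.cpp: a device is kept
// only when its name contains the given substring.
bool passes_filter(const char *name, const char *needle) {
    return std::strstr(name, needle) != nullptr;
}
```

With the original substring, a device named `mlx5_0` fails the check and is skipped (`passes_filter("mlx5_0", "ibp")` is false), whereas with the patched substring it is kept (`passes_filter("mlx5_0", "mlx5")` is true).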