
NVSHMEM and GDRCopy Support

Use NVSHMEM and GDRCopy for high-performance GPU-to-GPU communication

With the release of ncore-image v.2.10.1, you can use NVSHMEM and GDRCopy in your container image for high-performance GPU-to-GPU communication.

Overview

NVSHMEM (NVIDIA SHMEM) and GDRCopy (GPUDirect RDMA Copy) enable direct memory access between GPUs without involving the CPU, significantly reducing latency and increasing throughput for certain workloads.

Accessing the image

To gain access to ncore-image v.2.10.1, contact CoreWeave Support.

Image modifications

ncore-image v.2.10.1 contains the following modifications to support NVSHMEM usage with ibgda:

NVIDIA driver options

  • nvidia.NVreg_EnableStreamMemOPs=1
  • nvidia.NVreg_RegistryDwords="PeerMappingOverride=1;"

GDRCopy driver

  • gdrdrv-dkms_2.5-1

Using the image

When using this image:

1. Enable the GDRCopy environment variable

Set the NVIDIA_GDRCOPY environment variable to enabled in the container. This gives the container access to gdrdrv:

Example

    env:
      - name: NVIDIA_GDRCOPY
        value: enabled

Note

If you're using SLURM, this environment variable is already set.
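
A quick way to catch a missing setting early is to check it at application startup. The sketch below is illustrative only: the helper names are hypothetical, the exact-match comparison against "enabled" is an assumption based on the example above, and /dev/gdrdrv is assumed to be the device node the gdrdrv driver creates.

```cpp
#include <cstdlib>
#include <cstring>
#include <sys/stat.h>

// True when the given NVIDIA_GDRCOPY value is exactly "enabled".
bool gdrcopy_requested(const char* value) {
    return value != nullptr && std::strcmp(value, "enabled") == 0;
}

// True when `path` exists and is a character device.
bool is_char_device(const char* path) {
    struct stat st;
    return stat(path, &st) == 0 && S_ISCHR(st.st_mode);
}

// True when the variable is set and the gdrdrv device node is visible.
// The device path is an assumption, not something this page specifies.
bool gdrcopy_ready() {
    return gdrcopy_requested(std::getenv("NVIDIA_GDRCOPY")) &&
           is_char_device("/dev/gdrdrv");
}
```

Calling gdrcopy_ready() before initializing NVSHMEM lets the application report a clear error instead of failing later inside the transport.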

2. Patch NVSHMEM ibgda

Download the NVSHMEM version 3.2.5 source and patch the ibgda transport in your container(s).

Change line 3659 of src/modules/transport/ibgda/ibgda.cpp. The stock check skips every device whose name does not contain "mlx5"; the patch makes it look for "ibp" instead.

Original code:

Example

    if (!strstr(name, "mlx5")) {
        ftable.close_device(device->context);
        device->context = NULL;
        NVSHMEMI_WARN_PRINT("device %s is not enumerated as an mlx5 device. Skipping...\n",
                            name);
        continue;
    }

Modified code:

Example

    if (!strstr(name, "ibp")) {
        ftable.close_device(device->context);
        device->context = NULL;
        NVSHMEMI_WARN_PRINT("device %s is not enumerated as an mlx5 device. Skipping...\n",
                            name);
        continue;
    }
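
The check is a plain substring test on the device name, so its effect can be tried in isolation. The following sketch mirrors only that test. The function name device_name_matches is hypothetical, and the device names in the note below are made-up examples, not taken from NVSHMEM or from a real node.

```cpp
#include <cstring>

// Same shape as the check in ibgda.cpp: a device is kept only when
// its name contains the substring `needle`.
bool device_name_matches(const char* name, const char* needle) {
    return std::strstr(name, needle) != nullptr;
}
```

With this helper, a name such as "mlx5_0" matches "mlx5" but not "ibp", while a name such as "ibp0" matches "ibp" but not "mlx5". Comparing the names your nodes report against both substrings shows which devices each version of the check would skip.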

Additional resources