Prerequisites
Before you begin, confirm your environment meets the following requirements:
- SUNK v7.4.0 or later.
- NCCL 2.28.2 or later in your container image.
- At least 2 vCPUs allocated per task.
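The version and CPU requirements above can be checked from a shell inside the job environment. The following is a minimal sketch, not an official check: version_at_least and has_min_cpus are hypothetical helpers, sort -V and nproc assume GNU coreutils, and how you obtain the installed NCCL version depends on your container image, so the helper takes it as an argument.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to version $2.
# Relies on the -V (version sort) option of GNU sort.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: succeeds when at least $1 CPUs are visible to this process.
# nproc reports the CPUs available to the current process, which may differ from
# the per-task allocation depending on how CPU binding is configured.
has_min_cpus() {
  [ "$(nproc)" -ge "$1" ]
}

# Example usage (the first version string is a placeholder you supply):
# version_at_least "2.28.3" "2.28.2" && echo "NCCL version OK"
# has_min_cpus 2 && echo "CPU count OK"
```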
Enable the plugin
In your cluster’s slurm_values.yaml, set compute.gpusd.enabled to true. See Slurm parameter reference for more information. Enabling this setting does the following:
- Downloads and installs the GPUSD package on compute nodes at startup.
- Exposes ports 10400-10407 on compute pods for metrics collection.
- Deploys a VMPodScrape resource to scrape NCCL plugin metrics.
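As a sketch of the compute.gpusd.enabled setting, the following writes a small values fragment with the key turned on. The nesting is inferred from the dotted path, and the file name gpusd-fragment.yaml is a placeholder; in practice you would make the equivalent change in your existing slurm_values.yaml rather than create a new file.

```shell
#!/bin/sh
# Write a values fragment (placeholder file name) that enables the plugin.
# The key nesting is assumed from the dotted path compute.gpusd.enabled.
cat > gpusd-fragment.yaml <<'EOF'
compute:
  gpusd:
    enabled: true
EOF
```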
For jobs that use --container or --container-image, an enroot hook automatically mounts the plugin into the container and sets NCCL_PROFILER_PLUGIN in the job environment. Container-based jobs don’t need additional configuration.
For jobs running without a container, add the following environment variable to your batch job script:
Enable debugging information
To confirm the plugin loaded successfully and to capture useful logs during initial validation, enable NCCL debug output. On first run, set export NCCL_DEBUG=INFO to print debugging information. If the plugin loaded correctly, the output includes a line resembling the following:
The following environment variables control the plugin’s performance metrics and logging:
| Variable | Purpose |
|---|---|
| GPUSD_PERF_DEBUG=0 | Disable performance metrics (hang detection only) |
| GPUSD_PERF_DEBUG=1 | Enable performance metrics (always on) |
| GPUSD_PERF_DEBUG=2 | Toggle metrics with SIGUSR1 (on) or SIGUSR2 (off) |
| GPUSD_DEBUG=VERSION | Minimal logging |
| GPUSD_DEBUG=INFO | Standard logging |
| GPUSD_DEBUG=TRACE | Verbose logging |
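The variables in the table above can be set in a batch job script before the application launches. This is a minimal sketch assuming you want signal-controlled metrics with standard logging; the chosen combination and the commented-out launch line, with its placeholder program name, are illustrative rather than a recommended configuration.

```shell
#!/bin/sh
# Print NCCL debug output so you can confirm the plugin loaded.
export NCCL_DEBUG=INFO
# Toggle performance metrics at runtime with SIGUSR1 (on) or SIGUSR2 (off).
export GPUSD_PERF_DEBUG=2
# Standard plugin logging.
export GPUSD_DEBUG=INFO

# Placeholder launch line; replace ./my_training_app with your own command.
# srun ./my_training_app
```

With GPUSD_PERF_DEBUG=2, a command such as kill -USR1 <pid> would be one way to send the toggle signal, though which process must receive it is not specified here.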
View metrics in Grafana
The Slurm Job Metrics dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:
| Panel | Description |
|---|---|
| GPU Straggler Detection | Identifies the rank and node causing a hang |
| GPU Straggler Detection overlay | Overlays hung rank information onto the Slurm Job Metrics panel |
| NCCL Metrics row | Shows NCCL latency, throughput, message size, and slow GPUs |