Enable GPU straggler detection

CoreWeave Straggler Detection tracks each GPU in a distributed training job and alerts when one falls out of sync, letting you cordon the offending node before it degrades job performance further. It also provides continuous NCCL-level metrics on latency, bandwidth, message sizes, and rank-level communication, all surfaced in dedicated Grafana dashboards with minimal performance overhead. This guide details how to install and configure the Straggler Detection plugin within a SUNK cluster. After you complete this guide, your cluster automatically detects GPU hangs and surfaces NCCL performance data in Grafana.

Prerequisites

Before you begin, confirm your environment meets the following requirements:

SUNK v7.4.0 or later.
NCCL 2.28.2 or later in your container image.
At least 2 vCPUs allocated per task.
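
A quick preflight check can compare the NCCL version in your image against the 2.28.2 minimum. The following is a minimal sketch, not part of the plugin: the helper name version_ge is hypothetical, it relies on GNU sort -V, and FOUND_NCCL is a placeholder because how you read the installed NCCL version depends on your container image.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes GNU coreutils, whose `sort -V` orders version strings numerically.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED_NCCL="2.28.2"
# Placeholder value: replace with the version reported by your container image.
FOUND_NCCL="${FOUND_NCCL:-2.28.3}"

if version_ge "$FOUND_NCCL" "$REQUIRED_NCCL"; then
  echo "NCCL $FOUND_NCCL meets the $REQUIRED_NCCL minimum"
else
  echo "NCCL $FOUND_NCCL is older than $REQUIRED_NCCL" >&2
fi
```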

Enable the plugin

In your cluster’s slurm_values.yaml, set compute.gpusd.enabled to true. See Slurm parameter reference for more information.

compute:
  gpusd:
    enabled: true

This setting automatically:

Downloads and installs the GPUSD package on compute nodes at startup.
Exposes ports 10400-10407 on compute pods for metrics collection.
Deploys a VMPodScrape resource to scrape NCCL plugin metrics.

For jobs launched with --container or --container-image, an enroot hook automatically mounts the plugin into the container and sets NCCL_PROFILER_PLUGIN in the job environment. Container-based jobs don’t need additional configuration. For jobs running without a container, add the following environment variable to your batch job script:

export NCCL_PROFILER_PLUGIN=/usr/lib/libnccl-profiler-gpusd.so
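
For a job that runs without a container, the export can sit near the top of the batch script. The sketch below assumes a typical sbatch layout; the job name and the launch command are hypothetical placeholders, and only the plugin path and the 2 vCPU requirement come from this guide.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=straggler-demo
# Meets the prerequisite of at least 2 vCPUs per task.
#SBATCH --cpus-per-task=2

# Point NCCL at the Straggler Detection profiler plugin (path from this guide).
export NCCL_PROFILER_PLUGIN=/usr/lib/libnccl-profiler-gpusd.so

# Warn early if the plugin file is not present on the node running this script.
if [ ! -f "$NCCL_PROFILER_PLUGIN" ]; then
  echo "warning: $NCCL_PROFILER_PLUGIN not found" >&2
fi

# Hypothetical launch command: uncomment and replace with your training entry point.
# srun python train.py
```

The missing-file check only warns, so the job still starts if the plugin is absent.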

Enable debugging information

To confirm the plugin loaded successfully and to capture useful logs during initial validation, enable NCCL debug output: on the first run, set export NCCL_DEBUG=INFO in the job environment. If the plugin loaded correctly, the output includes a line resembling the following:

h200-205-187:1189647:1189647 [0] NCCL INFO Successfully loaded external profiler plugin /usr/lib/libnccl-profiler-gpusd.so
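
Rather than scanning the job output by eye, you can search it for that line. This is a minimal sketch: the helper name check_plugin_loaded and the log path are hypothetical, and it assumes the job output was written to a file such as the default slurm-<jobid>.out that sbatch produces.

```shell
# check_plugin_loaded LOGFILE: succeeds if the log contains the plugin load message.
check_plugin_loaded() {
  grep -q "NCCL INFO Successfully loaded external profiler plugin" "$1"
}

# Hypothetical log path: replace 12345 with your job ID.
LOG_FILE="${LOG_FILE:-slurm-12345.out}"

if [ -f "$LOG_FILE" ] && check_plugin_loaded "$LOG_FILE"; then
  echo "profiler plugin loaded"
else
  echo "plugin load line not found in $LOG_FILE" >&2
fi
```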

You can set the following optional environment variables to control performance metrics collection and logging verbosity:

Setting	Purpose
`GPUSD_PERF_DEBUG=0`	Disable performance metrics (hang detection only)
`GPUSD_PERF_DEBUG=1`	Enable performance metrics (always on)
`GPUSD_PERF_DEBUG=2`	Toggle metrics with `SIGUSR1` (on) or `SIGUSR2` (off)
`GPUSD_DEBUG=VERSION`	Minimal logging
`GPUSD_DEBUG=INFO`	Standard logging
`GPUSD_DEBUG=TRACE`	Verbose logging
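
As an illustration of the signal-controlled mode, the sketch below sets GPUSD_PERF_DEBUG=2 with standard logging and maps "on" and "off" to the signals listed in the table. The helper name gpusd_signal_for and the TRAIN_PID variable are hypothetical. This guide does not say which process must receive the signal, so targeting the training process that loaded the plugin is an assumption.

```shell
# Let signals control metrics collection, with standard logging.
export GPUSD_PERF_DEBUG=2
export GPUSD_DEBUG=INFO

# gpusd_signal_for on|off: prints the signal name listed for GPUSD_PERF_DEBUG=2.
gpusd_signal_for() {
  case "$1" in
    on)  echo USR1 ;;
    off) echo USR2 ;;
    *)   echo "usage: gpusd_signal_for on|off" >&2; return 1 ;;
  esac
}

# Assumption: TRAIN_PID holds the PID of a process that loaded the plugin.
# kill -s "$(gpusd_signal_for on)" "$TRAIN_PID"
```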

View metrics in Grafana

The Slurm Job Metrics dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:

Panel	Description
GPU Straggler Detection	Identifies the rank and node causing a hang
GPU Straggler Detection overlay	Overlays hung rank information onto the Slurm Job Metrics panel
NCCL Metrics row	Shows NCCL latency, throughput, message size, and slow GPUs

With the plugin enabled and metrics flowing, you can use these dashboards to identify stragglers and slow GPUs during training runs and cordon affected nodes before they impact job completion.
