CoreWeave Straggler Detection tracks each GPU in a distributed training job and alerts when one falls out of sync, letting you cordon the offending node before it degrades job performance further. It also provides continuous NCCL-level metrics on latency, bandwidth, message sizes, and rank-level communication, all surfaced in dedicated Grafana dashboards with negligible performance overhead. This guide details how to install and configure the Straggler Detection plugin within a SUNK cluster. After completing this guide, your cluster will automatically detect GPU hangs and surface NCCL performance data in Grafana.
Prerequisites
Before you begin, confirm your environment meets the following requirements:

- SUNK v7.4.0 or later
- NCCL 2.28.2 or later in your container image
- At least 2 vCPUs allocated per task
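
The version requirements can be checked with a small comparison helper before you submit a job. The following is a minimal sketch, not part of the product: `version_ge` is a hypothetical helper name, the version values are examples, and it assumes a `sort` that supports version ordering with `-V` (as GNU coreutils does). How you obtain the installed SUNK and NCCL versions depends on your cluster and container image.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example values only: substitute the versions reported by your cluster and image.
sunk_version="7.4.0"
nccl_version="2.28.3"

if version_ge "$sunk_version" "7.4.0" && version_ge "$nccl_version" "2.28.2"; then
  echo "version requirements met"
else
  echo "version requirements not met" >&2
fi
```

The CPU requirement is a property of the job allocation and not of the image. In a Slurm batch script it is typically requested with `--cpus-per-task=2` or higher, assuming your cluster maps that option to vCPUs.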
Enable the plugin
In your cluster’s `slurm_values.yaml`, set `compute.gpusd.enabled` to `true`. See Slurm parameter reference for more information.

Enabling this setting does the following:

- Downloads and installs the GPUSD package on compute nodes at startup.
- Exposes ports `10400-10407` on compute pods for metrics collection.
- Deploys a VMPodScrape resource to scrape NCCL plugin metrics.
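
A dotted path such as `compute.gpusd.enabled` conventionally corresponds to nested keys in a YAML values file. The sketch below prints the fragment that would result under that assumption; `gpusd_values_fragment` is a hypothetical helper used only for illustration, and the exact nesting should be confirmed against the Slurm parameter reference before you edit `slurm_values.yaml`.

```shell
# Prints the YAML fragment for enabling the plugin.
# Assumption: the dotted path compute.gpusd.enabled maps to nested keys.
gpusd_values_fragment() {
  cat <<'EOF'
compute:
  gpusd:
    enabled: true
EOF
}

gpusd_values_fragment
```

Merge these keys into the existing `compute` section of your values file instead of adding a second top-level `compute` key.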
When a job is launched with `--container` or `--container-image`, an enroot hook automatically mounts the plugin into the container and sets `NCCL_PROFILER_PLUGIN` in the job environment. No additional configuration is needed for container-based jobs.
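
For container-based jobs, one way to confirm that the hook ran is to check for `NCCL_PROFILER_PLUGIN` at the start of the job script. This is a minimal sketch: `check_profiler_plugin` is a hypothetical helper, and it only reports whether the variable is set. It does not validate the value.

```shell
# Reports whether NCCL_PROFILER_PLUGIN is present in the job environment.
# Returns non-zero when the variable is unset or empty.
check_profiler_plugin() {
  if [ -n "${NCCL_PROFILER_PLUGIN:-}" ]; then
    echo "NCCL_PROFILER_PLUGIN=${NCCL_PROFILER_PLUGIN}"
  else
    echo "NCCL_PROFILER_PLUGIN is not set" >&2
    return 1
  fi
}

# Warn but keep going, so a missing plugin does not block the training job.
check_profiler_plugin || echo "continuing without confirming the profiler plugin" >&2
```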
For jobs running without a container, add the following environment variable to your batch job script:
Enable debugging information
On first run, we recommend setting `export NCCL_DEBUG=INFO`, which prints debugging information. If the plugin loaded correctly, the output includes a line resembling the following:

The following environment variables control the plugin's performance metrics and logging:

| Variable | Purpose |
|---|---|
| `GPUSD_PERF_DEBUG=0` | Disable performance metrics (hang detection only) |
| `GPUSD_PERF_DEBUG=1` | Enable performance metrics (always on) |
| `GPUSD_PERF_DEBUG=2` | Toggle metrics with `SIGUSR1` (on) or `SIGUSR2` (off) |
| `GPUSD_DEBUG=VERSION` | Minimal logging |
| `GPUSD_DEBUG=INFO` | Standard logging |
| `GPUSD_DEBUG=TRACE` | Verbose logging |
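
As an illustration of how the variables listed above might be combined, the sketch below sets them near the top of a batch script and defines a helper for the signal-based toggle. `toggle_gpusd_metrics` is a hypothetical helper, not part of the plugin, and it assumes you already know the PID of a training process started with `GPUSD_PERF_DEBUG=2`.

```shell
# Example settings for a first run.
export NCCL_DEBUG=INFO       # NCCL debug output, recommended on first run
export GPUSD_PERF_DEBUG=2    # performance metrics toggled by signal
export GPUSD_DEBUG=INFO      # standard plugin logging

# Hypothetical helper: turn metrics on or off for one process.
# Usage: toggle_gpusd_metrics on|off <pid>
toggle_gpusd_metrics() {
  if [ "$#" -ne 2 ]; then
    echo "usage: toggle_gpusd_metrics on|off <pid>" >&2
    return 2
  fi
  case "$1" in
    on)  kill -USR1 "$2" ;;   # SIGUSR1 switches metrics on
    off) kill -USR2 "$2" ;;   # SIGUSR2 switches metrics off
    *)   echo "usage: toggle_gpusd_metrics on|off <pid>" >&2; return 2 ;;
  esac
}
```

Mode `2` suits long jobs where you want metrics only while investigating a slowdown. How the signal is delivered to each rank's process depends on how your job is launched.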
View metrics in Grafana
The Slurm Job Metrics dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:

| Panel | Description |
|---|---|
| GPU Straggler Detection | Identifies the rank and node causing a hang |
| GPU Straggler Detection overlay | Overlays hung rank information onto the Slurm Job Metrics panel |
| NCCL Metrics row | Shows NCCL latency, throughput, message size, and slow GPUs |