CoreWeave Straggler Detection tracks each GPU in a distributed training job and alerts when one falls out of sync, letting you cordon the offending node before it degrades job performance further. It also provides continuous NCCL-level metrics on latency, bandwidth, message sizes, and rank-level communication, all surfaced in dedicated Grafana dashboards with minimal performance overhead. This guide details how to install and configure the Straggler Detection plugin within a SUNK cluster. After you complete this guide, your cluster automatically detects GPU hangs and surfaces NCCL performance data in Grafana.

Prerequisites

Before you begin, confirm your environment meets the following requirements:
  • SUNK v7.4.0 or later.
  • NCCL 2.28.2 or later in your container image.
  • At least 2 vCPUs allocated per task.
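A quick way to check the version requirements above is a dotted-version comparison in shell. The following is a minimal sketch, not an official CoreWeave tool: the helper name version_ge is hypothetical, it relies on the version sort (sort -V) in GNU coreutils, and you must supply the version strings yourself, because this guide does not say how SUNK or your container image reports them.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# A leading "v" (as in "v7.4.0") is stripped before comparing.
# Assumes GNU coreutils `sort -V` is available.
version_ge() {
  local a="${1#v}" b="${2#v}"
  # After a version sort, the smaller of the two versions comes first.
  # A >= B holds exactly when that first line is B.
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Placeholder value: replace with the NCCL version found in your image.
nccl_version="2.28.2"
if version_ge "$nccl_version" "2.28.2"; then
  echo "NCCL $nccl_version meets the 2.28.2 minimum"
else
  echo "NCCL $nccl_version is older than 2.28.2" >&2
fi
```

The same helper works for the SUNK requirement, for example version_ge "v7.4.1" "7.4.0".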

Enable the plugin

In your cluster’s slurm_values.yaml, set compute.gpusd.enabled to true. See Slurm parameter reference for more information.
compute:
  gpusd:
    enabled: true
This setting automatically:
  • Downloads and installs the GPUSD package on compute nodes at startup.
  • Exposes ports 10400-10407 on compute pods for metrics collection.
  • Deploys a VMPodScrape resource to scrape NCCL plugin metrics.
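To sanity-check the port range listed above, you could probe each port from a shell on a compute node. This is only a sketch under assumptions: the function names are hypothetical, the probe uses bash's /dev/tcp pseudo-device together with the coreutils timeout command, and a port may only accept connections while a job that loaded the plugin is running (the guide does not specify this).

```shell
#!/usr/bin/env bash
# gpusd_ports: print the metrics ports named in this guide, one per line.
gpusd_ports() {
  seq 10400 10407
}

# port_open HOST PORT: succeeds if a TCP connection opens within 1 second.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
  timeout 1 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_ports HOST: print "PORT open" or "PORT closed" for each metrics port.
report_ports() {
  local port
  for port in $(gpusd_ports); do
    if port_open "$1" "$port"; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}
```

Calling report_ports localhost prints eight lines, one per port. A closed port is not necessarily an error if no job is currently using the plugin.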
For jobs launched with --container or --container-image, an enroot hook automatically mounts the plugin into the container and sets NCCL_PROFILER_PLUGIN in the job environment. Container-based jobs don’t need additional configuration. For jobs running without a container, add the following environment variable to your batch job script:
export NCCL_PROFILER_PLUGIN=/usr/lib/libnccl-profiler-gpusd.so
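For non-container jobs, a batch script can guard the export so that a missing library produces a clear warning instead of a silent misconfiguration. The sketch below is illustrative: the function name enable_profiler_plugin is hypothetical, and only the library path and the NCCL_PROFILER_PLUGIN variable come from this guide.

```shell
#!/usr/bin/env bash
# enable_profiler_plugin [PATH]: export NCCL_PROFILER_PLUGIN if the library is
# readable. PATH defaults to the location given in this guide.
enable_profiler_plugin() {
  local path="${1:-/usr/lib/libnccl-profiler-gpusd.so}"
  if [ -r "$path" ]; then
    export NCCL_PROFILER_PLUGIN="$path"
  else
    echo "warning: $path not found; NCCL profiler plugin not enabled" >&2
    return 1
  fi
}

# In a batch script (train.py is a placeholder for your own launch command):
#   enable_profiler_plugin || exit 1
#   srun python train.py
```

Failing early with exit 1 is one possible choice. If hang detection is optional for a given job, you could log the warning and continue instead.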

Enable debugging information

To confirm the plugin loaded successfully and to capture useful logs during initial validation, enable NCCL debug output. On the first run, add export NCCL_DEBUG=INFO to your job script. If the plugin loaded correctly, the output includes a line resembling the following:
h200-205-187:1189647:1189647 [0] NCCL INFO Successfully loaded external profiler plugin /usr/lib/libnccl-profiler-gpusd.so
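Rather than scanning the job log by eye, you can search it for the success message shown above. This is a minimal sketch: the function name plugin_loaded and the log file name are hypothetical, and the search string is taken from the sample line in this guide, so it may need adjusting if your NCCL version words the message differently.

```shell
#!/usr/bin/env bash
# plugin_loaded LOGFILE: succeeds if the log contains NCCL's message about
# loading an external profiler plugin (wording taken from the sample line).
plugin_loaded() {
  grep -q "NCCL INFO Successfully loaded external profiler plugin" "$1"
}

# Example (slurm-12345.out is a placeholder for your job's output file):
#   if plugin_loaded slurm-12345.out; then
#     echo "profiler plugin loaded"
#   else
#     echo "profiler plugin message not found" >&2
#   fi
```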
You can set the following optional environment variables to control performance metrics collection and logging verbosity:
Variable              Purpose
GPUSD_PERF_DEBUG=0    Disable performance metrics (hang detection only)
GPUSD_PERF_DEBUG=1    Enable performance metrics (always on)
GPUSD_PERF_DEBUG=2    Toggle metrics with SIGUSR1 (on) or SIGUSR2 (off)
GPUSD_DEBUG=VERSION   Minimal logging
GPUSD_DEBUG=INFO      Standard logging
GPUSD_DEBUG=TRACE     Verbose logging
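When a job runs with GPUSD_PERF_DEBUG=2, the table above says metrics are switched on with SIGUSR1 and off with SIGUSR2. The sketch below wraps that mapping in two hypothetical helper functions. It assumes the signal should be delivered to the process that loaded the plugin (for example, a training rank), which this guide does not state explicitly.

```shell
#!/usr/bin/env bash
# metrics_signal on|off: print the signal name that the table associates
# with each state when GPUSD_PERF_DEBUG=2.
metrics_signal() {
  case "$1" in
    on)  echo USR1 ;;
    off) echo USR2 ;;
    *)   echo "usage: metrics_signal on|off" >&2; return 1 ;;
  esac
}

# toggle_metrics on|off PID: send the matching signal to one process.
# Assumption: PID is a process that has the plugin loaded.
toggle_metrics() {
  local sig
  sig="$(metrics_signal "$1")" || return 1
  kill -s "$sig" "$2"
}

# Example (4242 is a placeholder PID):
#   toggle_metrics on 4242    # start collecting performance metrics
#   toggle_metrics off 4242   # stop collecting
```

On a Slurm cluster, scancel --signal=USR1 with a job ID may be a more convenient way to reach a job's processes. Whether the plugin receives the signal that way depends on how the job is launched, so verify it in your environment.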

View metrics in Grafana

The Slurm Job Metrics dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:
Panel                             Description
GPU Straggler Detection           Identifies the rank and node causing a hang
GPU Straggler Detection overlay   Overlays hung rank information onto the Slurm Job Metrics panel
NCCL Metrics row                  Shows NCCL latency, throughput, message size, and slow GPUs
With the plugin enabled and metrics flowing, you can use these panels to identify stragglers and slow GPUs during training runs and cordon the affected nodes before they impact job completion.
Last modified on May 27, 2026