CoreWeave Straggler Detection tracks each GPU in a distributed training job and alerts when one falls out of sync, letting you cordon the offending node before it degrades job performance further. It also provides continuous NCCL-level metrics on latency, bandwidth, message sizes, and rank-level communication, all surfaced in dedicated Grafana dashboards with negligible performance overhead. This guide details how to install and configure the Straggler Detection plugin within a SUNK cluster. After completing this guide, your cluster will automatically detect GPU hangs and surface NCCL performance data in Grafana.

Prerequisites

Before you begin, confirm your environment meets the following requirements:
  • SUNK v7.4.0 or later
  • NCCL 2.28.2 or later in your container image
  • At least 2 vCPUs allocated per task
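The version minimums above can be checked with a small helper before a job is submitted. This is a minimal sketch, not part of SUNK or the plugin: `version_ge` and `check_prereq` are hypothetical names, and how you find the installed SUNK or NCCL version is left to you. It assumes GNU `sort` with the `-V` option is available.

```shell
#!/usr/bin/env bash
# Minimums taken from the prerequisites above.
MIN_SUNK_VERSION="7.4.0"
MIN_NCCL_VERSION="2.28.2"

# version_ge <found> <minimum>
# Succeeds when <found> is greater than or equal to <minimum>.
# A leading "v" (as in "v7.4.0") is ignored. Relies on GNU sort -V.
version_ge() {
  local found="${1#v}" minimum="${2#v}"
  [ "$(printf '%s\n%s\n' "$minimum" "$found" | sort -V | head -n 1)" = "$minimum" ]
}

# check_prereq <name> <found-version> <minimum-version>
# Hypothetical helper: prints a one-line verdict and returns non-zero
# when the found version is older than the minimum.
check_prereq() {
  if version_ge "$2" "$3"; then
    echo "ok: $1 $2 >= $3"
  else
    echo "too old: $1 $2 < $3"
    return 1
  fi
}
```

For example, `check_prereq NCCL 2.28.3 "$MIN_NCCL_VERSION"` reports that the version is acceptable, while `check_prereq NCCL 2.27.9 "$MIN_NCCL_VERSION"` returns a non-zero status.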

Enable the plugin

In your cluster’s slurm_values.yaml, set compute.gpusd.enabled to true. See the Slurm parameter reference for more information.
compute:
  gpusd:
    enabled: true
This single setting automatically:
  • Downloads and installs the GPUSD package on compute nodes at startup.
  • Exposes ports 10400-10407 on compute pods for metrics collection.
  • Deploys a VMPodScrape resource to scrape NCCL plugin metrics.
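To see whether anything is listening on the metrics ports listed above, you could probe them from inside a compute pod. The sketch below is an illustration only: the helper names are hypothetical, it uses bash's built-in `/dev/tcp` redirection and the coreutils `timeout` command, and it assumes the ports are reachable on the host you pass in. The guide does not say when the ports start listening, so a closed port is not necessarily an error.

```shell
#!/usr/bin/env bash
# Port range taken from the guide: 10400-10407.
GPUSD_PORT_FIRST=10400
GPUSD_PORT_LAST=10407

# Print each metrics port on its own line.
gpusd_ports() {
  seq "$GPUSD_PORT_FIRST" "$GPUSD_PORT_LAST"
}

# port_is_open <host> <port>
# Succeeds if a TCP connection opens within one second.
port_is_open() {
  timeout 1 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_gpusd_ports [host]
# Hypothetical helper: prints "<port> open" or "<port> closed" per port.
report_gpusd_ports() {
  local host="${1:-127.0.0.1}" port
  for port in $(gpusd_ports); do
    if port_is_open "$host" "$port"; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}
```

An open port only shows that a listener exists. It does not confirm that the VMPodScrape resource is collecting metrics from it.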
For jobs launched with --container or --container-image, an enroot hook automatically mounts the plugin into the container and sets NCCL_PROFILER_PLUGIN in the job environment. No additional configuration is needed for container-based jobs.
For jobs running without a container, add the following environment variable to your batch job script:
export NCCL_PROFILER_PLUGIN=/usr/lib/libnccl-profiler-gpusd.so
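In a batch script, it can help to confirm that the library is present before exporting the variable. The following is a minimal sketch under that assumption. `enable_gpusd_plugin` is a hypothetical helper, not something the plugin provides. Only the library path comes from this guide.

```shell
#!/usr/bin/env bash
# Plugin path from the guide.
GPUSD_PLUGIN_DEFAULT="/usr/lib/libnccl-profiler-gpusd.so"

# enable_gpusd_plugin [path]
# Exports NCCL_PROFILER_PLUGIN only when the library file is readable.
# Otherwise it prints a warning to stderr and leaves the variable unchanged.
enable_gpusd_plugin() {
  local plugin="${1:-$GPUSD_PLUGIN_DEFAULT}"
  if [ -r "$plugin" ]; then
    export NCCL_PROFILER_PLUGIN="$plugin"
    echo "NCCL_PROFILER_PLUGIN=$plugin"
  else
    echo "warning: $plugin not found or unreadable; NCCL_PROFILER_PLUGIN left unchanged" >&2
    return 1
  fi
}
```

Call `enable_gpusd_plugin` directly in the script body, before the training command is launched. Calling it inside a command substitution would run it in a subshell and lose the export.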

Enable debugging information

On first run, we recommend setting NCCL_DEBUG=INFO (export NCCL_DEBUG=INFO) so that NCCL prints debugging information. If the plugin loaded correctly, the output includes a line resembling the following:
h200-205-187:1189647:1189647 [0] NCCL INFO Successfully loaded external profiler plugin /usr/lib/libnccl-profiler-gpusd.so
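Rather than scanning the job output by eye, you can search it for the confirmation message. This sketch assumes the message text matches the sample line above. `gpusd_plugin_loaded` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# gpusd_plugin_loaded [file...]
# Succeeds if the NCCL debug output (read from the given files, or from
# stdin when no file is given) contains the plugin load confirmation.
gpusd_plugin_loaded() {
  grep -q "NCCL INFO Successfully loaded external profiler plugin" "$@"
}
```

For example, `gpusd_plugin_loaded slurm-12345.out && echo "plugin loaded"` checks a job output file. The file name here is a placeholder for your own job's output.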
Other optional environment variables include:
| Variable | Purpose |
| --- | --- |
| GPUSD_PERF_DEBUG=0 | Disable performance metrics (hang detection only) |
| GPUSD_PERF_DEBUG=1 | Enable performance metrics (always on) |
| GPUSD_PERF_DEBUG=2 | Toggle metrics with SIGUSR1 (on) or SIGUSR2 (off) |
| GPUSD_DEBUG=VERSION | Minimal logging |
| GPUSD_DEBUG=INFO | Standard logging |
| GPUSD_DEBUG=TRACE | Verbose logging |
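With GPUSD_PERF_DEBUG=2, the signal toggle from the table above could be wrapped in a helper like the one below. This is a sketch under one stated assumption: the signal is sent to the process that loaded the plugin, which this guide does not spell out. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# gpusd_signal_for on|off
# Maps a desired metrics state to the signal named in the table above.
gpusd_signal_for() {
  case "$1" in
    on)  echo "USR1" ;;
    off) echo "USR2" ;;
    *)   echo "usage: gpusd_signal_for on|off" >&2; return 2 ;;
  esac
}

# gpusd_toggle_metrics on|off <pid>
# Assumption: <pid> is the process that loaded the plugin and was
# started with GPUSD_PERF_DEBUG=2.
gpusd_toggle_metrics() {
  local sig
  sig="$(gpusd_signal_for "$1")" || return
  kill -s "$sig" "$2"
}
```

Take care with the target PID. The default action for SIGUSR1 and SIGUSR2 is to terminate a process that has no handler installed, so only signal a process you know is running with GPUSD_PERF_DEBUG=2.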

View metrics in Grafana

The Slurm Job Metrics dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:
| Panel | Description |
| --- | --- |
| GPU Straggler Detection | Identifies the rank and node causing a hang |
| GPU Straggler Detection overlay | Overlays hung rank information onto the Slurm Job Metrics panel |
| NCCL Metrics row | Shows NCCL latency, throughput, message size, and slow GPUs |
With the plugin enabled and metrics flowing, you can use these dashboards to identify stragglers and slow GPUs during training runs and cordon affected nodes before they impact job completion.
Last modified on May 1, 2026