Introduction to GPU Straggler Detection
Learn about CoreWeave Mission Control™ GPU Straggler Detection
Overview
In large-scale model training, thousands of GPUs need to communicate with each other. When even a single GPU falls behind, whether because of a software glitch, a networking issue, or hardware degradation, the entire distributed training job slows down. These GPU "stragglers" don't crash the job outright; they quietly stretch iteration time, which makes them much harder to detect, and the logs and telemetry available from a typical training run are rarely enough to find them. The consequences are reduced training throughput, researcher time lost to silent slowdowns, and up to 10% of total training compute wasted.
For most research teams, finding a straggler is a painful and time-consuming debugging exercise. Engineers pore over available logs, compare node performance by hand, and repeatedly resubmit jobs just to isolate which GPU is causing the slowdown. Often, teams don't even have the necessary data and are ultimately unable to find the straggling GPU. Because distributed training depends on complex GPU-to-GPU communication, even experienced teams struggle to pinpoint the root cause quickly.
CoreWeave's Straggler Detection removes this guesswork entirely. It applies CoreWeave's proprietary detection algorithms to fine-grained NCCL telemetry, isolating the exact GPU and node that are falling out of sync with the rest of the job. When a straggler is detected, customers receive a precise, actionable alert with recommended next steps, such as cordoning the offending node to prevent further scheduling. CoreWeave then cleanly removes the problematic node so the user can immediately reschedule the job and return to full-speed training.
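For customers who prefer to act on the alert themselves, cordoning is the same operation as `kubectl cordon <node>` and can also be done programmatically. The sketch below uses the official Kubernetes Python client; the node name is illustrative, and in practice the alert identifies the exact node to cordon.

```python
from kubernetes import client, config

def cordon_node(node_name: str) -> None:
    """Mark a node unschedulable, equivalent to `kubectl cordon <node>`."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    body = {"spec": {"unschedulable": True}}
    client.CoreV1Api().patch_node(node_name, body)

# Illustrative node name; a real alert names the exact node that is straggling.
cordon_node("gpu-node-017")
```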
Alongside straggler detection, customers now gain access to new NCCL-level Prometheus metrics that dramatically improve observability. These metrics make it easier to understand GPU communication patterns, diagnose training bottlenecks, and troubleshoot multi-node inference behavior. The feature introduces negligible performance overhead, whether a job runs on a single node (8 GPUs) or thousands of nodes. Together, these capabilities give customers far greater insight and control over their clusters, reducing wasted compute and accelerating iteration speed.
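Because the metrics are standard Prometheus series, they can also be pulled programmatically through the Prometheus HTTP API, outside of Grafana. In the sketch below, the endpoint URL and the metric name `nccl_allreduce_latency_seconds` are placeholders for illustration, not the product's actual names.

```python
import requests

# Placeholder endpoint and metric name, for illustration only.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"
QUERY = (
    "avg by (rank) ("
    "rate(nccl_allreduce_latency_seconds_sum[5m])"
    " / rate(nccl_allreduce_latency_seconds_count[5m])"
    ")"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    rank = series["metric"].get("rank", "?")
    _, value = series["value"]
    print(f"rank {rank}: mean AllReduce latency {float(value) * 1e3:.2f} ms")
```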
Enabling Straggler Detection
Straggler Detection is available in Private Preview for select customers and can be enabled with a simple cluster configuration update plus a few new environment variables in the job. Once enabled, CoreWeave Grafana dashboards surface NCCL telemetry for distributed jobs running on Slurm or other schedulers such as Kueue, with no workflow changes required. Customers can also disable NCCL telemetry collection and straggler detection at any time by updating their cluster configuration.
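As a sketch of what the job-side change looks like, the snippet below injects environment variables before the NCCL communicator is created in a PyTorch job. The variable names are placeholders, not CoreWeave's actual settings; the real names and the cluster-side configuration come with Private Preview onboarding.

```python
import os
import torch.distributed as dist

# The actual variable names come from CoreWeave's Private Preview onboarding;
# the two below are placeholders only. NCCL picks up its environment when the
# communicator is created, so setting them anywhere before init_process_group()
# is sufficient.
os.environ.setdefault("EXAMPLE_TELEMETRY_ENABLE", "1")       # placeholder
os.environ.setdefault("EXAMPLE_TELEMETRY_ENDPOINT", "auto")  # placeholder

dist.init_process_group(backend="nccl")
```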
Straggler Detection in action
When running distributed training on CoreWeave, customers gain real-time visibility into GPU communication behavior through a new set of NCCL performance metrics and purpose-built Grafana dashboards. These dashboards expose low-level signals such as NCCL collective communication latency, algorithmic bandwidth (AlgoBW), bus bandwidth (BusBW), message sizes, and more. Previously, gathering this data required stopping jobs and running nccl-tests, which created downtime and slowed iteration. With Straggler Detection, the same information is collected continuously from live jobs with negligible overhead. The dashboards are designed for fast diagnosis and help customers answer questions like:
- Is my job slowing down due to GPU-to-GPU communication issues?
- Which GPU or node is underperforming?
- Is NVLink bandwidth performing as expected?
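Assuming the dashboards follow the same conventions as nccl-tests, AlgoBW is the payload size divided by collective time, and BusBW rescales it by a collective-specific factor (2(n-1)/n for AllReduce) so numbers stay comparable as the job scales. A worked example:

```python
def allreduce_bandwidths(message_bytes: int, elapsed_s: float, n_ranks: int) -> tuple[float, float]:
    """AlgoBW and BusBW (GB/s) for an AllReduce, using the nccl-tests conventions."""
    algo_bw = message_bytes / elapsed_s                 # payload bytes divided by time
    bus_bw = algo_bw * (2 * (n_ranks - 1) / n_ranks)    # corrected for the AllReduce traffic pattern
    return algo_bw / 1e9, bus_bw / 1e9

# Example: a 1 GiB AllReduce across 8 GPUs completing in 12 ms.
algo, bus = allreduce_bandwidths(1 << 30, 0.012, 8)
print(f"AlgoBW ~ {algo:.1f} GB/s, BusBW ~ {bus:.1f} GB/s")  # ~89.5 and ~156.6 GB/s
```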
To simplify troubleshooting, the dashboards include visual annotations that place straggler detection signals alongside job state timelines and FLOPs graphs. This makes it easy to correlate NCCL behavior with job progression and training efficiency.
Example 1: Locating a slow GPU in SUNK
When running distributed training on CoreWeave using SUNK, users can correlate the timestamp of a detected straggler with their training logs and Weights & Biases job monitoring. This allows them to match symptoms in their training code with rank-level NCCL telemetry.
The Straggler Detection table identifies the exact rank/GPU and node falling behind. Each entry links to a detailed dashboard with additional node-level signals, making it straightforward to determine whether the slowdown originates from hardware, networking, or the training code.
This reduces "job feels slow" debugging to pinpointing a specific GPU and node within seconds.
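A simple way to make that timestamp correlation effortless is to log wall-clock time and per-iteration duration from the training loop itself, so a straggler alert can be lined up with the exact step that slowed down. A minimal sketch with Weights & Biases (the project name and the dummy training step are illustrative):

```python
import time
import wandb

def train_step() -> float:
    """Stand-in for the real training step; returns a dummy loss."""
    time.sleep(0.05)
    return 0.0

wandb.init(project="straggler-demo")  # illustrative project name

for step in range(100):
    start = time.time()
    loss = train_step()
    # Logging both the step duration and the wall-clock time lets a detected
    # straggler's timestamp be matched to the exact training iteration.
    wandb.log({"loss": loss, "step_time_s": time.time() - start, "wall_clock": time.time()}, step=step)
```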
Example 2: Debugging distributed training issues
When investigating reduced Model FLOPs Utilization (MFU) or distributed-configuration inefficiencies in otherwise healthy jobs, NCCL telemetry significantly accelerates root-cause analysis by exposing rank-level detail across all communication groups.
Dashboards present bandwidth and latency by collective and by rank for process groups of different sizes (for example, tensor parallel, pipeline parallel, and large data parallel groups). This helps users identify:
- Collectives running significantly slower in specific groups
- Ranks consistently lagging during AllReduce, AllGather, or Broadcast
- Imbalances between small and large parallel groups
Because this data is collected from the production job, users can diagnose bottlenecks without running a profiler or restarting workloads, reducing both iteration time and GPU cost.
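As a concrete illustration of the kind of per-group comparison the dashboards support, the sketch below takes per-rank AllReduce bus bandwidth for two process groups (the numbers are made up) and flags ranks that fall well below their group's median:

```python
from statistics import median

# Made-up per-rank BusBW samples (GB/s) for two process groups in the same job.
busbw_by_group = {
    "tensor_parallel_0": {0: 310.0, 1: 308.0, 2: 305.0, 3: 214.0},
    "data_parallel": {0: 42.0, 8: 41.5, 16: 42.3, 24: 41.9},
}

def lagging_ranks(group: dict[int, float], tolerance: float = 0.85) -> list[int]:
    """Return ranks whose bandwidth is below `tolerance` times the group median."""
    group_median = median(group.values())
    return [rank for rank, bw in group.items() if bw < tolerance * group_median]

for name, ranks in busbw_by_group.items():
    slow = lagging_ranks(ranks)
    if slow:
        print(f"{name}: ranks {slow} are lagging their peers")
```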
Ongoing improvements
Straggler Detection and the surrounding NCCL observability tooling will continue to evolve during the Private Preview period. We will release feature updates, improvements, and expanded documentation on a rolling basis.
For more information or to request access to the Private Preview, contact us.