Overview

In large-scale model training, thousands of GPUs need to communicate with each other. When even a single GPU falls behind, whether because of a software glitch, a networking issue, or hardware degradation, the entire distributed training job slows down. These GPU “stragglers” don’t crash the job outright; they quietly stretch iteration time. That makes them much harder to detect, and the existing logs and telemetry from a training run are not enough to find them. The consequences are reduced training throughput, researcher time lost to silent slowdowns, and up to 10% of total training compute wasted.
For most research teams, finding a straggler is a painful and time-consuming debugging exercise. Engineers pore over available logs, compare node performance by hand, and repeatedly resubmit jobs just to isolate which GPU is causing the slowdown. Often, teams don’t have the necessary data at all and never find the straggling GPU. Because distributed training depends on complex GPU-to-GPU communication, even experienced teams struggle to pinpoint the root cause quickly.
CoreWeave’s Straggler Detection removes this guesswork. It applies CoreWeave’s proprietary detection algorithms to fine-grained NCCL telemetry, isolating the exact GPU and node that are falling out of sync with the rest of the job. When a straggler is detected, customers receive a precise, actionable alert with recommended next steps, such as cordoning the offending node to prevent further scheduling. CoreWeave then cleanly removes the problematic node so the user can immediately reschedule the job and return to full-speed training.
Alongside straggler detection, customers gain access to new NCCL-level Prometheus metrics that improve observability. These metrics make it easier to understand GPU communication patterns, diagnose training bottlenecks, and troubleshoot multi-node inference behavior.
The feature introduces negligible performance overhead, whether a job uses a single node (8 GPUs) or thousands. Together, these capabilities give customers far greater insight and control over their clusters, reducing wasted compute and accelerating iteration speed.
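To make the cost of a straggler concrete, here is a minimal sketch of why one slow GPU stretches the whole job. It assumes a fully synchronous step in which every rank waits for the slowest one; the function names and timings are hypothetical and illustrative only, and this is not CoreWeave's detection algorithm.

```python
def step_time(per_rank_seconds):
    """A synchronous step ends only when the slowest rank finishes."""
    return max(per_rank_seconds)


def wasted_fraction(per_rank_seconds):
    """Share of total GPU time spent waiting on the slowest rank in one step."""
    slowest = step_time(per_rank_seconds)
    busy = sum(per_rank_seconds)
    return 1.0 - busy / (slowest * len(per_rank_seconds))


# Hypothetical per-rank times (seconds) for one 8-GPU node.
healthy = [1.00] * 8
one_slow = [1.00] * 7 + [1.25]  # a single rank is 25% slower

print("healthy step:", step_time(healthy))
print("step with one straggler:", step_time(one_slow))
print("GPU time wasted:", round(wasted_fraction(one_slow), 3))
```

In this toy case the job runs at the speed of its slowest rank, and the seven healthy GPUs sit idle for part of every step.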

Enabling Straggler Detection

Straggler Detection is available in Private Preview for select customers and can be enabled with a simple cluster configuration update plus a few new environment variables in the job. Once enabled, CoreWeave Grafana dashboards surface NCCL telemetry for distributed jobs running on Slurm or other schedulers such as Kueue, with no workflow changes required. Customers can also disable NCCL telemetry collection and straggler detection at any time by updating their cluster configuration. For instructions, see Enable GPU Straggler Detection.

Straggler Detection in action

When running distributed training on CoreWeave, customers gain real-time visibility into GPU communication behavior through a new set of NCCL performance metrics and purpose-built Grafana dashboards. The metrics are surfaced in the Slurm Job Metrics dashboard, which exposes low-level signals such as NCCL collective communication latency, algorithmic bandwidth (AlgoBW), bus bandwidth (BusBW), message sizes, and more. Previously, gathering this data required stopping jobs and running nccl-tests, which created downtime and slowed iteration. With Straggler Detection, the same information is collected continuously from live jobs with negligible overhead. The dashboards are designed for fast diagnosis and help customers answer questions like:
  • Is my job slowing down due to GPU-to-GPU communication issues?
  • Which GPU or node is underperforming?
  • Is NVLink bandwidth performing as expected?
To simplify troubleshooting, the dashboards include visual annotations that overlay straggler detection signals on the Slurm job state timeline. This gives customers an immediate cue that something is wrong, and they can scroll down to the NCCL metrics and Straggler Detection panels for the underlying detail.
Slurm Job Metrics dashboard with GPU Straggler Detection annotations overlaid on the job state row
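For readers unfamiliar with the AlgoBW and BusBW terms used above, the sketch below computes both from a message size and a measured collective time, using the correction factors documented by nccl-tests. Whether the dashboards apply exactly the same definitions is an assumption; the function names and sample numbers are hypothetical.

```python
# Bus-bandwidth correction factors as documented by nccl-tests,
# where n is the number of ranks in the communicator.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def algo_bw_gb_per_s(size_bytes, seconds):
    """Algorithmic bandwidth: bytes moved per second as seen by the caller."""
    return size_bytes / seconds / 1e9


def bus_bw_gb_per_s(collective, size_bytes, seconds, n_ranks):
    """Bus bandwidth: AlgoBW scaled so it can be compared to link speed."""
    factor = BUS_FACTORS[collective](n_ranks)
    return algo_bw_gb_per_s(size_bytes, seconds) * factor


# Hypothetical measurement: a 1 GB AllReduce across 8 ranks taking 10 ms.
algo = algo_bw_gb_per_s(1e9, 0.010)
bus = bus_bw_gb_per_s("all_reduce", 1e9, 0.010, 8)
print(f"AlgoBW: {algo:.1f} GB/s, BusBW: {bus:.1f} GB/s")
```

BusBW is the more useful number when asking whether NVLink or the network is performing as expected, because the correction factor removes the dependence on rank count.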

Example 1: Locating a slow GPU in SUNK

When running distributed training on CoreWeave using SUNK, users can correlate the timestamp of a detected straggler with their training logs and Weights & Biases job monitoring. This allows them to match symptoms in their training code with rank-level NCCL telemetry. The Straggler Detection table identifies the exact rank/GPU and node falling behind. Each row links out to detailed Node and Pod dashboards so customers can keep drilling into the health of the underlying resources. This reduces “job feels slow” debugging to pinpointing a specific GPU and node within seconds.
GPU Straggler Detection table with rank, node, and pod columns and drill-down links
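Correlating a straggler timestamp with training logs can be as simple as filtering log lines around the alert time. The sketch below assumes a hypothetical log format in which each line begins with an ISO 8601 timestamp in the same timezone as the alert; the helper name, the log contents, and the 60-second window are illustrative choices, not part of the product.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, alert_time, window=timedelta(seconds=60)):
    """Return log lines whose leading timestamp falls within the window."""
    nearby = []
    for line in log_lines:
        stamp, _, _rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(when - alert_time) <= window:
            nearby.append(line)
    return nearby


# Hypothetical training log and straggler alert time.
logs = [
    "2026-01-15T10:30:00 step=1198 iter_time=1.01s",
    "2026-01-15T10:31:30 step=1199 iter_time=1.02s",
    "2026-01-15T10:32:10 step=1200 iter_time=1.27s",
    "2026-01-15T10:35:00 step=1201 iter_time=1.26s",
    "warning: line without timestamp",
]
alert = datetime(2026, 1, 15, 10, 32, 0)

for line in lines_near(logs, alert):
    print(line)
```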

Example 2: Debugging distributed training issues

When investigating reduced MFU or distributed-configuration inefficiencies in otherwise healthy jobs, NCCL telemetry significantly accelerates root-cause analysis by exposing rank-level detail across all communication groups. Dashboards present bandwidth and latency by collective and by rank for process groups of different sizes (for example, tensor parallel, pipeline parallel, and large data parallel groups). This helps users identify:
  • Collectives running significantly slower in specific groups
  • Ranks consistently lagging during AllReduce, AllGather, or Broadcast
  • Imbalances between small and large parallel groups
Previously, capturing bus bandwidth and collective latency required stopping the training run to execute a separate benchmark such as nccl-tests, which left expensive GPUs idle. With Straggler Detection, the same panels are populated continuously from the live job, so customers can diagnose bottlenecks without restarting workloads or paying the cost of idle GPUs.
NCCL bus bandwidth per rank panel from the Slurm Job Metrics dashboard
NCCL collective latency per rank panel from the Slurm Job Metrics dashboard
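The kind of per-group comparison described above can be sketched offline as well. The example below assumes per-rank bus bandwidth samples have already been exported into a nested dictionary (a hypothetical structure) and flags ranks whose median falls well below their group's median. The 0.8 threshold is an arbitrary illustration, it assumes lower per-rank bandwidth marks the lagging rank, and none of this represents CoreWeave's proprietary detection logic.

```python
from statistics import median


def lagging_ranks(samples, threshold=0.8):
    """Flag ranks whose median BusBW is below threshold * group median.

    samples: {group_name: {rank: [busbw samples in GB/s]}}
    """
    flagged = {}
    for group, per_rank in samples.items():
        rank_medians = {rank: median(vals) for rank, vals in per_rank.items()}
        group_median = median(rank_medians.values())
        slow = sorted(
            rank for rank, m in rank_medians.items()
            if m < threshold * group_median
        )
        if slow:
            flagged[group] = slow
    return flagged


# Hypothetical samples for two process groups of four ranks each.
samples = {
    "tensor_parallel": {
        0: [400, 410, 405],
        1: [398, 402, 401],
        2: [250, 260, 255],  # consistently slower than its peers
        3: [405, 399, 403],
    },
    "data_parallel": {
        0: [90, 92],
        1: [91, 89],
        2: [88, 90],
        3: [90, 91],
    },
}

print(lagging_ranks(samples))
```

Comparing each rank against its own group, rather than against a global baseline, avoids flagging a group simply because its collectives are smaller or cross slower links.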

Ongoing improvements

Straggler Detection and the surrounding NCCL observability tooling will continue to evolve during the Private Preview period. We will release feature updates, improvements, and expanded documentation on a rolling basis. For more information or to request access to the Private Preview, contact us.
Last modified on May 6, 2026