# Enable GPU straggler detection

> Install and configure the GPU Straggler Detection plugin in SUNK to monitor GPUs and detect distributed-job hangs.

[CoreWeave Straggler Detection](/products/sunk/discover_sunk/straggler-detection) tracks each GPU in a distributed training job and alerts when one falls out of sync, letting you cordon the offending node before it degrades job performance further. It also provides continuous NCCL-level metrics on latency, bandwidth, message sizes, and rank-level communication, all surfaced in dedicated [Grafana dashboards](/observability/managed-grafana/sunk/slurm-job-metrics) with minimal performance overhead.

This guide details how to install and configure the Straggler Detection plugin within a SUNK cluster. After you complete this guide, your cluster automatically detects GPU hangs and surfaces NCCL performance data in Grafana.

## Prerequisites

Before you begin, confirm your environment meets the following requirements:

* SUNK v7.4.0 or later.
* NCCL 2.28.2 or later in your container image.
* At least 2 vCPUs allocated per task.
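
The NCCL version requirement can be checked with a simple version comparison. The following is a minimal sketch, not part of SUNK: `version_at_least` is a hypothetical helper name, it assumes GNU `sort -V` is available, and how you read the NCCL version from your container image is left to you.

```bash
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on GNU sort's version ordering (-V).
version_at_least() {
  local have="$1" need="$2"
  # If the smaller of the two versions is the required one, "have" is new enough.
  [[ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)" == "$need" ]]
}
```

For example, `version_at_least "$nccl_version" 2.28.2 || echo "NCCL is too old" >&2`, where `nccl_version` is a variable you populate yourself.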

## Enable the plugin

In your cluster's `slurm_values.yaml`, set `compute.gpusd.enabled` to `true`. See [Slurm parameter reference](/products/sunk/reference/slurm-parameters) for more information.

```yaml theme={"system"}
compute:
  gpusd:
    enabled: true
```

This setting automatically:

* Downloads and installs the GPUSD package on compute nodes at startup.
* Exposes ports `10400-10407` on compute pods for metrics collection.
* Deploys a VMPodScrape resource to scrape NCCL plugin metrics.

For jobs launched with `--container` or `--container-image`, an enroot hook automatically mounts the plugin into the container and sets `NCCL_PROFILER_PLUGIN` in the job environment. Container-based jobs don't need additional configuration.

For jobs running without a container, add the following environment variable to your batch job script:

```bash theme={"system"}
export NCCL_PROFILER_PLUGIN=/usr/lib/libnccl-profiler-gpusd.so
```
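
If you want a non-container batch script to fail loudly when the plugin library is missing, a guard like the one below may help. This is a sketch under assumptions: `enable_gpusd_profiler` is a hypothetical helper name, and the default path is the one shown in this guide.

```bash
# Hypothetical helper: export NCCL_PROFILER_PLUGIN only if the library is readable.
# The default path is the one documented in this guide; pass another path to override it.
enable_gpusd_profiler() {
  local plugin_path="${1:-/usr/lib/libnccl-profiler-gpusd.so}"
  if [[ -r "$plugin_path" ]]; then
    export NCCL_PROFILER_PLUGIN="$plugin_path"
    return 0
  fi
  echo "warning: $plugin_path not found; NCCL_PROFILER_PLUGIN was not set" >&2
  return 1
}
```

Call `enable_gpusd_profiler` in the batch script before the `srun` line so that the launched tasks inherit the variable.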

### Enable debugging information

To confirm that the plugin loaded and to capture useful logs during initial validation, set `NCCL_DEBUG=INFO` in the job environment on the first run (for example, with `export NCCL_DEBUG=INFO`). If the plugin loaded correctly, the output includes a line resembling the following:

```text theme={"system"}
h200-205-187:1189647:1189647 [0] NCCL INFO Successfully loaded external profiler plugin /usr/lib/libnccl-profiler-gpusd.so
```
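
Rather than scanning the output by eye, you could search the job log for that message. The helper below is a minimal sketch: `gpusd_plugin_loaded` is a hypothetical name, and it assumes the NCCL debug output was written to a log file you can read.

```bash
# Hypothetical helper: succeeds if the given job log contains the
# "Successfully loaded external profiler plugin" line shown above.
gpusd_plugin_loaded() {
  local log_file="$1"
  # -F treats the pattern as a fixed string; -q suppresses output.
  grep -qF "NCCL INFO Successfully loaded external profiler plugin" -- "$log_file"
}
```

With Slurm's default output naming, the log is `slurm-<jobid>.out` in the submission directory, so a check might look like `gpusd_plugin_loaded "slurm-${SLURM_JOB_ID}.out" || echo "plugin not loaded" >&2`.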

You can also set two optional environment variables: `GPUSD_PERF_DEBUG` controls performance metrics collection, and `GPUSD_DEBUG` controls logging verbosity. The supported values are:

| Variable              | Purpose                                               |
| --------------------- | ----------------------------------------------------- |
| `GPUSD_PERF_DEBUG=0`  | Disable performance metrics (hang detection only)     |
| `GPUSD_PERF_DEBUG=1`  | Enable performance metrics (always on)                |
| `GPUSD_PERF_DEBUG=2`  | Toggle metrics with `SIGUSR1` (on) or `SIGUSR2` (off) |
| `GPUSD_DEBUG=VERSION` | Minimal logging                                       |
| `GPUSD_DEBUG=INFO`    | Standard logging                                      |
| `GPUSD_DEBUG=TRACE`   | Verbose logging                                       |
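
With `GPUSD_PERF_DEBUG=2`, toggling metrics comes down to sending a signal. The sketch below wraps that in a hypothetical helper, `gpusd_toggle_metrics`. It assumes the target PID is the process that loaded the plugin; this guide does not specify which process must receive the signal.

```bash
# Hypothetical helper: toggle GPUSD performance metrics on a running process.
# Assumes <pid> is the process that loaded the plugin and that it was started
# with GPUSD_PERF_DEBUG=2.
gpusd_toggle_metrics() {
  local mode="$1" pid="$2"
  case "$mode" in
    on)  kill -USR1 "$pid" ;;   # SIGUSR1 turns metrics on
    off) kill -USR2 "$pid" ;;   # SIGUSR2 turns metrics off
    *)   echo "usage: gpusd_toggle_metrics on|off <pid>" >&2; return 2 ;;
  esac
}
```

Send these signals only to processes that are known to handle them. The default action for `SIGUSR1` and `SIGUSR2` is to terminate the receiving process.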

## View metrics in Grafana

The [Slurm Job Metrics](/observability/managed-grafana/sunk/slurm-job-metrics) dashboard in CoreWeave Grafana surfaces Straggler Detection data across several panels:

| Panel                           | Description                                                     |
| ------------------------------- | --------------------------------------------------------------- |
| GPU Straggler Detection         | Identifies the rank and node causing a hang                     |
| GPU Straggler Detection overlay | Overlays hung rank information onto the Slurm Job Metrics panel |
| NCCL Metrics row                | Shows NCCL latency, throughput, message size, and slow GPUs     |

With the plugin enabled and metrics flowing, you can use these panels to identify stragglers and slow GPUs during training runs and cordon the affected nodes before they delay job completion.
