# Drain and undrain Slurm nodes

> Identify why a node was drained, and manually drain and undrain nodes

This page shows Slurm administrators how to identify why a node was drained, manually drain a node for maintenance, and return drained nodes to service. Use these procedures when you need to temporarily remove a node from scheduling, investigate an automatic drain, or restore nodes after resolving an underlying issue.

Draining a Slurm node prevents it from accepting new jobs while letting any running jobs complete. While jobs are still running, the node is in the `draining` state. After the last job finishes, the node enters the `drained` state and remains unavailable for scheduling until it's undrained (resumed). Unlike Kubernetes node draining, which evicts running pods, Slurm draining doesn't terminate running workloads. It only stops new jobs from being assigned to the node.

SUNK login nodes include built-in aliases for draining, undraining, and monitoring node states. These are available in every SUNK login node shell by default:

| Alias                        | Description                                                           |
| ---------------------------- | --------------------------------------------------------------------- |
| `drain [NODE-NAME] [REASON]` | Drain a node with a reason, automatically prefixed with your username |
| `undrain [NODE-NAME]`        | Undrain (resume) a single node                                        |
| `dl`                         | List nodes in both `drain` and `idle` state, with reasons             |
| `dld`                        | List all nodes in `drain` state with reasons                          |
| `sn`                         | Show detailed node information (`scontrol show node`)                 |

## When nodes are drained automatically

Before you manually drain or undrain a node, it helps to understand when SUNK drains nodes on its own and when it returns them to service without user intervention.

CoreWeave's [HPC Verification](/platform/fleet-management/hpc-verification) framework periodically checks the health of all compute nodes in a cluster. If a health check detects an issue, SUNK automatically drains the affected nodes. In some cases, CoreWeave's services automatically drain, restart, recheck, and undrain the node with no action required from you.

Nodes can also be drained in response to an underlying error or event, such as a failing [prolog or epilog script](/products/sunk/run_workloads/prolog-epilog). These cases can require troubleshooting to determine the underlying cause.

### Automatic undrain with health checks

When a drain reason includes `sunk:verify-undrain`, SUNK automatically undrains the node after it passes the next hourly [HPC Verification](/platform/fleet-management/hpc-verification) health check. No user action is required. If the node fails the health check, it remains drained and you must investigate the underlying issue.

This mechanism also applies when you manually drain a node. If you drain a node and include `sunk:verify-undrain` in the reason, SUNK automatically returns the node to service once it passes the next health check. This is useful when you want to temporarily remove a node from service and have it automatically return after CoreWeave's health checks confirm the node is healthy.

For example, to drain a node and have it automatically undrain after a passing health check:

```bash theme={"system"}
drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"
```

If you have a compute node in a `drained` state, you can [identify the drain reason](#identify-the-drain-reason) to determine whether you need to take action. Outside the automated HPC Verification cycle, you can [manually drain a node](#manually-drain-a-slurm-node) if you need to temporarily prevent it from accepting new jobs, such as during maintenance, and [manually undrain the node](#manually-undrain-a-slurm-node) to return it to service.

CoreWeave's documentation also includes an overview of [Slurm node states](/products/sunk/manage_sunk/slurm-node-states), and further information about [nodes in the `INVAL` state](/products/sunk/manage_sunk/inval-node-state).

## Connect to the Slurm login pod

To drain and undrain nodes, first [connect to the Slurm login pod](/products/sunk/access_sunk/connect-to-slurm-login-node#connect-through-ssh).

After connecting to the Slurm login node, you can use the built-in aliases and `scontrol` commands in the following sections to examine and manage drained nodes.

<Tip>
  Run all Slurm commands, including `scontrol`, from within the Slurm login pod shell.
</Tip>

## Identify the drain reason

Before deciding whether to undrain a node, check the drain reason so you know whether the node is recovering automatically or requires manual intervention.

To find out why a node is in a drain state, use the `scontrol show node` command:

```bash theme={"system"}
scontrol show node [NODE-NAME]
```

Replace `[NODE-NAME]` with the name of the node you want to check, or omit `[NODE-NAME]` to list all nodes.

The output includes a `Reason` field that explains the node's drain state. Based on the reason listed, you can determine whether the node was drained because of a Kubernetes event or a Slurm issue and proceed accordingly.

<Note>
  An asterisk `*` alongside a node state indicates that the node isn't responding. Nodes in a `drain*` or `down*` state have been removed from the cluster and can be ignored. You might also see this suffix if you check the state before the pod is fully connected.
</Note>
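
When you only need the state and the drain reason, you can filter the `scontrol show node` output. The following is a minimal sketch, not part of SUNK: `drain_summary` is a hypothetical helper name, and it assumes the output contains `NodeName=`, `State=`, and `Reason=` fields, as standard `scontrol show node` output does.

```bash
# Hypothetical helper (not a SUNK alias): reads `scontrol show node` output
# on stdin and prints only the node name, state, and drain reason.
drain_summary() {
    grep -oE 'NodeName=[^ ]+|State=[^ ]+|Reason=.*'
}

# Usage on a login node:
#   scontrol show node [NODE-NAME] | drain_summary
```

Nodes without a drain reason produce no `Reason=` line, so a missing reason in the filtered output usually means the node isn't drained.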

### Aliases for node monitoring

CoreWeave provides several built-in aliases as part of the SUNK login node image for monitoring node states. These are available in every login node shell session.

The `sn` alias runs `scontrol show node`:

```bash theme={"system"}
sn
```

The `dl` alias lists nodes that are in both `drain` and `idle` state (drained, with no running jobs), along with the reason:

```bash theme={"system"}
dl
```

The `dl` alias is equivalent to:

```bash theme={"system"}
sinfo -t "drain&idle" -NO "NodeList:45,Comment:10,Timestamp:25,Reason:130" | uniq
```

The `dld` alias lists all nodes in `drain` state (including those actively draining with running jobs):

```bash theme={"system"}
dld
```
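
To see which drain reasons are most common across the cluster, you can group the `dl` output by reason. The following sketch is an illustration, not a SUNK alias: `count_drain_reasons` is a hypothetical name, and it assumes the output matches the `sinfo` command shown earlier in this section, with one header row and the reason starting at column 81 (45 + 10 + 25 characters for the preceding fields). Verify the offsets against your own output before relying on it.

```bash
# Hypothetical helper: reads `dl`-style output on stdin and prints each
# distinct drain reason with the number of nodes that report it.
# Steps: drop the header row, keep the Reason column (assumed to start at
# column 81), trim the trailing padding, then count identical reasons.
count_drain_reasons() {
    tail -n +2 \
        | cut -c81- \
        | sed 's/[[:space:]]*$//' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Usage on a login node:
#   dl | count_drain_reasons
```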

## Manually drain a Slurm node

Generally, SUNK automatically drains Slurm nodes when it finds an issue. If you plan to perform maintenance on a node or want to temporarily remove it from service, you can drain it manually.

The simplest way to drain a node is with the built-in `drain` alias:

```bash theme={"system"}
drain [NODE-NAME] "the reason for draining"
```

The `drain` alias automatically prefixes the reason with your username, resulting in a drain reason in the following format. Replace `[USERNAME]` with your username and `[REASON]` with the reason you provide.

```text theme={"system"}
[USERNAME]: [REASON]
```
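
The prefixing behavior can be approximated with a shell function. This is a sketch for illustration only, not the actual implementation of the `drain` alias: `drain_with_user` is a hypothetical name, `SCONTROL` is a hypothetical override for previewing the command, and the use of `$USER` as the username source is an assumption.

```bash
# Hypothetical approximation of the `drain` alias; the real alias may differ.
# Usage: drain_with_user NODE-NAME "REASON"
drain_with_user() {
    node="$1"
    reason="$2"
    if [ -z "$node" ] || [ -z "$reason" ]; then
        echo "usage: drain_with_user NODE-NAME REASON" >&2
        return 1
    fi
    # SCONTROL defaults to the real scontrol; set SCONTROL=echo to preview
    # the arguments without changing any node state.
    "${SCONTROL:-scontrol}" update nodename="$node" state=drain \
        reason="${USER}: ${reason}"
}
```

Requiring a non-empty reason mirrors the intent of the alias: every manual drain stays attributable to a person and a purpose.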

To drain a node and have SUNK automatically undrain it after it passes the next [HPC Verification](/platform/fleet-management/hpc-verification) health check, include `sunk:verify-undrain` in the reason:

```bash theme={"system"}
drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"
```

Alternatively, you can use `scontrol update` directly:

```bash theme={"system"}
scontrol update nodename=[NODE-NAME] state=drain reason="the reason for draining"
```

Replace `[NODE-NAME]` with the name of the node you're draining.

## Manually undrain a Slurm node

When a drain reason includes `sunk:verify-undrain`, SUNK automatically undrains the node after it passes the next hourly [HPC Verification](/platform/fleet-management/hpc-verification) health check, with no user action required. You can manually undrain these nodes if you don't want to wait for the automated check.

If you manually drained the node and have already corrected any underlying issues, you can manually undrain the node at your discretion.

### Undrain a single node

The simplest way to undrain a node is with the built-in `undrain` alias:

```bash theme={"system"}
undrain [NODE-NAME]
```

Alternatively, use `scontrol update` directly to change the node `state` to `resume`:

```bash theme={"system"}
scontrol update nodename=[NODE-NAME] state=resume
```

Replace `[NODE-NAME]` with the name of the node you're undraining.

Monitor the node after undraining it. If the issue that caused the automatic drain persists, SUNK can drain the node again the next time the node attempts to run a job.

### Undrain all drained nodes

To undrain all nodes that are currently in both `drain` and `idle` state, use the following command:

```bash theme={"system"}
for node in $(dl | cut -d " " -f 1); do
    # Skip the header row that sinfo prints before the node list.
    [ "$node" = "NODELIST" ] && continue
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done
```

This command uses the [`dl` alias](#aliases-for-node-monitoring).

### Undrain nodes by reason

Use `grep` to undrain only nodes that were drained for a specific reason. Before you undrain nodes in this manner, confirm which nodes the script processes. Replace `[REASON]` with the text to match:

```bash theme={"system"}
dl | grep "[REASON]"
```

After you confirm that the listed nodes match those you intend to undrain, run the undrain loop with the same filter. The following example undrains only nodes with `scheduler: k8s pod deletion timeout for job` listed in the drain reason:

```bash theme={"system"}
# Match the reason as a fixed string, then take the node name from the first column.
for node in $(dl | grep -F "scheduler: k8s pod deletion timeout for job" | cut -d " " -f 1); do
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done
```

This command uses the [`dl` alias](#aliases-for-node-monitoring).
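
Because these loops change the state of every matching node, a dry-run option can make them safer. The following sketch wraps the same `scontrol update` call in a function that reads `dl`-style lines from standard input. `undrain_from_list` and the `DRY_RUN` variable are hypothetical names introduced for illustration; they aren't part of SUNK.

```bash
# Hypothetical helper: reads lines whose first field is a node name and
# resumes each node. Set DRY_RUN=1 to print the commands without running them.
undrain_from_list() {
    # Keep only the first whitespace-delimited field of each line.
    while read -r node _; do
        # Skip blank lines and the sinfo header row.
        [ -z "$node" ] && continue
        [ "$node" = "NODELIST" ] && continue
        if [ -n "${DRY_RUN:-}" ]; then
            echo "would run: scontrol update nodename=$node state=resume"
        else
            echo "undraining $node"
            # Redirect stdin so scontrol can't consume the remaining input.
            scontrol update nodename="$node" state=resume </dev/null
        fi
    done
}

# Preview first, then run for real:
#   dl | grep -F "[REASON]" | DRY_RUN=1 undrain_from_list
#   dl | grep -F "[REASON]" | undrain_from_list
```

Define the function in your interactive login shell session, where the `dl` alias is available.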

## Common drain reasons

The following sections describe common drain reasons grouped by source, and indicate whether the node typically recovers on its own or requires manual intervention.

### Drain reasons related to Kubernetes

If the drain reason starts with `k8s:`, SUNK has drained the node due to a Kubernetes-related event, such as a cordoned node. Draining for these reasons is often temporary and resolves automatically.

<Note>
  A drain reason of `k8s: pod scheduled for deletion` indicates that the node is waiting to update and doesn't accept new jobs to avoid disrupting active work. This often appears on Slurm nodes when the Kubernetes NodeSet is updated. After the Compute pod restarts, Slurm undrains the node without any action needed from you.
</Note>

### Drain reasons related to Slurm

If the reason doesn't start with `k8s:`, this likely indicates a Slurm failure, and you might need to manually [undrain the node](#manually-undrain-a-slurm-node) after fixing any underlying problems.

If the drain reason contains `sunk:verify-undrain`, SUNK automatically returns the node to service after it passes the next hourly [HPC Verification](/platform/fleet-management/hpc-verification) health check. If the node fails this health check, you must investigate the underlying issue. You can view the output of these health checks in the [Node Details](/observability/managed-grafana/fleet/node-details) Grafana dashboard, or with the `sn` alias.

The following table lists common Slurm-related drain reasons:

| Drain reason                          | Meaning                                                                                                                | Safe to undrain?                               |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
| `NHC: [REASON]`                       | There was a Node Health Check failure during the epilog task.                                                          | Yes.                                           |
| `sunk:verify-undrain`                 | SUNK automatically returns the node to service after the next passing HPC Verification check.                          | Yes.                                           |
| `prolog pre-hook failed`              | A pre-hook task failed before starting the Slurm job. This is typically a temporary issue that resolves automatically. | Yes.                                           |
| `scheduler: k8s pod deletion timeout` | Slurm detected a failure during the `scheduler-epilog.sh` task.                                                        | Yes.                                           |
| `batch job complete failure`          | There have been intermittent errors with Slurm, or there has been a node failure.                                      | Yes, if you have verified the node is healthy. |
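
The guidance in this section can be summarized as a small decision helper. This is a sketch of the rules described on this page, not a SUNK tool: `classify_drain_reason` is a hypothetical name, and the labels it prints are illustrative. Because the `drain` alias prefixes reasons with a username, the `sunk:verify-undrain` check matches anywhere in the reason, while the Kubernetes check matches only a leading `k8s:`.

```bash
# Hypothetical helper: prints a rough category for a drain reason string.
classify_drain_reason() {
    case "$1" in
        *sunk:verify-undrain*)
            echo "auto-undrain after the next passing health check" ;;
        k8s:*)
            echo "kubernetes event, usually temporary" ;;
        *)
            echo "slurm-related, investigate before undraining" ;;
    esac
}

# Usage:
#   classify_drain_reason "k8s: pod scheduled for deletion"
```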
