
Drain and undrain Slurm nodes

Identify the reason a node was drained, and manually drain and undrain nodes

CoreWeave's HPC verification framework periodically checks the health of all compute nodes in a cluster. If an issue is detected during one of these health checks, the affected nodes are drained, or temporarily removed from service. In some cases, CoreWeave's services automatically attempt to drain, restart, recheck, and undrain the affected nodes, with no action required from the user.

However, nodes that are drained in response to an error or event, such as a failing prolog or epilog script, may require troubleshooting to determine the underlying cause.

If you have a compute node in a drained state, you can identify the drain reason to determine whether action needs to be taken. Outside of the automated HPC verification cycle, you can manually drain a node if you need to temporarily prevent it from accepting new jobs, such as when performing maintenance, and manually undrain the node to return it to service.

This guide provides an overview of Slurm drain reasons and demonstrates how to manually drain and undrain nodes.

CoreWeave's documentation also includes an overview of Slurm node states, and further information about nodes in the INVAL state.

Connect to the Slurm login pod

To drain and undrain nodes with scontrol, first connect to the Slurm login pod.

After connecting to the Slurm login pod, you can use the scontrol commands detailed below to examine and manage drained nodes.

Tip

Run all Slurm commands, including scontrol, from within the Slurm login pod shell.
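
For example, if you manage the cluster with kubectl, you can open a shell in the login pod with a command similar to the one below. This is a minimal sketch: <namespace> and <slurm-login-pod> are placeholders for the values in your deployment.

Example
$
kubectl exec -it -n <namespace> <slurm-login-pod> -- bash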

Identify the drain reason

To find out why a node is in a drain state, use the scontrol show node command:

Example
$
scontrol show node <node-name>

Replace <node-name> with the name of the node you want to check, or omit <node-name> entirely to list all nodes.

The output of this command includes the reason for the node's drain state. Based on the reason listed, you can determine whether the drained state is due to a Kubernetes event or a Slurm issue and proceed accordingly.
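
If you only need the reason itself, you can filter the output with grep:

Example
$
scontrol show node <node-name> | grep -i reason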

Note

An asterisk * alongside a node state indicates that the node is not responding. Nodes in a drain* or down* state have been removed from the cluster and can be ignored. You may see this suffix if you checked the state while the Pod was not yet fully connected.

Command aliases for node monitoring

Alternatively, CoreWeave provides some useful aliases as part of the SUNK image.

The sn alias runs the scontrol show node command shown above.

Example
$
sn

The dl alias lists all nodes that are both drained and idle, along with the reason for the state.

Example
$
dl

The dl alias runs the following command:

Example
$
sinfo -t "drain&idle" -NO "NodeList:45,Comment:10,Timestamp:25,Reason:130" | uniq

Manually drain a Slurm node

Generally, Slurm nodes will be automatically drained if an issue is found. If you plan to perform maintenance on a node, you may wish to drain it manually.

To manually drain a Slurm node, use scontrol update to change the node state to drain, as shown below:

Example
$
scontrol update nodename=<node-name> state=drain reason="the reason for draining"

Replace <node-name> with the name of the node you're draining.

The drain reason shows that the node was manually drained and identifies the user who initiated the drain, in the following format:

Example
<username>: <reason given by user>
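
For example, if the user alice drains a node before maintenance, the reason might read as follows (the username and text here are purely illustrative):

Example
alice: replacing a faulty cable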

Manually undrain a Slurm node

Drained nodes with sunk:verify-undrain listed in the drain reason will be automatically undrained with no user action required. You can manually undrain these nodes if you do not want to wait for the automation to undrain them.
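
To see which drained nodes carry this reason, filter the output of the dl alias described above:

Example
$
dl | grep "sunk:verify-undrain"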

If you manually drained the node and have already corrected any underlying issues, you can proceed to manually undrain the node at your discretion.

Undrain a single node with scontrol

To undrain a node and allow it to start accepting jobs again, use the scontrol update command to change the node state to resume, as shown below:

Example
$
scontrol update nodename=<node-name> state=resume

Replace <node-name> with the name of the node you're undraining.

Monitor the node after undraining it. If the issue that caused the automatic drain persists, the node may be automatically drained again the next time it tries to run a job.
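
One way to monitor the node is to poll its state and reason with sinfo, as in the following sketch. The 30-second interval and output fields are only suggestions:

Example
$
watch -n 30 "sinfo -n <node-name> -O NodeList,StateLong,Reason"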

Undrain all drained nodes

To undrain all nodes that are currently drained and idle, use the following command:

Example
$
for node in $(dl | cut -d " " -f 1); do
  echo "undraining $node";
  scontrol update nodename=$node state=resume;
done

Note that this command makes use of the dl alias described above.
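
To preview exactly which node names the loop will operate on, you can list them first:

Example
$
dl | cut -d " " -f 1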

Undrain nodes by reason

Use grep to undrain nodes that have been drained for a specific reason. Before undraining nodes in this manner, run the following command to confirm which nodes will be processed:

Example
$
dl | grep <reason>

After confirming that the listed nodes match those intended to be undrained, proceed to undrain them as detailed below.

In this example, we undrain only nodes with scheduler: k8s pod deletion timeout for job listed in the drain reason:

Example
$
for node in $(dl | grep "scheduler: k8s pod deletion timeout for job" | cut -d " " -f 1); do
  echo "undraining $node";
  scontrol update nodename=$node state=resume;
done

Note that this command makes use of the dl alias described above.