Drain and undrain Slurm nodes

This page shows Slurm administrators how to identify why a node was drained, manually drain a node for maintenance, and return drained nodes to service. Use these procedures when you need to temporarily remove a node from scheduling, investigate an automatic drain, or restore nodes after resolving an underlying issue.

Draining a Slurm node prevents it from accepting new jobs while letting any running jobs complete. After a node is drained, it enters a drained state and remains unavailable for scheduling until it’s undrained (resumed). Unlike Kubernetes node draining, which evicts running pods, Slurm draining doesn’t terminate running workloads. It only stops new jobs from being assigned to the node.

SUNK login nodes include built-in aliases for draining, undraining, and monitoring node states. These are available in every SUNK login node shell by default:

Alias	Description
`drain [NODE-NAME] [REASON]`	Drain a node with a reason, automatically prefixed with your username
`undrain [NODE-NAME]`	Undrain (resume) a single node
`dl`	List all nodes in `drain` and `idle` state with reasons
`dld`	List all nodes in `drain` state with reasons
`sn`	Show detailed node information (`scontrol show node`)

When nodes are drained automatically

Before you manually drain or undrain a node, it helps to understand the cases where SUNK drains nodes on its own and resolves them without user intervention. CoreWeave’s HPC Verification framework periodically checks the health of all compute nodes in a cluster. If a health check detects an issue, SUNK automatically drains the affected nodes. In some cases, CoreWeave’s services automatically drain, restart, recheck, and undrain the node with no action required from you. Nodes can also be drained in response to an underlying error or event, such as a failing prolog or epilog script. These cases can require troubleshooting to determine the underlying cause.

Automatic undrain with health checks

When a drain reason includes sunk:verify-undrain, SUNK automatically undrains the node after it passes the next hourly HPC Verification health check. No user action is required. If the node fails the health check, it remains drained and you must investigate the underlying issue.

This mechanism also applies when you manually drain a node. If you drain a node and include sunk:verify-undrain in the reason, SUNK automatically returns the node to service once it passes the next health check. This is useful when you want to temporarily remove a node from service and have it automatically return after CoreWeave’s health checks confirm the node is healthy. For example, to drain a node and have it automatically undrain after a passing health check:

drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"

If you have a compute node in a drained state, you can identify the drain reason to determine whether you need to take action. Outside the automated HPC Verification cycle, you can manually drain a node if you need to temporarily prevent it from accepting new jobs, such as during maintenance, and manually undrain the node to return it to service. CoreWeave’s documentation also includes an overview of Slurm node states, and further information about nodes in the INVAL state.

Connect to the Slurm login pod

To drain and undrain nodes, first connect to the Slurm login pod. After connecting, you can use the built-in aliases and scontrol commands in the following sections to examine and manage drained nodes.

Run all Slurm commands, including scontrol, from within the Slurm login pod shell.

Identify the drain reason

Before deciding whether to undrain a node, check the drain reason so you know whether the node is recovering automatically or requires manual intervention. To find out why a node is in a drain state, use the scontrol show node command:

scontrol show node [NODE-NAME]

Replace [NODE-NAME] with the name of the node you want to check, or omit [NODE-NAME] to list all nodes. The output shows a reason for the node’s drain state. Based on the reason listed, you can determine whether the node was drained because of a Kubernetes event or a Slurm issue and proceed accordingly.

An asterisk * alongside a node state indicates that the node isn’t responding. Nodes in a drain* or down* state have been removed from the cluster and can be ignored. You might also see this suffix if you check the state before the pod has fully connected.
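
When you need the state and reason for many nodes at once, the relevant fields can be pulled out of the scontrol show node output with a small filter. The following is a minimal sketch, not part of SUNK: the function name node_state_reason is hypothetical, and it assumes the output layout of recent Slurm releases, where each record starts with NodeName=, the State= field appears among other space-separated fields, and Reason= occupies its own line.

```shell
# Hypothetical helper (not a SUNK alias): read `scontrol show node` output on
# stdin and print "NODE<TAB>STATE<TAB>REASON" for every node that has a Reason.
# Assumes records begin with "NodeName=" and that "Reason=" is on its own line.
node_state_reason() {
    awk '
        /^NodeName=/ {
            # First field looks like NodeName=<name>; start a new record.
            split($1, kv, "=")
            node = kv[2]
            state = ""
        }
        {
            # State=<value> can appear anywhere on its line.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^State=/) {
                    state = substr($i, 7)
                }
            }
        }
        /^[ \t]*Reason=/ {
            # Keep the whole reason, including spaces and the trailing stamp.
            reason = $0
            sub(/^[ \t]*Reason=/, "", reason)
            printf "%s\t%s\t%s\n", node, state, reason
        }
    '
}

# Example: scontrol show node | node_state_reason
```

Nodes without a Reason line produce no output, so the result is limited to nodes that Slurm has annotated with a reason.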

Aliases for node monitoring

CoreWeave provides several built-in aliases as part of the SUNK login node image for monitoring node states. These are available in every login node shell session. The sn alias runs scontrol show node:

sn

The dl alias lists all nodes in drain and idle state, along with the reason:

dl

The dl alias is equivalent to:

sinfo -t "drain&idle" -NO "NodeList:45,Comment:10,Timestamp:25,Reason:130" | uniq

The dld alias lists all nodes in drain state (including those actively draining with running jobs):

dld

Manually drain a Slurm node

Generally, SUNK automatically drains Slurm nodes when it finds an issue. If you plan to perform maintenance on a node or want to temporarily remove it from service, you can drain it manually. The simplest way to drain a node is with the built-in drain alias:

drain [NODE-NAME] "the reason for draining"

The drain alias automatically prefixes the reason with your username. The resulting drain reason has the following format, where [USERNAME] is your username and [REASON] is the reason you provide:

[USERNAME]: [REASON]

To drain a node and have SUNK automatically undrain it after it passes the next HPC Verification health check, include sunk:verify-undrain in the reason:

drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"

Alternatively, you can use scontrol update directly:

scontrol update nodename=[NODE-NAME] state=drain reason="the reason for draining"

Replace [NODE-NAME] with the name of the node you’re draining.
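
If you use scontrol update directly and want the reason to follow the same [USERNAME]: [REASON] format that the drain alias produces, you can compose the string yourself. This is a sketch only: build_drain_reason is a hypothetical helper, not the implementation of the alias, and its optional "auto" argument is an invented convenience for appending sunk:verify-undrain.

```shell
# Hypothetical helper: compose a drain reason as "<user>: <reason>".
# Pass "auto" as the third argument to append the sunk:verify-undrain marker.
build_drain_reason() {
    bdr_user=$1
    bdr_reason=$2
    if [ "${3:-}" = "auto" ]; then
        bdr_reason="$bdr_reason (sunk:verify-undrain)"
    fi
    printf '%s: %s\n' "$bdr_user" "$bdr_reason"
}

# Example:
# scontrol update nodename=[NODE-NAME] state=drain \
#     reason="$(build_drain_reason "$USER" "replacing DIMM")"
```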

Manually undrain a Slurm node

SUNK automatically undrains drained nodes with sunk:verify-undrain listed in the drain reason after they pass the next hourly HPC Verification health check, with no user action required. You can manually undrain these nodes if you don’t want to wait for the automation to undrain them. If you manually drained the node and have already corrected any underlying issues, you can manually undrain the node at your discretion.

Undrain a single node

The simplest way to undrain a node is with the built-in undrain alias:

undrain [NODE-NAME]

Alternatively, use scontrol update directly to change the node state to resume:

scontrol update nodename=[NODE-NAME] state=resume

Replace [NODE-NAME] with the name of the node you’re undraining. Monitor the node after undraining it. If the issue that caused the automatic drain persists, SUNK can automatically drain the node again the next time the node attempts to run a job.
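
One way to monitor a node after undraining it is to re-check whether its state still carries a drain flag. The following is a minimal sketch, assuming the State value reported by scontrol show node contains the word DRAIN while the node is drained or draining (for example, IDLE+DRAIN). The name state_is_drained is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if a State value from
# `scontrol show node` includes a drain flag, for example "IDLE+DRAIN".
state_is_drained() {
    case "$1" in
        *DRAIN*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example: pass the State value you read from `scontrol show node [NODE-NAME]`.
# if state_is_drained "IDLE+DRAIN"; then echo "node is drained again"; fi
```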

Undrain all drained nodes

To undrain all nodes that are currently in both the drain and idle states (the nodes that dl lists), use the following command:

# Resume every node that dl lists (nodes that are both drained and idle).
for node in $(dl | cut -d " " -f 1); do
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done

This command uses the dl alias.
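
Because a bulk undrain touches many nodes at once, it can help to preview the commands first. The sketch below wraps the same scontrol update call in a hypothetical undrain_nodes function that defaults to a dry run. The DRY_RUN variable and the function name are assumptions for illustration, and the function skips a NODELIST header row on the assumption that dl prints the default sinfo header.

```shell
# Hypothetical helper: read node names on stdin, one per line, and resume each.
# DRY_RUN defaults to 1, which only prints the command that would run.
undrain_nodes() {
    while IFS= read -r node; do
        # Skip blank lines and the (assumed) sinfo header row.
        case "$node" in
            ''|NODELIST) continue ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: scontrol update nodename=$node state=resume"
        else
            echo "undraining $node"
            scontrol update nodename="$node" state=resume
        fi
    done
}

# Preview:  dl | cut -d " " -f 1 | undrain_nodes
# Apply:    dl | cut -d " " -f 1 | DRY_RUN=0 undrain_nodes
```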

Undrain nodes by reason

Use grep to undrain nodes that have been drained for a specific reason. Before you undrain nodes in this manner, confirm which nodes the script processes:

dl | grep "[REASON]"

After you confirm that the listed nodes match those you intend to undrain, run the undrain loop. The following example undrains only nodes whose drain reason contains scheduler: k8s pod deletion timeout for job:

# Resume only the nodes whose dl line contains the given reason text.
for node in $(dl | grep "scheduler: k8s pod deletion timeout for job" | cut -d " " -f 1); do
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done

This command uses the dl alias.
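
By default, grep interprets the reason text as a regular expression, so characters such as [ or . in a reason can match more or less than you intend. The following sketch selects node names with fixed-string matching instead. The function name nodes_with_reason is hypothetical, and it assumes the node name is the first whitespace-separated column of the dl output, with NODELIST as the header.

```shell
# Hypothetical helper: read `dl`-style output on stdin and print the names of
# nodes whose line contains the given text, matched literally (grep -F).
nodes_with_reason() {
    grep -F -- "$1" | awk '$1 != "NODELIST" { print $1 }' | sort -u
}

# Example: dl | nodes_with_reason "scheduler: k8s pod deletion timeout for job"
```

The output is one node name per line, which you can review before passing the names to scontrol update.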

Common drain reasons

The following sections describe common drain reasons grouped by source, and indicate whether the node typically recovers on its own or requires manual intervention.

Drain reasons related to Kubernetes

If the drain reason starts with k8s:, SUNK has drained the node due to a Kubernetes-related event, such as a cordoned node. Draining for these reasons is often temporary and resolves automatically.

A drain reason of k8s: pod scheduled for deletion indicates that the node is waiting to update and doesn’t accept new jobs to avoid disrupting active work. This often appears on Slurm nodes when the Kubernetes NodeSet is updated. After the Compute pod restarts, Slurm undrains the node without any action needed from you.

Drain reasons related to Slurm

If the reason doesn’t start with k8s:, this likely indicates a Slurm failure, and you might need to manually undrain the node after fixing any underlying problems.

If the drain reason contains sunk:verify-undrain, SUNK automatically returns the node to service after it passes the next hourly HPC Verification health check. If the node fails this health check, you must investigate the underlying issue. You can view the output of these health checks in the Node Details Grafana dashboard, or with the sn command.

The following table lists common Slurm-related drain reasons:

Drain reason	Meaning	Safe to undrain?
`NHC: [REASON]`	There was a Node Health Check failure during the epilog task.	Yes.
`sunk:verify-undrain`	SUNK automatically returns the node to service after the next passing HPC Verification check.	Yes.
`prolog pre-hook failed`	A pre-hook task failed before starting the Slurm job. This is typically a temporary issue that resolves automatically.	Yes.
`scheduler: k8s pod deletion timeout`	Slurm detected a failure during the `scheduler-epilog.sh` task.	Yes.
`batch job complete failure`	There have been intermittent errors with Slurm, or there has been a node failure.	Yes, if you have verified the node is healthy.
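
The triage rules in this section can be condensed into a small helper. This is an illustrative sketch only: classify_drain_reason and its three labels are hypothetical, and the rules are limited to what this page states (a sunk:verify-undrain marker, a k8s: prefix, or anything else).

```shell
# Hypothetical helper: map a drain reason to a coarse triage label.
#   auto-undrain  reason contains sunk:verify-undrain
#   kubernetes    reason starts with "k8s:"
#   investigate   anything else (likely Slurm-related or a manual drain)
classify_drain_reason() {
    case "$1" in
        *sunk:verify-undrain*) echo "auto-undrain" ;;
        k8s:*)                 echo "kubernetes" ;;
        *)                     echo "investigate" ;;
    esac
}

# Example: classify_drain_reason "k8s: pod scheduled for deletion"
```

The marker check comes first so that a manual drain such as "[USERNAME]: investigating issue (sunk:verify-undrain)" is labeled as recovering automatically.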
