Draining a Slurm node prevents it from accepting new jobs while allowing any currently running jobs to complete. A drained node remains unavailable for scheduling until it is undrained (resumed). Unlike Kubernetes node draining, which evicts running Pods, Slurm draining does not terminate running workloads; it only stops new jobs from being assigned to the node.
SUNK login nodes include built-in aliases for draining, undraining, and monitoring node states. These are available in every SUNK login node shell:
| Alias | Description |
| --- | --- |
| `drain <node> [REASON]` | Drain a node with a reason, automatically prefixed with your username |
| `undrain <node>` | Undrain (resume) a single node |
| `dl` | List all nodes in drain and idle state with reasons |
| `dld` | List all nodes in drain state with reasons |
| `sn` | Show detailed node information (scontrol show node) |

When nodes are drained automatically

CoreWeave’s HPC Verification framework periodically checks the health of all compute nodes in a cluster. If an issue is detected during one of these health checks, the affected nodes are automatically drained. In some cases, CoreWeave’s services automatically drain, restart, recheck, and undrain the node with no action required from the user. Nodes may also be drained in response to an underlying error or event, such as a failing prolog or epilog script. These may require troubleshooting to determine the underlying cause.

Automatic undraining with health checks

When a drain reason includes sunk:verify-undrain, the node is automatically undrained after it passes the next hourly HPC Verification health check. No user action is required. If the node fails the health check, it remains drained and the underlying issue must be investigated.
This mechanism also applies when you manually drain a node: if you include sunk:verify-undrain in the reason, the node is automatically returned to service once it passes the next health check. This is useful when you want to temporarily remove a node from service and have it return after CoreWeave’s health checks confirm that it is healthy.
For example, to drain a node and have it automatically undrain after a passing health check:
drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"
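If you often add the marker by hand, a small helper can append it consistently. The following is a minimal sketch; the function name `with_verify_undrain` is hypothetical and is not part of the SUNK login node image. It only builds the reason string and does not call Slurm.

```shell
# Hypothetical helper: append the sunk:verify-undrain marker to a drain
# reason, unless the reason already contains it.
with_verify_undrain() {
    case "$1" in
        *sunk:verify-undrain*) printf '%s\n' "$1" ;;
        *) printf '%s (sunk:verify-undrain)\n' "$1" ;;
    esac
}
```

It could then be combined with the drain alias, for example: drain [NODE-NAME] "$(with_verify_undrain "investigating issue")"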
If you have a compute node in a drained state, you can identify the drain reason to determine whether action needs to be taken. Outside of the automated HPC Verification cycle, you can manually drain a node if you need to temporarily prevent it from accepting new jobs, such as when performing maintenance, and manually undrain the node to return it to service. CoreWeave’s documentation also includes an overview of Slurm node states, and further information about nodes in the INVAL state.

Connect to the Slurm login pod

To drain and undrain nodes, first connect to the Slurm login pod. After connecting, you can use the built-in aliases and scontrol commands detailed below to examine and manage drained nodes.
Note: Run all Slurm commands, including scontrol, from within the Slurm login pod shell.

Identify the drain reason

To find out why a node is in a drain state, use the scontrol show node command:
scontrol show node [NODE-NAME]
Replace [NODE-NAME] with the name of the node you want to check, or omit [NODE-NAME] to list all nodes. The output shows the reason for the node’s drain state. Based on that reason, you can determine whether the drain was caused by a Kubernetes event or a Slurm issue and proceed accordingly.
An asterisk * alongside a node state indicates that the node is not responding. Nodes in a drain* or down* state have been removed from the cluster and can be ignored. You may see this suffix if you checked the state while the Pod was not yet fully connected.
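When many nodes are listed, the State and Reason fields can be hard to pick out of the full scontrol output. The sketch below condenses that output to one line per node that has a reason set, and flags states that carry the asterisk suffix. It assumes the usual key=value layout of scontrol show node (NodeName=, State=, and a Reason= line); the function name `summarize_nodes` is hypothetical.

```shell
# Hypothetical helper: read `scontrol show node` output on stdin and print
# node, state, responsiveness flag, and reason (tab-separated) for every
# node that has a Reason= line.
summarize_nodes() {
    awk '
        {
            # Remember the most recent NodeName= and State= values seen.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^NodeName=/) node = substr($i, 10)
                if ($i ~ /^State=/)    state = substr($i, 7)
            }
        }
        /^[ \t]*Reason=/ {
            reason = $0
            sub(/^[ \t]*Reason=/, "", reason)
            # An asterisk in the state means the node is not responding.
            flag = (state ~ /\*/) ? "not-responding" : "responding"
            printf "%s\t%s\t%s\t%s\n", node, state, flag, reason
        }
    '
}
```

For example: scontrol show node | summarize_nodes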

Aliases for node monitoring

CoreWeave provides several built-in aliases as part of the SUNK login node image for monitoring node states. These are available in every login node shell session. The sn alias runs scontrol show node:
sn
The dl alias lists all nodes in drain and idle state, along with the reason:
dl
The dl alias is equivalent to:
sinfo -t "drain&idle" -NO "NodeList:45,Comment:10,Timestamp:25,Reason:130" | uniq
The dld alias lists all nodes in drain state (including those actively draining with running jobs):
dld
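To see which drain reasons are most common, the dl or dld output can be grouped by reason. This is a rough sketch that assumes the fixed-width layout implied by the sinfo format string shown above (NodeList:45, Comment:10, Timestamp:25), so that the reason begins at column 81. If the aliases are defined differently on your cluster, the offset will need adjusting. The function name `count_by_reason` is hypothetical.

```shell
# Hypothetical helper: read dl/dld output on stdin and count nodes per
# drain reason. Assumes the reason starts at column 81 (45 + 10 + 25 + 1).
count_by_reason() {
    awk '
        {
            reason = substr($0, 81)
            sub(/[ \t]+$/, "", reason)          # strip column padding
            if (reason == "" || reason == "REASON") next   # skip blanks and header
            count[reason]++
        }
        END { for (r in count) printf "%d\t%s\n", count[r], r }
    ' | sort -rn
}
```

For example: dld | count_by_reason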

Manually drain a Slurm node

Generally, Slurm nodes will be automatically drained if an issue is found. If you plan to perform maintenance on a node or want to temporarily remove it from service, you may wish to drain it manually. The simplest way to drain a node is with the built-in drain alias:
drain [NODE-NAME] "the reason for draining"
The drain alias automatically prefixes the reason with your username, resulting in a drain reason in the following format:
<username>: <reason given by user>
To drain a node and have it automatically undrained after it passes the next HPC Verification health check, include sunk:verify-undrain in the reason:
drain [NODE-NAME] "investigating issue (sunk:verify-undrain)"
Alternatively, you can use scontrol update directly:
scontrol update nodename=[NODE-NAME] state=drain reason="the reason for draining"
Replace [NODE-NAME] with the name of the node you’re draining.
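If you use scontrol update directly but want the same username-prefixed reason that the drain alias produces, you can build the command yourself. The sketch below only prints the command so that it can be reviewed before running. The function name `drain_command` is hypothetical, and the actual implementation of the drain alias may differ.

```shell
# Hypothetical helper: print (do not run) a scontrol command that drains a
# node with a reason in the "<username>: <reason>" format described above.
drain_command() {
    node=$1
    reason=$2
    user=${USER:-$(id -un)}   # fall back to id if USER is unset
    printf 'scontrol update nodename=%s state=drain reason="%s: %s"\n' \
        "$node" "$user" "$reason"
}
```

For example, drain_command [NODE-NAME] "replacing cable" prints the full command, which you can then copy and run from the login pod shell.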

Manually undrain a Slurm node

Drained nodes with sunk:verify-undrain listed in the drain reason are automatically undrained after passing the next hourly HPC Verification health check, with no user action required. You can manually undrain these nodes if you do not want to wait for the automation to undrain them. If you manually drained the node and have already corrected any underlying issues, you can manually undrain the node at your discretion.

Undrain a single node

The simplest way to undrain a node is with the built-in undrain alias:
undrain [NODE-NAME]
Alternatively, use scontrol update directly to change the node state to resume:
scontrol update nodename=[NODE-NAME] state=resume
Replace [NODE-NAME] with the name of the node you’re undraining. Monitor the node after undraining it. If the issue that caused the automatic drain persists, the node may be automatically drained again the next time it tries to run a job.
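One way to monitor a node after undraining it is to query its state with sinfo. The following is a minimal sketch using the standard sinfo options -h (no header), -n (node list), and -o "%T" (extended state). The function names are hypothetical.

```shell
# Hypothetical helper: print the current extended state of one node.
# A node in several partitions is reported once per partition, so keep
# a single line.
node_state() {
    sinfo -h -n "$1" -o "%T" | sort -u | head -n 1
}

# Hypothetical helper: succeed if a state string is a drain state
# (for example "draining" or "drained").
is_drain_state() {
    case "$1" in
        drain*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: if is_drain_state "$(node_state [NODE-NAME])"; then echo "node was drained again"; fi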

Undrain all drained nodes

To undrain all nodes listed by the dl alias (nodes in drain and idle state), use the following command:
# Resume every node listed by dl; the first space-delimited field is the node name.
for node in $(dl | cut -d " " -f 1); do
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done
Note that this command makes use of the dl alias described above.
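sinfo normally prints a header row unless -h is passed, so depending on how dl is defined, the first field of its first line may be the literal column title NODELIST rather than a node name. As a hedged sketch, a small filter can extract only node names before they are passed to scontrol. The function name `node_names` is hypothetical.

```shell
# Hypothetical helper: read dl output on stdin and print unique node
# names, skipping blank lines and the sinfo header row if present.
node_names() {
    awk 'NF > 0 && $1 != "NODELIST" { print $1 }' | sort -u
}
```

For example, dl | node_names prints the list of nodes that the loop above would process.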

Undrain nodes by reason

Use grep to undrain nodes that have been drained for a specific reason. Before undraining nodes in this manner, run the following command to confirm which nodes will be processed:
dl | grep "[REASON]"
After confirming that the listed nodes match those intended to be undrained, proceed to undrain them as detailed below. In this example, we undrain only nodes with scheduler: k8s pod deletion timeout for job listed in the drain reason:
# Resume only the nodes whose drain reason contains the quoted text.
for node in $(dl | grep "scheduler: k8s pod deletion timeout for job" | cut -d " " -f 1); do
    echo "undraining $node"
    scontrol update nodename="$node" state=resume
done
Note that this command makes use of the dl alias described above.
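The confirm-then-undrain steps can also be folded into one function with a dry-run default. This is a sketch under stated assumptions: the function name `undrain_matching` and the `DRY_RUN` variable are hypothetical, and the reason text is matched as a fixed string rather than a regular expression. By default it only prints what it would do.

```shell
# Hypothetical helper: read dl output on stdin and resume nodes whose line
# contains the given fixed string. Dry-run unless DRY_RUN=0 is set.
undrain_matching() {
    pattern=$1
    grep -F -- "$pattern" | awk '{ print $1 }' | sort -u |
    while read -r node; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: scontrol update nodename=$node state=resume"
        else
            echo "undraining $node"
            scontrol update nodename="$node" state=resume
        fi
    done
}
```

For example, run dl | undrain_matching "scheduler: k8s pod deletion timeout for job" to preview the affected nodes, then repeat with DRY_RUN=0 set once the list looks correct.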

Common drain reasons

If the drain reason starts with k8s:, it means SUNK has drained the node due to a Kubernetes-related event, such as the node being cordoned. Draining for these reasons is often temporary and resolves automatically.
A drain reason of k8s: pod scheduled for deletion indicates that the node is waiting to update and will not accept new jobs to avoid disrupting active work. This often appears on Slurm nodes when the Kubernetes NodeSet is updated. Once the Compute pod restarts, Slurm should undrain the node without any action needed from you.
If the reason does not start with k8s:, this likely indicates a Slurm failure, and you may need to manually undrain the node after fixing any underlying problems.
If the drain reason contains sunk:verify-undrain, the node is automatically returned to service after passing the next hourly HPC Verification health check. If the node fails this health check, the underlying issue must be investigated. You can view the output of these health checks in the Node Details Grafana dashboard, or with the sn command.
The following table lists common Slurm-related drain reasons:
| Drain reason | Meaning | Safe to undrain? |
| --- | --- | --- |
| NHC: [REASON] | There was a Node Health Check failure during the epilog task. | Yes. |
| sunk:verify-undrain | SUNK will automatically return the node to service after the next passing HPC Verification check. | Yes. |
| prolog pre-hook failed | A pre-hook task failed before starting the Slurm job. This is typically a temporary issue that will automatically resolve. | Yes. |
| scheduler: k8s pod deletion timeout | Slurm detected a failure during the scheduler-epilog.sh task. | Yes. |
| batch job complete failure | There have been intermittent errors with Slurm, or there has been a node failure. | Yes, if you have verified the node is healthy. |
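The rules in this section can be summarized as a simple first-pass triage. The sketch below is illustrative only: the function name `classify_drain_reason` and its three category labels are hypothetical, and it encodes just the rules stated above (a k8s: prefix, the sunk:verify-undrain marker, and everything else).

```shell
# Hypothetical helper: first-pass triage of a drain reason string.
classify_drain_reason() {
    case "$1" in
        k8s:*)                 echo "kubernetes" ;;    # usually resolves on its own
        *sunk:verify-undrain*) echo "auto-undrain" ;;  # returns after a passing health check
        *)                     echo "slurm" ;;         # may need investigation and a manual undrain
    esac
}
```

Note that a reason such as scheduler: k8s pod deletion timeout is classified as slurm, because it mentions k8s but does not start with the k8s: prefix.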
Last modified on April 20, 2026