Draining a Slurm node prevents it from accepting new jobs while allowing any currently running jobs to complete. Once a node is drained, it enters a drained state and remains unavailable for scheduling until it is undrained (resumed). Unlike Kubernetes node draining, which evicts running Pods, Slurm draining does not terminate running workloads; it only stops new jobs from being assigned to the node.
SUNK login nodes include built-in aliases for draining, undraining, and monitoring node states. These are available in every SUNK login node shell out of the box:
| Alias | Description |
|---|---|
| drain <node> [REASON] | Drain a node with a reason, automatically prefixed with your username |
| undrain <node> | Undrain (resume) a single node |
| dl | List all nodes in drain and idle state with reasons |
| dld | List all nodes in drain state with reasons |
| sn | Show detailed node information (scontrol show node) |
When nodes are drained automatically
CoreWeave’s HPC Verification framework periodically checks the health of all compute nodes in a cluster. If an issue is detected during one of these health checks, the affected nodes are automatically drained. In some cases, CoreWeave’s services automatically drain, restart, recheck, and undrain the node with no action required from the user. Nodes may also be drained in response to an underlying error or event, such as a failing prolog or epilog script. These may require troubleshooting to determine the underlying cause.
Automatic undraining with health checks
When a drain reason includes sunk:verify-undrain, the node will be automatically undrained after it passes the next hourly HPC Verification health check. No user action is required. If the node fails the health check, it remains drained and the underlying issue must be investigated.
This mechanism also applies when you manually drain a node. If you drain a node and include sunk:verify-undrain in the reason, the node will be automatically returned to service once it passes the next health check. This is useful when you want to temporarily remove a node from service and have it automatically return after CoreWeave’s health checks confirm the node is healthy.
For example, to drain a node and have it automatically undrain after a passing health check:
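A minimal sketch of such a drain, written against scontrol update rather than the drain alias so that it also works in scripts, where shell aliases are usually not expanded. The helper name drain_auto_undrain and the free-text note are hypothetical; State=DRAIN and Reason= are standard scontrol update fields, and the reason simply ends with the sunk:verify-undrain marker.

```shell
# Hypothetical helper (not a SUNK built-in): drain a node with a reason that
# ends in the sunk:verify-undrain marker.
drain_auto_undrain() {
  if [ "$#" -lt 1 ]; then
    echo "usage: drain_auto_undrain NODE-NAME [NOTE...]" >&2
    return 2
  fi
  local node="$1"
  shift
  local note="$*"   # optional free-text note; may be empty
  # If a note was given, it is placed before the marker, separated by a space.
  scontrol update NodeName="$node" State=DRAIN \
    Reason="${note:+$note }sunk:verify-undrain"
}
```

Calling drain_auto_undrain [NODE-NAME] planned maintenance would set the reason to "planned maintenance sunk:verify-undrain". Unlike the drain alias, this sketch does not prefix the reason with your username.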
If a node is in a drained state, you can identify the drain reason to determine whether action needs to be taken. Outside of the automated HPC Verification cycle, you can manually drain a node if you need to temporarily prevent it from accepting new jobs, such as when performing maintenance, and manually undrain the node to return it to service.
CoreWeave’s documentation also includes an overview of Slurm node states, and further information about nodes in the INVAL state.
Connect to the Slurm login pod
To drain and undrain nodes, first connect to the Slurm login pod. After connecting to the Slurm login node, you can use the built-in aliases and scontrol commands detailed below to examine and manage drained nodes.
Identify the drain reason
To find out why a node is in a drain state, use the scontrol show node command:
Replace [NODE-NAME] with the actual name of the node you want to check, or remove the [NODE-NAME] entirely to list all nodes.
The output of these commands shows a reason for the node’s drain state. Based on the reason listed, you can determine if the drained state is due to a Kubernetes event or a Slurm issue and proceed accordingly.
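When only the reason matters, the Reason field can be filtered out of the scontrol show node output. The sketch below assumes the reason is printed on its own indented line beginning with Reason=, which is how scontrol typically formats it; the helper name node_drain_reason is hypothetical.

```shell
# Hypothetical helper: print only the drain reason for one node.
node_drain_reason() {
  # scontrol show node prints indented Key=Value fields. The sed expression
  # keeps only the line that starts with "Reason=" and strips that prefix.
  scontrol show node "$1" | sed -n 's/^[[:space:]]*Reason=//p'
}
```

The function prints nothing for a node that has no reason set.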
An asterisk (*) alongside a node state indicates that the node is not responding. Nodes in a drain* or down* state have been removed from the cluster and can be ignored. You may see this suffix if you checked the state while the Pod was not yet fully connected.
Aliases for node monitoring
CoreWeave provides several built-in aliases as part of the SUNK login node image for monitoring node states. These are available in every login node shell session.
The sn alias runs scontrol show node:
The dl alias lists all nodes in drain and idle state, along with the reason:
The dl alias is equivalent to:
The dld alias lists all nodes in drain state (including those actively draining with running jobs):
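For scripts, where the aliases may not be available, a comparable listing can be produced with standard sinfo options. This is a sketch under assumptions, not the definition of dl or dld: the function name list_drained_nodes and the pipe-separated output format are chosen here for illustration.

```shell
# Hypothetical stand-in for a drain listing; not the definition of dl or dld.
list_drained_nodes() {
  # -h: omit the header          -N: one line per node
  # -t drain: nodes that are drained or still draining
  # %N = node name, %T = node state, %E = drain reason
  # sort -u removes duplicates for nodes that belong to several partitions.
  sinfo -h -N -t drain -o '%N|%T|%E' | sort -u
}
```

Passing -t drained instead of -t drain would restrict the listing to nodes that have finished draining.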
Manually drain a Slurm node
Generally, Slurm nodes will be automatically drained if an issue is found. If you plan to perform maintenance on a node or want to temporarily remove it from service, you may wish to drain it manually. The simplest way to drain a node is with the built-in drain alias:
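The drain alias automatically prefixes the reason with your username, resulting in a drain reason in the following format:
To have the node automatically undrained after it passes the next health check, include sunk:verify-undrain in the reason:
You can also use scontrol update directly:
Replace [NODE-NAME] with the name of the node you’re draining.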
Manually undrain a Slurm node
Drained nodes with sunk:verify-undrain listed in the drain reason will be automatically undrained after passing the next hourly HPC Verification health check, with no user action required. You can manually undrain these nodes if you do not want to wait for the automation to undrain them.
If you manually drained the node and have already corrected any underlying issues, you can proceed to manually undrain the node at your discretion.
Undrain a single node
The simplest way to undrain a node is with the built-in undrain alias:
Alternatively, use scontrol update directly to change the node state to resume:
Replace [NODE-NAME] with the name of the node you’re undraining.
Monitor the node after undraining it. If the issue that caused the automatic drain persists, the node may be automatically drained again the next time it tries to run a job.
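The resume and the follow-up check can be combined in one step. The sketch below is a hypothetical helper, undrain_and_check, that sets State=RESUME with scontrol update and then prints the State and Reason lines from scontrol show node, assuming those fields appear at the start of a line in that output.

```shell
# Hypothetical helper: resume one node, then show its state for a quick check.
undrain_and_check() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: undrain_and_check NODE-NAME" >&2
    return 2
  fi
  # State=RESUME clears the drain flag so the node can accept jobs again.
  scontrol update NodeName="$node" State=RESUME || return 1
  # Print the State and Reason lines so a repeat drain is easy to spot.
  scontrol show node "$node" | grep -E '(^|[[:space:]])(State|Reason)='
}
```

Running the scontrol show node step again after the node has accepted a job shows whether it was drained a second time.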
Undrain all drained nodes
To undrain all nodes currently in a drain or idle state, use the following command:
This targets the same set of nodes that the dl alias described above lists.
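A minimal sketch of a bulk undrain that does not depend on the alias output format. It assumes sinfo -t drained is an acceptable way to select fully drained nodes; the function name undrain_all_drained is hypothetical.

```shell
# Hypothetical helper: resume every node that has finished draining.
undrain_all_drained() {
  # -t drained selects nodes that are drained and have no running jobs.
  sinfo -h -N -t drained -o '%N' | sort -u | while read -r node; do
    [ -n "$node" ] || continue
    echo "resuming $node"
    scontrol update NodeName="$node" State=RESUME
  done
}
```

Because this resumes nodes regardless of why they were drained, review the drain reasons first and prefer a reason-based filter when only some nodes are known to be healthy.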
Undrain nodes by reason
Use grep to undrain nodes that have been drained for a specific reason. Before undraining nodes in this manner, run the following command to confirm which nodes will be processed:
For example, to undrain all nodes with scheduler: k8s pod deletion timeout for job listed in the drain reason:
The drain reasons to match against are the ones shown by the dl alias described above.
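The confirm-then-undrain flow can be sketched as two small functions. Both names are hypothetical, and the sketch reads reasons from sinfo directly (an assumption, chosen so that it does not rely on the alias output format) and matches the reason text literally with awk instead of grep, so that a node name can never match by accident.

```shell
# Hypothetical helper: print the names of drained nodes whose reason
# contains the given text.
nodes_drained_for() {
  if [ -z "$1" ]; then
    echo "usage: nodes_drained_for REASON-TEXT" >&2
    return 2
  fi
  # sinfo prints "node|reason"; everything after the first "|" is the reason.
  sinfo -h -N -t drain -o '%N|%E' |
    awk -F'|' -v want="$1" '{ reason = substr($0, length($1) + 2) }
                             index(reason, want) { print $1 }' |
    sort -u
}

# Hypothetical helper: resume each node matched by nodes_drained_for.
undrain_by_reason() {
  [ -n "$1" ] || return 2
  nodes_drained_for "$1" | while read -r node; do
    scontrol update NodeName="$node" State=RESUME
  done
}
```

Run nodes_drained_for with the reason text first and check the list, then call undrain_by_reason with the same text. The empty-argument guard keeps a missing reason from matching every drained node.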
Common drain reasons
Drain reasons related to Kubernetes
If the drain reason starts with k8s:, it means SUNK has drained the node due to a Kubernetes-related event, such as the node being cordoned. Draining for these reasons is often temporary and resolves automatically.
A drain reason of k8s: pod scheduled for deletion indicates that the node is waiting to update and will not accept new jobs to avoid disrupting active work. This often appears on Slurm nodes when the Kubernetes NodeSet is updated. Once the Compute pod restarts, Slurm should undrain the node without any action needed from you.
Drain reasons related to Slurm
If the reason does not include k8s, this likely indicates a Slurm failure, and you may need to manually undrain the node after fixing any underlying problems.
If the drain reason contains sunk:verify-undrain, the node will automatically be returned to service after passing the next hourly HPC Verification health check. If the node fails this health check, the underlying issue must be investigated. You can view the output of these health checks in the Node Details Grafana dashboard, or with the sn command.
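The triage rules in this section can be sketched as a small classifier. The function name and the three labels are hypothetical. A reason counts as Kubernetes-related only when it starts with k8s:, so a reason such as scheduler: k8s pod deletion timeout is still treated as Slurm-related.

```shell
# Hypothetical helper: sort a drain reason into one of three buckets.
classify_drain_reason() {
  case "$1" in
    k8s:*)                 echo "kubernetes" ;;    # usually resolves on its own
    *sunk:verify-undrain*) echo "auto-undrain" ;;  # returns after a passing check
    *)                     echo "slurm" ;;         # may need manual follow-up
  esac
}
```

For instance, classify_drain_reason "prolog pre-hook failed" prints slurm, which signals that the node may need a manual undrain once the underlying problem is fixed.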
The following table lists common Slurm-related drain reasons:
| Drain reason | Meaning | Safe to undrain? |
|---|---|---|
| NHC: [REASON] | There was a Node Health Check failure during the epilog task. | Yes. |
| sunk:verify-undrain | SUNK will automatically return the node to service after the next passing HPC Verification check. | Yes. |
| prolog pre-hook failed | A pre-hook task failed before starting the Slurm job. This is typically a temporary issue that will automatically resolve. | Yes. |
| scheduler: k8s pod deletion timeout | Slurm detected a failure during the scheduler-epilog.sh task. | Yes. |
| batch job complete failure | There have been intermittent errors with Slurm, or there has been a node failure. | Yes, if you have verified the node is healthy. |