drained state and remains unavailable for scheduling until it’s undrained (resumed). Unlike Kubernetes node draining, which evicts running pods, Slurm draining doesn’t terminate running workloads. It only stops new jobs from being assigned to the node.
SUNK login nodes include built-in aliases for draining, undraining, and monitoring node states. These are available in every SUNK login node shell by default:
| Alias | Description |
|---|---|
| drain [NODE-NAME] [REASON] | Drain a node with a reason, automatically prefixed with your username |
| undrain [NODE-NAME] | Undrain (resume) a single node |
| dl | List all nodes in drain and idle state with reasons |
| dld | List all nodes in drain state with reasons |
| sn | Show detailed node information (scontrol show node) |
When nodes are drained automatically
Before you manually drain or undrain a node, it helps to understand the cases where SUNK drains nodes on its own and resolves them without user intervention.
CoreWeave’s HPC Verification framework periodically checks the health of all compute nodes in a cluster. If a health check detects an issue, SUNK automatically drains the affected nodes. In some cases, CoreWeave’s services automatically drain, restart, recheck, and undrain the node with no action required from you.
Nodes can also be drained in response to an underlying error or event, such as a failing prolog or epilog script. These cases can require troubleshooting to determine the underlying cause.
Automatic undrain with health checks
When a drain reason includes sunk:verify-undrain, SUNK automatically undrains the node after it passes the next hourly HPC Verification health check. No user action is required. If the node fails the health check, it remains drained and you must investigate the underlying issue.
This mechanism also applies when you manually drain a node. If you drain a node and include sunk:verify-undrain in the reason, SUNK automatically returns the node to service once it passes the next health check. This is useful when you want to temporarily remove a node from service and have it automatically return after CoreWeave’s health checks confirm the node is healthy.
For example, to drain a node and have it automatically undrain after a passing health check:
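A minimal sketch of this pattern with plain scontrol follows. The function name drain_until_verified is hypothetical and is not one of the SUNK aliases. The only behavior taken from this page is that a reason containing sunk:verify-undrain triggers the automatic undrain.

```shell
# Hypothetical helper (not a SUNK alias): drain a node and mark it for
# automatic undrain after the next passing HPC Verification health check.
# Usage: drain_until_verified NODE-NAME REASON...
drain_until_verified() {
  node="$1"
  shift
  # Appending sunk:verify-undrain to the reason opts the node in to the
  # automatic undrain described above.
  scontrol update nodename="$node" state=drain reason="$* sunk:verify-undrain"
}
```

For example, `drain_until_verified [NODE-NAME] planned maintenance` sets the reason to "planned maintenance sunk:verify-undrain".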
When a node is in a drained state, you can identify the drain reason to determine whether you need to take action. Outside the automated HPC Verification cycle, you can manually drain a node if you need to temporarily prevent it from accepting new jobs, such as during maintenance, and manually undrain the node to return it to service.
CoreWeave’s documentation also includes an overview of Slurm node states, and further information about nodes in the INVAL state.
Connect to the Slurm login pod
To drain and undrain nodes, first connect to the Slurm login pod. After connecting to the Slurm login node, you can use the built-in aliases and scontrol commands in the following sections to examine and manage drained nodes.
Identify the drain reason
Before deciding whether to undrain a node, check the drain reason so you know whether the node is recovering automatically or requires manual intervention. To find out why a node is in a drain state, use the scontrol show node command in the form scontrol show node [NODE-NAME].
Replace [NODE-NAME] with the actual name of the node you want to check, or remove the [NODE-NAME] entirely to list all nodes.
The output of these commands shows a reason for the node’s drain state. Based on the reason listed, you can determine whether the drained state is due to a Kubernetes event or a Slurm issue and proceed accordingly.
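When you only need the reason and not the full node record, the output can be filtered. The sketch below assumes that scontrol show node prints the reason on its own line as Reason=..., which is the usual layout. The function name node_reason is hypothetical.

```shell
# Hypothetical helper: print only the drain reason that scontrol reports
# for one node. Prints nothing if the node has no reason set.
# Usage: node_reason NODE-NAME
node_reason() {
  # Keep only the line that starts with "Reason=" and strip that prefix.
  scontrol show node "$1" | sed -n 's/^ *Reason=//p'
}
```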
An asterisk * alongside a node state indicates that the node isn’t responding. Nodes in a drain* or down* state have been removed from the cluster and can be ignored. You can see this suffix if you check the state while the pod isn’t yet fully connected.
Aliases for node monitoring
CoreWeave provides several built-in aliases as part of the SUNK login node image for monitoring node states. These are available in every login node shell session. The sn alias runs scontrol show node:
The dl alias lists all nodes in drain and idle state, along with the reason:
The dl alias is equivalent to:
The dld alias lists all nodes in drain state (including those actively draining with running jobs):
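If the aliases aren’t available in your shell, a similar listing can be built from sinfo. This is a sketch only and is not the definition of the dl or dld aliases. The function name list_drained_nodes and the output format are assumptions.

```shell
# Hypothetical helper (not the dld alias definition): list nodes that are
# draining or drained, with state and reason, as NODE|STATE|REASON.
list_drained_nodes() {
  # -N: one line per node, -h: no header, -t drain: draining or drained nodes
  # %N: node name, %T: state, %E: reason
  # sort -u collapses duplicates for nodes that belong to several partitions.
  sinfo -N -h -t drain -o '%N|%T|%E' | sort -u
}
```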
Manually drain a Slurm node
Generally, SUNK automatically drains Slurm nodes when it finds an issue. If you plan to perform maintenance on a node or want to temporarily remove it from service, you can drain it manually. The simplest way to drain a node is with the built-in drain alias:
The drain alias automatically prefixes the reason with your username, resulting in a drain reason in the following format. Replace [USERNAME] with your username and [REASON] with the reason you provide.
To have SUNK automatically undrain the node after it passes the next health check, include sunk:verify-undrain in the reason:
You can also use scontrol update directly:
Replace [NODE-NAME] with the name of the node you’re draining.
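The direct scontrol form can be sketched as a small wrapper. The function name drain_node is hypothetical. Unlike the drain alias, this sketch doesn’t add a username prefix, and it refuses to run without a reason because Slurm expects one when a node is drained.

```shell
# Hypothetical helper: drain a node with scontrol update.
# Usage: drain_node NODE-NAME REASON...
drain_node() {
  # Require both a node name and a reason.
  if [ "$#" -lt 2 ]; then
    echo "usage: drain_node NODE-NAME REASON..." >&2
    return 1
  fi
  node="$1"
  shift
  # Running jobs keep running; the node only stops accepting new jobs.
  scontrol update nodename="$node" state=drain reason="$*"
}
```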
Manually undrain a Slurm node
SUNK automatically undrains drained nodes with sunk:verify-undrain listed in the drain reason after they pass the next hourly HPC Verification health check, with no user action required. You can manually undrain these nodes if you don’t want to wait for the automation to undrain them.
If you manually drained the node and have already corrected any underlying issues, you can manually undrain the node at your discretion.
Undrain a single node
The simplest way to undrain a node is with the built-in undrain alias:
Alternatively, use scontrol update directly to change the node state to resume:
Replace [NODE-NAME] with the name of the node you’re undraining.
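A minimal sketch of the scontrol form, paired with a quick state check, is shown here. Both function names are hypothetical, and the sinfo format string is one reasonable choice, not a SUNK convention.

```shell
# Hypothetical helper: return a drained node to service.
# Usage: undrain_node NODE-NAME
undrain_node() {
  scontrol update nodename="$1" state=resume
}

# Hypothetical helper: show the current state and reason for one node.
# Usage: node_state NODE-NAME
node_state() {
  # %N: node name, %T: state, %E: reason
  sinfo -N -h -n "$1" -o '%N %T %E' | sort -u
}
```

Running node_state after undrain_node shows whether the node stayed in service or was drained again.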
Monitor the node after undraining it. If the issue that caused the automatic drain persists, SUNK can automatically drain the node again the next time it attempts to run a job.
Undrain all drained nodes
To undrain all nodes currently in a drain or idle state, use the following command:
dl alias.
Undrain nodes by reason
Use grep to undrain nodes that have been drained for a specific reason. Before you undrain nodes in this manner, confirm which nodes the script processes:
For example, to undrain all nodes with scheduler: k8s pod deletion timeout for job listed in the drain reason:
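The two steps above, confirm first and then undrain, can be sketched as a pair of functions. This isn’t the original command from this page. It uses sinfo and awk in place of grep so that the text is matched only against the reason column and never against the node name. Both function names are hypothetical.

```shell
# Hypothetical helper: print the names of drained nodes whose reason
# contains the given text. Run this first to confirm the selection.
# Usage: nodes_drained_for "REASON TEXT"
nodes_drained_for() {
  sinfo -N -h -t drain -o '%N|%E' |
    awk -F'|' -v want="$1" 'index($2, want) > 0 { print $1 }' |
    sort -u
}

# Hypothetical helper: undrain every node selected by nodes_drained_for.
# Usage: undrain_by_reason "REASON TEXT"
undrain_by_reason() {
  nodes_drained_for "$1" | while read -r node; do
    # Redirect stdin so scontrol cannot consume the remaining node list.
    scontrol update nodename="$node" state=resume </dev/null
  done
}
```

For example, `nodes_drained_for "scheduler: k8s pod deletion timeout"` prints the affected nodes, and `undrain_by_reason` with the same argument resumes them.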
dl alias.
Common drain reasons
The following sections describe common drain reasons grouped by source, and indicate whether the node typically recovers on its own or requires manual intervention.
Drain reasons related to Kubernetes
If the drain reason starts with k8s:, SUNK has drained the node due to a Kubernetes-related event, such as a cordoned node. Draining for these reasons is often temporary and resolves automatically.
A drain reason of k8s: pod scheduled for deletion indicates that the node is waiting to update and doesn’t accept new jobs to avoid disrupting active work. This often appears on Slurm nodes when the Kubernetes NodeSet is updated. After the Compute pod restarts, Slurm undrains the node without any action needed from you.
Drain reasons related to Slurm
If the reason doesn’t include k8s, this likely indicates a Slurm failure, and you might need to manually undrain the node after fixing any underlying problems.
If the drain reason contains sunk:verify-undrain, SUNK automatically returns the node to service after it passes the next hourly HPC Verification health check. If the node fails this health check, you must investigate the underlying issue. You can view the output of these health checks in the Node Details Grafana dashboard, or with the sn command.
The following table lists common Slurm-related drain reasons:
| Drain reason | Meaning | Safe to undrain? |
|---|---|---|
| NHC: [REASON] | There was a Node Health Check failure during the epilog task. | Yes. |
| sunk:verify-undrain | SUNK automatically returns the node to service after the next passing HPC Verification check. | Yes. |
| prolog pre-hook failed | A pre-hook task failed before starting the Slurm job. This is typically a temporary issue that resolves automatically. | Yes. |
| scheduler: k8s pod deletion timeout | Slurm detected a failure during the scheduler-epilog.sh task. | Yes. |
| batch job complete failure | There have been intermittent errors with Slurm, or there has been a node failure. | Yes, if you have verified the node is healthy. |
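The triage rules in this section can be condensed into a small classifier. This is a sketch under the rules stated above: a reason containing sunk:verify-undrain recovers automatically, a reason starting with k8s: is Kubernetes-related, and anything else is treated as a likely Slurm issue. The function name and the three labels are hypothetical.

```shell
# Hypothetical helper: map a drain reason to a coarse category.
# Usage: classify_drain_reason "REASON TEXT"
classify_drain_reason() {
  case "$1" in
    # SUNK undrains the node after the next passing health check.
    *sunk:verify-undrain*) echo "auto-undrain" ;;
    # Kubernetes-related event, often temporary.
    k8s:*) echo "kubernetes" ;;
    # Likely a Slurm failure that may need a manual undrain.
    *) echo "slurm" ;;
  esac
}
```

A reason such as scheduler: k8s pod deletion timeout lands in the slurm category because it doesn’t start with k8s:, which matches the table above.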