Manage Node Pools
Delete, modify, and reboot Node Pools using the API or CKS Cloud Console
Nodes are currently added to clusters manually by CoreWeave. Please contact support for assistance.
Reboot Nodes using Conditioner
Nodes are managed as part of CoreWeave's Node Life Cycle. Required reboots are handled by the Node Life Cycle controller to ensure that customer workloads are not interrupted. If you need to reboot a Node manually, you can set a Node condition using the third-party kubectl plugin Conditioner.
For additional assistance using Conditioner, please contact support.
Prerequisites
Before using Conditioner, ensure the following prerequisites are in place:
- Kubectl installed locally
- An API Access Token
- Conditioner installed locally
When rebooting Nodes:
- You must provide CoreWeave advance notice before initiating a reboot.
- Limit the number of Nodes that are rebooted at one time to avoid service interruptions.
- Only reboot Nodes with the label `node.coreweave.cloud/reserved=<YOUR_ORG_ID>`, so that control plane Nodes continue to function without interruption.
- If you need to reboot control plane Nodes, please contact support for assistance.
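Before rebooting anything, you can confirm which Nodes carry your organization's reserved label. The sketch below uses a hypothetical `example-org` value for the organization ID; substitute your own.

```shell
# Hypothetical organization ID for illustration; substitute your own.
ORG_ID="example-org"
LABEL="node.coreweave.cloud/reserved=${ORG_ID}"

# List only the Nodes carrying your organization's reserved label;
# these are the only Nodes that are safe to reboot yourself.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -l "${LABEL}"
fi
```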
Reboot methods
You can request a Node reboot either by setting a condition on the Node or by using Slurm's `scontrol` command.
Set a condition on the Node
CoreWeave uses two Node conditions to manage reboots, depending on the desired behavior:
- `AdminSafeReboot`: Marks the Node to reboot when it is idle, only after all running jobs are complete. When all Pods have exited, the Node reboots cleanly. Pods may still be scheduled to a Node marked with this condition.
- `AdminImmediateReboot`: Reboots a Node without waiting for all jobs to complete. When this condition is set, a termination signal is sent to all Pods. The Node reboots as soon as all Pods have terminated, or after ten minutes, whichever happens first.
To set a condition on a Node, use the Conditioner plugin. For example, to set the `AdminSafeReboot` condition on a Node named `node-1`, run:

```shell
$ kubectl conditioner node-1 --type AdminSafeReboot --status true --reason "Reason for this reboot"
```
The `--type` and `--status` flags are required. The `--reason` flag is optional, but providing a reason for the reboot is a best practice.
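After setting a condition, you can confirm it was recorded in the Node's status with a standard `kubectl` JSONPath query. This is a minimal sketch, assuming the `node-1` name from the example above:

```shell
NODE="node-1"  # Node name from the example above

# Inspect the Node's status conditions to confirm AdminSafeReboot was
# recorded; prints the condition's status (e.g. "True") when present.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get node "${NODE}" \
    -o jsonpath='{.status.conditions[?(@.type=="AdminSafeReboot")].status}'
fi
```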
For more information, see the official Conditioner documentation.
Use scontrol (Slurm)
To safely reboot Nodes in a cluster when they become idle, use `scontrol reboot`. Slurm ensures the Node is idle, then reboots it by setting the `AdminImmediateReboot` condition.
To reboot `example-node` with `scontrol`, run:

```shell
$ scontrol reboot example-node
```
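You can watch the Node's Slurm state while the reboot is pending. A minimal sketch, assuming the `example-node` name from above; the exact state strings (such as a pending-reboot flag before the Node returns to idle) vary by Slurm version:

```shell
NODE="example-node"  # Node name from the example above

# Show the node's current Slurm state; while a reboot is pending,
# the state line reflects that before the node returns to idle.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show node "${NODE}" | grep -i "state"
fi
```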
For more information, see the Slurm documentation.