Reboot Nodes
Reboot Nodes manually to apply system updates or other changes
Nodes are managed as part of CoreWeave's Node lifecycle. Required reboots are handled by the Node Lifecycle Controller to ensure that customer workloads are not interrupted.
This guide explains how to reboot Nodes manually, for example to apply a system update or other change.
If you need to reboot Nodes to apply Node Pool updates, see Apply Node Pool updates.
Types of Node reboots
You can reboot Nodes whether or not they still have active workloads. Active Nodes are those with the CWActive condition set to true, which indicates they have active workloads such as non-DaemonSet Pods or Slurm jobs.
- A safe reboot marks the Node to reboot once it becomes idle. The reboot occurs only after all active workloads have stopped. New Pods may still be scheduled while waiting.
- A force reboot reboots the Node immediately without waiting for active workloads to stop.
You can check a Node's active status by using the cwic node get or cwic node describe commands.
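For example, to check whether a Node named node-1 (a placeholder name) still has active workloads, you could run:
$ cwic node get node-1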
After the Node has rebooted, it remains in a reboot state while a short self-test runs. Once the test passes, the Node returns to a production state and accepts customer workloads.
Prerequisites
Before you begin, ensure you have:
- An active CoreWeave account
- An API Access Token
- Kubectl and the latest version of the CoreWeave Intelligent CLI installed locally
The following tools are optional, but can be helpful:
- scontrol installed locally (for Slurm users)
Requesting a Node reboot
You can request a Node reboot in the following ways:
- Using the CoreWeave Intelligent CLI to request a reboot (recommended)
- Using Slurm's scontrol command (for Slurm users)
Using the CoreWeave Intelligent CLI
To use the CoreWeave Intelligent CLI to reboot a Node, run the following command, choosing a flag from the list below and replacing [list-of-space-separated-nodes] with the list of Nodes you want to reboot:
$ cwic node reboot [--force|--safe|--unset] [list-of-space-separated-nodes]
The --force flag queues a force reboot, the --safe flag queues a safe reboot, and the --unset flag clears the reboot condition.
For example, to queue a force reboot for the Nodes node-1 and node-2, run:
$ cwic node reboot --force node-1 node-2
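If you change your mind before the reboot occurs, you can clear the pending reboot condition with the --unset flag. For example:
$ cwic node reboot --unset node-1 node-2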
Adding a reason for the reboot is optional, but recommended. To set a message for the reboot, use the --message flag. For example, run:
$ cwic node reboot --message "Reason for this reboot" node-1 node-2
Run cwic node reboot --help for more information.
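As a sketch, assuming the flags can be combined in a single invocation, a safe reboot with a recorded reason might look like:
$ cwic node reboot --safe --message "Apply kernel update" node-1 node-2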
Using scontrol (Slurm)
To safely reboot Nodes in a cluster when they become idle, use scontrol reboot. Slurm waits for the Node to become idle, then triggers the reboot by setting the appropriate reboot condition.
To reboot example-node with scontrol, run:
$ scontrol reboot example-node
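Slurm's scontrol reboot also accepts additional options, such as ASAP to stop scheduling new jobs on the Node and reason= to record why the reboot was requested. As an example, assuming your Slurm version supports these options:
$ scontrol reboot ASAP reason="kernel update" example-node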
For more information, see the Slurm documentation.