
Manage Node Pools

Delete, modify, and reboot Node Pools using the API or CKS Cloud Console

Info

Nodes are currently added to clusters manually by CoreWeave. Please contact support for assistance.

Reboot Nodes using Conditioner

Nodes are managed as part of CoreWeave's Node Life Cycle. Required reboots are handled by the Node Life Cycle controller to ensure that customer workloads are not interrupted. If you need to reboot a Node manually, set a Node condition using the third-party kubectl plugin Conditioner.

Info

For additional assistance using Conditioner, please contact support.

Prerequisites

Before using Conditioner, review the following requirements.

Important

When rebooting Nodes:

  • You must provide CoreWeave advance notice before initiating a reboot.
  • Limit the number of Nodes that are rebooted at one time to avoid service interruptions.
  • Only reboot Nodes with the label node.coreweave.cloud/reserved=<YOUR_ORG_ID>, so that control plane Nodes continue to function without interruption. To list these Nodes, see the example after this list.
  • If you need to reboot control plane Nodes, please contact support for assistance.
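
For example, to list the Nodes that are safe to reboot, filter by this label with a standard kubectl label selector. Replace <YOUR_ORG_ID> with your organization ID:

Example
$ kubectl get nodes -l node.coreweave.cloud/reserved=<YOUR_ORG_ID>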

Reboot methods

You can request a Node reboot either by setting a condition on the Node or by using Slurm's scontrol command.

Set a condition on the Node

CoreWeave uses two Node conditions to manage reboots, depending on the desired behavior:

AdminSafeReboot

Marks the Node to reboot only when it is idle, after all running jobs are complete.

When all Pods have exited, the Node reboots cleanly. Pods may still be scheduled to a Node marked with this condition.

AdminImmediateReboot

Reboots a Node without ensuring all jobs are complete.

When this condition is set, a termination signal is sent to all Pods. The Node reboots as soon as all Pods have terminated, or after ten minutes, whichever happens first.
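
To follow an immediate reboot as the Node drains, you can watch the Pods on the affected Node terminate. This is a standard kubectl invocation, not part of the reboot mechanism itself; node-1 is a placeholder Node name:

Example
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=node-1 --watch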

To set a condition on a Node, use the Conditioner plugin. For example, to set the AdminSafeReboot condition on a Node named node-1, run:

Example
$ kubectl conditioner node-1 --type AdminSafeReboot --status true --reason "Reason for this reboot"

The --type and --status flags are required. The --reason flag is optional, but providing a reason is a best practice.
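
To confirm that the condition was applied, inspect the Node's status conditions. This uses standard kubectl JSONPath filtering and assumes the node-1 name from the example above:

Example
$ kubectl get node node-1 -o jsonpath='{.status.conditions[?(@.type=="AdminSafeReboot")]}'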

For more information, see the official Conditioner documentation.

Use scontrol (Slurm)

To safely reboot Nodes in a cluster when they become idle, use scontrol reboot. Slurm waits until the Node is idle, then reboots it by setting the AdminImmediateReboot condition.

To reboot example-node with scontrol, run:

Example
$ scontrol reboot example-node
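
To check the Node's state after requesting the reboot, you can query it with sinfo. This is standard Slurm tooling; example-node matches the example above:

Example
$ sinfo -n example-node -o "%N %T"

Standard Slurm also supports scontrol reboot ASAP, which drains the Node so that no new jobs start while running jobs finish.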

For more information, see the Slurm documentation.