Manage Node Pools

Delete and modify Node Pools, and reboot Nodes, using the API or the CKS Cloud Console

Node Pools can be modified after they have been created, either by editing them in the Cloud Console or by applying an adjusted manifest with Kubernetes.

Modify a Node Pool using the Cloud Console

To modify a Node Pool using the Cloud Console, navigate to the Node Pool page from the left-hand menu.

All existing Node Pools are listed on the Node Pool dashboard, including the Node Pool containing control plane CPU Nodes. To modify a deployed Node Pool, click the vertical dot menu to the right of the Node Pool, then click Edit to open the manifest editor.

Once the desired changes are made, click the Submit button.

Note

Changes may take a moment to display on the Cloud Console. To learn more about the current status of the Node Pool, hover over the status in the dashboard.

Delete a Node Pool using the Cloud Console

To delete a Node Pool using the Cloud Console, navigate to the Node Pool dashboard from the left-hand menu. Click the vertical dot menu beside the Node Pool to delete, then click Delete to open the confirmation modal.

Enter the name of the Node Pool to confirm deletion. The dashboard will update immediately, removing the deleted Node Pool from the list.

Modify a Node Pool using Kubernetes

To modify a Node Pool, first edit the Node Pool manifest, then apply the changed manifest using Kubectl. For example, take a Node Pool deployed with this manifest:

example-nodepool.yaml
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  instanceType: gd-8xh100-i128
  autoscaling: false
  targetNodes: 10
  minNodes: 0
  maxNodes: 0

This manifest deploys a Node Pool with 10 Nodes. To reduce it to 5 Nodes, first adjust the manifest to change the targetNodes value to 5, as shown here:

example-nodepool.yaml
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  instanceType: gd-8xh100-i128
  autoscaling: false
  targetNodes: 5
  minNodes: 0
  maxNodes: 0

Note

If you are changing the value of targetNodes, first ensure nothing is running in the Node Pool. CKS does not wait for workloads to complete before removing Nodes.
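As a quick check before scaling down, you can list the Pods scheduled on a Node in the pool. This is a standard Kubectl query; the Node name below is a placeholder to replace with one of your own Nodes.

Example
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME>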

To apply the changes, use Kubectl to apply the updated manifest:

Example
$ kubectl apply -f example-nodepool.yaml

Once the manifest is applied, CKS will adjust the number of Nodes in the Node Pool accordingly.

Note

Removing Nodes from a Node Pool may take a moment.
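To follow progress as Nodes are removed, one option is to watch the cluster's Node list; --watch is a standard Kubectl flag rather than a CKS-specific feature.

Example
$ kubectl get nodes --watch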

Verify the updated Node Pool

To check the status of the modified Node Pool, target the Node Pool with kubectl get nodepool. For example:

Example command
$ kubectl get nodepool example-nodepool

This returns information about the current status of the targeted Node Pool, such as:

Example output
| NAME             | INSTANCE TYPE  | TARGET | CURRENT | READY | REASON                 | AGE |
|------------------|----------------|--------|---------|-------|------------------------|-----|
| example-nodepool | gd-8xh100-i128 | 3      | 0       | FALSE | NodeAllocationComplete | 1h  |

Once the adjustment is complete, the CURRENT value should match the TARGET value.
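To wait for the adjustment without re-running the command, you can stream status updates for the Node Pool using the standard --watch flag.

Example
$ kubectl get nodepool example-nodepool --watch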

Reboot Nodes on CKS

If a Node must be rebooted manually, this can be done by setting a Node condition using the third-party Kubectl plugin, Conditioner.

Before you start

Nodes are managed as part of CoreWeave's Node Life Cycle. Required reboots are handled by the Node Life Cycle controller to ensure that customer workloads are not interrupted.

This guide covers the exceptional situations in which Nodes may need to be rebooted manually.

Prerequisites

Before using Conditioner, review the following requirements.

Important

When rebooting Nodes:

  • You must provide CoreWeave advance notice before initiating a reboot.
  • Limit the number of Nodes that are rebooted at one time to avoid service interruptions.
  • Only reboot Nodes with the label node.coreweave.cloud/reserved=<YOUR_ORG_ID>, so that control plane Nodes continue to function without interruption, as shown in the example after this list.
  • If you need to reboot control plane Nodes, please contact support for assistance.
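For example, to list only the Nodes that carry the reserved label, and are therefore candidates for a reboot request, you can use a standard label selector; replace <YOUR_ORG_ID> with your organization ID.

Example
$ kubectl get nodes -l node.coreweave.cloud/reserved=<YOUR_ORG_ID>
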
Note

Do not use Node conditions for automation; they are intended for internal use only. CoreWeave may cordon Nodes for maintenance or to resolve temporary issues, and conditions are not intended as a basis for custom management automation.

Reboot methods

You can request a Node reboot by either setting a condition on the Node or by using Slurm's scontrol command.

Set a condition on the Node

CoreWeave uses two Node conditions to manage reboots, depending on the desired behavior:

AdminSafeReboot

Marks the Node to reboot when it is idle, only after all running jobs are complete.

When all Pods have exited, the Node reboots cleanly. Pods may still be scheduled to a Node marked with this condition.

AdminImmediateReboot

Reboots a Node without ensuring all jobs are complete.

When this condition is set, a termination signal is sent to all Pods. The Node reboots as soon as all Pods have terminated, or after ten minutes, whichever happens first.

To set a condition, use the Conditioner plugin. For example, to set the AdminSafeReboot condition on a Node named node-1, run:

Example
$ kubectl conditioner node-1 --type AdminSafeReboot --status true --reason "Reason for this reboot"

The --type and --status flags are required. The --reason flag is optional, but providing a reason is a best practice.

For more information, see the official Conditioner documentation.
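To confirm that the condition was recorded on the Node, you can inspect the Node's status conditions with a standard Kubectl jsonpath query; the condition type shown matches the example above.

Example
$ kubectl get node node-1 -o jsonpath='{.status.conditions[?(@.type=="AdminSafeReboot")]}'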

Use scontrol (Slurm)

To safely reboot Nodes in a cluster when they become idle, use scontrol reboot. Slurm ensures the Node is idle, then reboots it by setting the AdminImmediateReboot condition.

To reboot example-node with scontrol, run:

Example
$ scontrol reboot example-node

For more information, see the Slurm documentation.
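To check whether the reboot is pending or complete, you can inspect the node's state with standard Slurm tooling.

Example
$ scontrol show node example-node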

Info

For additional assistance using Conditioner, please contact support.

Delete a Node Pool using Kubernetes

To delete a Node Pool using Kubernetes, delete the Node Pool resource directly using Kubectl:

Example
$ kubectl delete nodepool example-nodepool

Deleting the nodepool resource first removes all Nodes associated with the Node Pool from the cluster, then deletes the Node Pool resource itself.

Warning

To avoid data loss when removing Node Pools, first ensure nothing is running in the Node Pool before deleting it. CKS does not wait for workloads to complete before removing Nodes.
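As a conservative precaution, one approach is to cordon and drain each Node in the pool before deleting it, so running Pods are evicted gracefully; the Node name below is a placeholder.

Example
$ kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data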