Manage Node Pools
Delete, modify, and reboot Node Pools using the API or CKS Cloud Console
Nodes are currently added to clusters manually by CoreWeave. Please contact support for assistance.
Reboot Nodes using Conditioner
Nodes are managed as part of CoreWeave's Node Life Cycle. Required reboots are handled by the Node Life Cycle controller to ensure that customer workloads are not interrupted. If you need to reboot a Node manually, you can set a Node condition using the third-party kubectl plugin Conditioner.
For additional assistance using Conditioner, please contact support.
Prerequisites
Before using Conditioner, ensure the following prerequisites are in place:
- Kubectl installed locally
- An API Access Token
- Conditioner installed locally
When rebooting Nodes:
- You must provide CoreWeave advance notice before initiating a reboot.
- Limit the number of Nodes that are rebooted at one time to avoid service interruptions.
- Only reboot Nodes with the label `node.coreweave.cloud/reserved=<YOUR_ORG_ID>`, so that control plane Nodes continue to function without interruption.
- If you need to reboot control plane Nodes, please contact support for assistance.
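Before rebooting anything, you can confirm which Nodes carry your organization's reserved label. The sketch below uses a hypothetical `example-org` value for the organization ID; substitute your own.

```shell
# Hypothetical organization ID for illustration; substitute your own.
ORG_ID="example-org"
LABEL="node.coreweave.cloud/reserved=${ORG_ID}"

# List only the Nodes carrying your organization's reserved label;
# these are the only Nodes that are safe to reboot yourself.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -l "${LABEL}"
fi
```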
Reboot methods
You can request a Node reboot either by setting a condition on the Node or by using Slurm's `scontrol` command.
Set a condition on the Node
CoreWeave uses two Node conditions to manage reboots, depending on the desired behavior:
- `AdminSafeReboot`: Marks the Node to reboot when it is idle, only after all running jobs are complete. When all Pods have exited, the Node reboots cleanly. Pods may still be scheduled to a Node marked with this condition.
- `AdminImmediateReboot`: Reboots a Node without waiting for all jobs to complete. When this condition is set, a termination signal is sent to all Pods. The Node reboots as soon as all Pods have terminated, or after ten minutes, whichever happens first.
To set a condition on a Node, use the Conditioner plugin. For example, to set the `AdminSafeReboot` condition on a Node named `node-1`, run:

```shell
$ kubectl conditioner node-1 --type AdminSafeReboot --status true --reason "Reason for this reboot"
```
The `--type` and `--status` flags are required. The `--reason` flag is optional, but providing a reason for the reboot is a best practice.
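After setting a condition, you can confirm it was recorded in the Node's status with a standard `kubectl` JSONPath query. This is a minimal sketch, assuming the `node-1` name from the example above:

```shell
NODE="node-1"  # Node name from the example above

# Inspect the Node's status conditions to confirm AdminSafeReboot was
# recorded; prints the condition's status (e.g. "True") when present.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get node "${NODE}" \
    -o jsonpath='{.status.conditions[?(@.type=="AdminSafeReboot")].status}'
fi
```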
For more information, see the official Conditioner documentation.
Use scontrol (Slurm)
To safely reboot Nodes in a cluster when they become idle, use `scontrol reboot`. Slurm ensures the Node is idle, then reboots it by setting the `AdminImmediateReboot` condition.
To reboot `example-node` with `scontrol`, run:

```shell
$ scontrol reboot example-node
```
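You can watch the Node's Slurm state while the reboot is pending. A minimal sketch, assuming the `example-node` name from above; the exact state strings (such as a pending-reboot flag before the Node returns to idle) vary by Slurm version:

```shell
NODE="example-node"  # Node name from the example above

# Show the node's current Slurm state; while a reboot is pending,
# the state line reflects that before the node returns to idle.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show node "${NODE}" | grep -i "state"
fi
```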
For more information, see the Slurm documentation.