Manage Node Pools
Delete and modify Node Pools, and reboot Nodes, using the CKS Cloud Console or Kubernetes
Node Pools can be modified after they are created, either by editing them in the Cloud Console or by applying an adjusted manifest with Kubernetes.
Modify a Node Pool using the Cloud Console
To modify a Node Pool using the Cloud Console, navigate to the Node Pool page from the left-hand menu.
All existing Node Pools are listed on the Node Pool dashboard, including the Node Pool containing control plane CPU Nodes. To modify a deployed Node Pool, click the vertical dot menu to the right of the Node Pool, then click Edit to open the manifest editor.
Once the desired changes are made, click the Submit button.
Changes may take a moment to display on the Cloud Console. To learn more about the current status of the Node Pool, hover over the status in the dashboard.
Delete a Node Pool using the Cloud Console
To delete a Node Pool using the Cloud Console, navigate to the Node Pool dashboard from the left-hand menu. Click the vertical dot menu beside the Node Pool to delete, then click Delete to open the confirmation modal.
Enter the name of the Node Pool to confirm deletion. The dashboard will update immediately, removing the deleted Node Pool from the list.
Modify a Node Pool using Kubernetes
To modify a Node Pool, first edit the Node Pool manifest, then apply the changed manifest using Kubectl. For example, take a Node Pool deployed with this manifest:
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  instanceType: gd-8xh100-i128
  autoscaling: false
  targetNodes: 10
  minNodes: 0
  maxNodes: 0
This manifest deploys a Node Pool with 10 Nodes. To change it to have only 5 Nodes, first adjust the manifest to change the targetNodes value to 5, as shown here:
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  instanceType: gd-8xh100-i128
  autoscaling: false
  targetNodes: 5
  minNodes: 0
  maxNodes: 0
If you are changing the value of targetNodes, first ensure nothing is running in the Node Pool. CKS does not wait for workloads to complete before removing Nodes.
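One way to check is to list the Pods running on a Node that may be removed; a minimal sketch, substituting one of the Node Pool's Node names for <node-name>:
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
If this returns no workload Pods, the Node can be removed without interrupting jobs.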
When ready, apply the updated manifest using Kubectl:
$ kubectl apply -f example-nodepool.yaml
Once the manifest is applied, CKS will adjust the number of Nodes in the Node Pool accordingly.
Removing Nodes from a Node Pool may take a moment.
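To watch Nodes leave (or join) the cluster while CKS applies the change, Kubectl's watch flag offers a quick check:
$ kubectl get nodes --watch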
Verify the updated Node Pool
To check the status of the modified Node Pool, target the Node Pool with kubectl get nodepool. For example:
$ kubectl get nodepool example-nodepool
This returns information about the current status of the targeted Node Pool, such as:
| NAME             | INSTANCE TYPE  | TARGET | CURRENT | READY | REASON                 | AGE |
|------------------|----------------|--------|---------|-------|------------------------|-----|
| example-nodepool | gd-8xh100-i128 | 3      | 0       | FALSE | NodeAllocationComplete | 1h  |
Once the adjustment is complete, the value in the CURRENT column should match the TARGET value.
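For more detail than the summary columns provide, the full Node Pool resource, including its status block, can be printed as YAML:
$ kubectl get nodepool example-nodepool -o yaml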
Reboot Nodes on CKS
If a Node must be rebooted manually, this can be done by setting a Node condition using the third-party Kubectl plugin, Conditioner.
Before you start
Please note that Nodes are managed as part of CoreWeave's Node Life Cycle. Required reboots are handled by the Node Life Cycle controller to ensure that customer workloads are not interrupted. This guide covers the exceptional situations in which Nodes may need to be rebooted manually.
Prerequisites
Before using Conditioner, ensure you have the following:
- Kubectl installed locally
- An API Access Token
- Conditioner installed locally
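To confirm that Kubectl has picked up Conditioner after installation, list the plugins found on your PATH; this assumes the plugin binary follows the standard kubectl-<name> naming convention:
$ kubectl plugin list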
When rebooting Nodes:
- You must provide CoreWeave advance notice before initiating a reboot.
- Limit the number of Nodes that are rebooted at one time to avoid service interruptions.
- Only reboot Nodes featuring the label node.coreweave.cloud/reserved=<YOUR_ORG_ID>, so that control plane Nodes continue to function without interruption (see the example after this list).
- If you need to reboot control plane Nodes, please contact support for assistance.
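To see which Nodes carry the reserved label, and are therefore candidates for a manual reboot, filter by that label, substituting your organization ID:
$ kubectl get nodes -l node.coreweave.cloud/reserved=<YOUR_ORG_ID>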
Users should not rely on Node conditions for automation, as Node conditions are intended for internal use only. CoreWeave may cordon Nodes for maintenance purposes or to resolve temporary issues. Conditions are not intended as a basis for clients' custom management automation.
Reboot methods
You can request a Node reboot either by setting a condition on the Node or by using Slurm's scontrol command.
Set a condition on the Node
CoreWeave uses two Node conditions to manage reboots, depending on the desired behavior:
- AdminSafeReboot: Marks the Node to reboot when it is idle, only after all running jobs are complete. When all Pods have exited, the Node reboots cleanly. Pods may still be scheduled to a Node marked with this condition.
- AdminImmediateReboot: Reboots a Node without ensuring all jobs are complete. When this condition is set, a termination signal is sent to all Pods. The Node reboots as soon as all Pods have terminated or after ten minutes, whichever happens first.
To set a condition on a Node, use the Conditioner plugin. For example, to set the AdminSafeReboot condition on a Node named node-1, run:
$ kubectl conditioner node-1 --type AdminSafeReboot --status true --reason "Reason for this reboot"
The --type and --status flags are required. The --reason flag is optional, but providing a reason is a best practice.
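To verify that the condition was recorded, the Node's status conditions can be inspected directly; for example, with a JSONPath filter on the condition type set above:
$ kubectl get node node-1 -o jsonpath='{.status.conditions[?(@.type=="AdminSafeReboot")].status}'
A value of True indicates the condition is in place.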
For more information, see the official Conditioner documentation.
Use scontrol (Slurm)
To safely reboot Nodes in a cluster when they become idle, use scontrol reboot. Slurm ensures the Node is idle, then reboots it by setting the AdminImmediateReboot condition.
To reboot example-node with scontrol, run:
$ scontrol reboot example-node
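To check whether the reboot is still pending or the Node has returned to service, standard Slurm commands can be used; for example, to show the Node's current state:
$ scontrol show node example-node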
For more information, see the Slurm documentation.
For additional assistance using Conditioner, please contact support.
Delete a Node Pool using Kubernetes
To delete a Node Pool using Kubernetes, delete the Node Pool resource directly using Kubectl:
$ kubectl delete nodepool example-nodepool
Deleting the nodepool resource first removes all Nodes associated with the Node Pool from the cluster, then deletes the Node Pool resource itself.
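To confirm the Node Pool name before deleting it, or to verify afterward that the resource is gone, list the Node Pool resources in the cluster:
$ kubectl get nodepool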
To avoid data loss when removing Node Pools, first ensure nothing is running in the Node Pool before deleting it. CKS does not wait for workloads to complete before removing Nodes.