Autoscale Node Pools

Scale Node Pools based on workload resource utilization

Preview feature

Node Pool autoscaling is currently in preview and has the following limitations:

  • No SUNK integration: SUNK does not currently support autoscaling Node Pools.

  • Scale-up time: Scaling up a cluster takes 20 to 30 minutes because the autoscaling process involves safely rebooting and re-adding bare metal Nodes. CoreWeave Nodes are bare metal to provide performance benefits and hardware-level access. We are working to optimize the reboot time.

CKS supports scaling Node Pools by using the Kubernetes Cluster Autoscaler, allowing you to scale CKS Node Pools in response to workload demands for GPU, CPU, or memory resources.

Cluster Autoscaler is enabled by default in all CKS clusters running Kubernetes 1.32 or later. To upgrade your clusters to the latest version, see Upgrade Kubernetes.

Configure autoscaling

The autoscaler adjusts the Node Pool's targetNodes value within the minNodes and maxNodes range that you define in the Node Pool manifest.

To enable autoscaling, set the following values:

  • autoscaling: Set to true.
  • maxNodes: Set to the maximum number of Nodes you want to scale up to.
  • minNodes: Set to the minimum number of Nodes you want to scale down to.

The following example Node Pool manifest sets these values:

  • autoscaling: true
  • maxNodes: 4
  • minNodes: 2
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  computeClass: default
  autoscaling: true # Set autoscaling to true
  lifecycle:
    scaleDownStrategy: PreferIdle # Scale down the cluster as quickly as possible; see "Autoscaling behavior"
  instanceType: gd-8xh100ib-i128 # Select your desired instance type
  maxNodes: 4 # Set desired maximum Nodes
  minNodes: 2 # Set desired minimum Nodes
  targetNodes: 2
  nodeLabels:
    my-label/node: "true"
  nodeAnnotations:
    my-annotation/node: "true"
  nodeTaints:
    - key: node-taint
      value: "true"
      effect: NoSchedule

Autoscaling behavior

Autoscaling increases or decreases the number of Nodes in a Node Pool in the following cases:

  • Scale up: When CKS cannot schedule Pods due to insufficient resources, such as CPU or memory, CKS scales up the Node Pool. For more information, see the section How does scale-up work? in the Kubernetes Cluster Autoscaler documentation.
  • Scale down: When CKS determines that Nodes are underutilized for a configured period, CKS scales down the Node Pool. For more information, see the section How does scale-down work? in the Kubernetes Cluster Autoscaler documentation.

The selected Node Pool scaling strategy affects how quickly the cluster autoscaler can scale down your Nodes. If you want to scale down a cluster as quickly as possible, or if you want to bin-pack workloads, use the aggressive PreferIdle strategy. If you have training jobs or other workloads that cannot be disrupted, use the cautious IdleOnly strategy. See Node Pool Scaling strategies for more information.
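
For example, a Node Pool that runs disruption-sensitive training jobs might use the IdleOnly strategy instead of PreferIdle. The following excerpt is a minimal sketch that uses the same spec fields as the example manifest above:

spec:
  autoscaling: true
  lifecycle:
    scaleDownStrategy: IdleOnly # Cautious strategy for workloads that can't be disrupted; see "Autoscaling behavior"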

Node selectors and autoscaling

Cluster Autoscaler can scale the appropriate Node Pool when a Pod cannot be scheduled due to resource limits. CKS decides which Node Pool to scale based on the placement requirements defined in the Pod specification, for example, in the nodeSelector or affinity fields. These fields help the autoscaler choose a Node Pool that matches the Pod's requirements. If you don't specify Pod placement requirements, the autoscaler may scale any available Node Pool.
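
As a sketch, the following Pod targets the example-nodepool defined earlier by selecting on the my-label/node label that the Node Pool applies to its Nodes. The Pod name, image, and GPU request are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload # Illustrative name
spec:
  nodeSelector:
    my-label/node: "true" # Matches the nodeLabels set on example-nodepool
  containers:
    - name: app
      image: nvidia/cuda:12.3.0-devel-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 8 # Unschedulable GPU requests trigger a scale-up of the matching Node Pool
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"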

Autoscaling considerations

For autoscaling to work, the following criteria must be met:

  • Available quota: Your organization must have available quota that meets or exceeds the number specified in the maxNodes field. For example, if you set maxNodes to 10, you must have quota for 10 Nodes available in your organization. To check your org's quota, see the quota reference documentation.

  • Available capacity: The region where your cluster exists must have the capacity to provision the Nodes. If the region doesn't have capacity to provision Nodes, CKS cannot scale your Node Pools. To determine your org's capacity, see the capacity reference documentation.

Scale-to-zero

The scale-to-zero feature is useful when you want to minimize resource costs.

For CKS to scale a Node Pool to zero, you'll need at least one other Node Pool in the cluster that you won't scale to zero. This other Node Pool runs the Konnectivity Agent for network connectivity.

To ensure a Node Pool can scale to zero, do the following:

  • On the Node Pool you want to scale to zero, set minNodes to 0 and maxNodes to a value greater than 0. This setting allows the Node Pool to scale down to zero Nodes when there's no demand. A minimal sketch of such a Node Pool appears after this list.

  • Create another Node Pool (for example, with a less expensive instance type) and configure its manifest so that the required Konnectivity Agents run on it. The sample Node Pool manifest below schedules the Konnectivity Agents onto this pool:

    Example
    apiVersion: compute.coreweave.com/v1alpha1
    kind: NodePool
    metadata:
      name: konnectivity-agents
    spec:
      # NOTE: OTHER FIELDS NOT SHOWN
      nodeLabels:
        cks.coreweave.cloud/system-critical: "true"

    Be sure to set the cks.coreweave.cloud/system-critical: "true" label in nodeLabels on this Node Pool.
  • Alternatively, on your main Node Pool, set minNodes to at least 2 so that Konnectivity always has the required number of Nodes to run on and doesn't affect the Autoscaler's scaling decisions.
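
The following is a minimal sketch of a Node Pool that can scale to zero. The name is illustrative, and the other fields follow the example manifest shown earlier:

apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: scale-to-zero-example # Illustrative name
spec:
  computeClass: default
  autoscaling: true
  instanceType: gd-8xh100ib-i128 # Select your desired instance type
  minNodes: 0 # Allows the pool to scale down to zero Nodes
  maxNodes: 4 # Must be greater than 0
  targetNodes: 1 # Start with one Node so the Autoscaler can cache its shape; see "Troubleshooting autoscaling behavior"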

Using scale-to-zero with multiple Node Pools

When you enable scale-to-zero on multiple autoscaled Node Pools with different Node configurations, the autoscaler doesn't know in advance which Node Pool to scale up for a pending Pod. Because no Nodes are running, it can't match the Pod's requirements to an existing Node type.

As a result, if more than one Node Pool is scaled to zero, the autoscaler might first scale up an incompatible Node Pool, fail to schedule the Pod, and then try other Node Pools until it finds a match. These repeated scale-up attempts can delay the Pod's startup.

Monitoring cluster autoscaler

To view logs in CoreWeave Grafana, navigate to Explore and use CoreWeave Logs. You can search for the string app="cluster-autoscaler" to find Cluster Autoscaler log entries.

To view metrics in CoreWeave Grafana, navigate to Explore and use CoreWeave Metrics. All Cluster Autoscaler metrics are prefixed with cluster_autoscaler_, so you can find them by searching for cluster_autoscaler. For more information, see the Kubernetes Cluster Autoscaler Monitoring documentation.

Test autoscaling

You can test your autoscaling configuration with the following workload. The workload requests eight GPUs on each of four Nodes, so if it runs on a Node Pool with fewer than four Nodes available, Cluster Autoscaler adds the instances needed to accommodate it.

Note that the workload uses the nodeSelector field to specify the required instance type. When a cluster has multiple Node Pools, the nodeSelector field tells the autoscaler which Node Pool to scale.

apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-l40-gpu-job
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: nvidia-l40-gpu
        gpu.nvidia.com/class: L40
        gpu.nvidia.com/model: L40
        gpu.nvidia.com/vram: "46"
    spec:
      restartPolicy: Never
      containers:
        - name: gpu-app
          image: nvidia/cuda:12.3.0-devel-ubuntu22.04
          command: ["/bin/bash", "-c"]
          args:
            - |
              apt-get update && apt-get install -y build-essential cmake make git && \
              git clone https://github.com/NVIDIA/cuda-samples.git && \
              cd cuda-samples && mkdir build && cd build && \
              cmake .. -DCMAKE_CUDA_ARCHITECTURES=89 && \
              echo "start here" && make && ls -la && \
              cd Samples/1_Utilities/deviceQuery && \
              make && ./deviceQuery && sleep 6000
          resources:
            limits:
              nvidia.com/gpu: 8
      nodeSelector:
        gpu.nvidia.com/class: L40
        gpu.nvidia.com/model: L40
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

Troubleshooting autoscaling behavior

Problem: Nodes don't scale up.

  • Potential issue: Node Pool created with minNodes: 0. When a Node Pool is created with minNodes: 0, there are initially no Nodes present in the pool. The Autoscaler requires at least one Node (targetNodes: 1) so it can cache the "shape" (resource characteristics) of the Node. This cache is necessary for the Autoscaler to determine whether Pods can be scheduled onto the Node Pool in the future.

  • Potential issue: CKS removes a Node, causing the Autoscaler to attempt to schedule a new Node with the wrong "shape". Occasionally, the Autoscaler may cache a Node that has been "tainted" (marked unschedulable) by CoreWeave automation. The cached taint can cause the Autoscaler to incorrectly think it cannot scale up the pool, even if new scheduling requests exist.

  • Suggested fix: Manually set targetNodes to 1. This triggers a Node to be added, updating the cache (or clearing a bad cache entry) and causing a new Node to get scheduled.

Problem: Nodes don't scale down.

  • Potential issue: Konnectivity Agent replica scheduling. CKS expects two replicas of the Konnectivity Agent to run for network connectivity. These agent Pods can block the Node Pool from scaling down to zero, or conversely, can trigger unexpected scale-ups if there are resource needs in the pool being autoscaled.

  • Suggested fix: Follow the instructions in the Scale-to-zero section to create a Node Pool for the Konnectivity replicas to run on.
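
As a sketch, one way to apply the suggested fix for scale-up issues is to set targetNodes in your existing Node Pool manifest and re-apply it. The example below reuses example-nodepool from earlier; adjust the name and other fields to match your Node Pool:

apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool
spec:
  # NOTE: OTHER FIELDS NOT SHOWN
  targetNodes: 1 # Triggers a Node to be added so the Autoscaler can cache its shape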