Kueue
Run Kueue on CKS
| Chart reference | Description |
|---|---|
| `coreweave/cks-kueue` | CoreWeave's Helm chart for deploying Kueue on CKS clusters |
About Kueue
Kueue is a Kubernetes-native system that manages jobs using quotas. Kueue makes job decisions based on resource availability, job priorities, and the quota policies defined in your cluster queues. Kueue can determine when a job should wait for available resources, when a job should start (Pods created), and when a job should be preempted (active Pods deleted).
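For example, a batch Job opts into Kueue by referencing a LocalQueue through the `kueue.x-k8s.io/queue-name` label; Kueue then keeps the Job suspended until quota is available and unsuspends it to start its Pods. The sketch below is illustrative only: the Job name, image, and the `user-queue` LocalQueue are hypothetical placeholders (the sample configuration later on this page creates a LocalQueue named `default`).

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                        # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # hypothetical LocalQueue in this namespace
spec:
  suspend: true   # Kueue flips this to false once the workload is admitted
  template:
    spec:
      containers:
      - name: main
        image: busybox          # placeholder image
        command: ["sleep", "30"]
      restartPolicy: Never
```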
CKS supports Kueue out of the box. To make it as easy as possible to get started, CoreWeave provides a Helm chart for installing Kueue. The `cks-kueue` chart also includes a `kueue` subchart, which configures Kueue for deployment into your CKS cluster.
When you install Kueue through our Helm chart, Kueue metrics are automatically scraped and ingested into the Kueue Metrics Dashboard in CoreWeave Grafana.
Usage
Add the CoreWeave Helm repo.
```bash
helm repo add coreweave https://charts.core-services.ingress.coreweave.com
```
Then, install Kueue on your CKS cluster.
```bash
helm install kueue coreweave/cks-kueue --namespace=kueue-system --create-namespace
```
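Before moving on, you can confirm that the Kueue controller is running and its CRDs are installed (this assumes `kubectl` is already configured for your CKS cluster):

```bash
# Check that the Kueue controller Pod is running
kubectl get pods --namespace=kueue-system

# Confirm the Kueue CRDs (ClusterQueue, LocalQueue, ResourceFlavor, and so on) exist
kubectl get crds | grep kueue.x-k8s.io
```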
Sample Kueue configuration
After installing the cks-kueue chart, use the following sample configuration to set up a basic Kueue environment for CKS. This configuration includes several key Kueue components:
- **ResourceFlavor**: Defines the characteristics of compute resources (CPU, memory, GPUs) available in your cluster
- **ClusterQueue**: Establishes resource quotas and admission policies across your entire cluster
- **LocalQueue**: Creates namespaced queues that reference a ClusterQueue for job submission
- **WorkloadPriorityClass**: Defines priority levels for jobs to determine scheduling order and preemption behavior
The configuration also defines two priority classes for different job types: production jobs with high priority and development jobs with lower priority.
```yaml
# ResourceFlavor defines the compute resources available in your cluster
# This flavor represents the standard CKS node configuration
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# ClusterQueue establishes resource quotas and admission policies
# This queue allows jobs to consume up to the specified resource limits
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  # Enable preemption of lower priority jobs when higher priority jobs need resources
  preemption:
    withinClusterQueue: LowerPriority
  # Allow jobs from all namespaces to use this queue
  namespaceSelector: {} # Match all namespaces.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 254 # Total CPU cores available
      - name: "memory"
        nominalQuota: 2110335488Ki # Total memory available (~2TB)
      - name: "nvidia.com/gpu"
        nominalQuota: 16 # Total GPUs available
      - name: "rdma/ib"
        nominalQuota: 12800 # Total number of RDMA Nodes available
---
# LocalQueue creates a namespaced queue for job submission
# Jobs submitted to this queue will use the cluster-queue resources
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "default"
spec:
  clusterQueue: "cluster-queue"
---
# WorkloadPriorityClass defines priority levels for job scheduling
# Higher values = higher priority (jobs with higher priority can preempt lower priority jobs)
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-priority
value: 1000
description: "Priority class for prod jobs"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: dev-priority
value: 100
description: "Priority class for development jobs"
```
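One way to apply this configuration is to save it to a file and create the objects with `kubectl`; the file name below is just an example. To give a specific workload one of the priority classes defined above, set the `kueue.x-k8s.io/priority-class` label (for example, `kueue.x-k8s.io/priority-class: prod-priority`) on the Job alongside its `kueue.x-k8s.io/queue-name` label.

```bash
# Apply the sample configuration (saved locally as kueue-config.yaml)
kubectl apply -f kueue-config.yaml

# Verify the cluster-scoped and namespaced queue objects were created
kubectl get resourceflavors,clusterqueues
kubectl get localqueues --namespace=default
```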
Observability
CoreWeave Grafana provides a Kueue Metrics Dashboard which you can use to monitor your Kueue cluster.
Topology Aware Scheduling (TAS)
Topology-Aware Scheduling allows Kueue to make smarter scheduling decisions by considering the physical topology of your cluster's Nodes. This is important for HPC and AI/ML workloads, where network latency between Nodes can be a performance bottleneck. TAS can co-locate a job's Pods to minimize communication overhead and maximize performance.
The TopologyAwareScheduling feature in the Kueue controller is enabled by default. However, to use it, you need to adjust some of the Kueue resources.
Once the Helm chart is installed and the Kueue CRDs exist, set the following values to create topologies based on CKS Node labels for Kueue to use:
```bash
helm upgrade kueue coreweave/cks-kueue --namespace=kueue-system --values - <<EOF
topologies:
- name: infiniband
  levels:
  - backend.coreweave.cloud/fabric
  - backend.coreweave.cloud/leafgroup
- name: multinode-nvlink-ib
  levels:
  - backend.coreweave.cloud/fabric
  - backend.coreweave.cloud/leafgroup
  - ds.coreweave.com/nvlink.domain
EOF
```
- The `infiniband` topology is for instance types that are part of InfiniBand fabrics, such as the H100, H200, and B200.
- The `multinode-nvlink-ib` topology extends the `infiniband` topology to also include instance types with rack-scale NVLink, such as the GB200.
After the Helm chart is upgraded, you will see the new Topology CRs deployed in the cluster.
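One way to confirm this is to list the Topology objects directly; the resource name below assumes the Topology CRD lives in the same `kueue.x-k8s.io` API group as the other Kueue resources:

```bash
# List the Topology custom resources created from the Helm values above
kubectl get topologies.kueue.x-k8s.io
```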
The following example configuration is an adjustment of the one shown above. It demonstrates how to use the Topology resources by referencing them in ResourceFlavor resources, which are then used by ClusterQueue and LocalQueue resources.
```yaml
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: infiniband-flavor
spec:
  topologyName: infiniband # References the infiniband Topology CR
  nodeLabels:
    backend.coreweave.cloud/flavor: "infiniband"
---
# This flavor enables topology-aware scheduling across NVLINK domains
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gb200-flavor
spec:
  topologyName: multinode-nvlink-ib # References the multinode-nvlink-ib Topology CR
  nodeLabels:
    node.kubernetes.io/instance-type: gb200-4x
---
# ClusterQueue for infiniband-connected workloads
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "infiniband-queue"
spec:
  preemption:
    withinClusterQueue: LowerPriority
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
    flavors:
    - name: "infiniband-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 2048 # 16 nodes * 128 vCPU per node
      - name: "memory"
        nominalQuota: 34359738368Ki # 16 nodes * 2Ti per node = 32Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 128 # 16 nodes * 8 GPUs per node
      - name: "rdma/ib"
        nominalQuota: 12800
---
# ClusterQueue for GB200 workloads with multinode-NVLINK
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gb200-queue"
spec:
  preemption:
    withinClusterQueue: LowerPriority
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
    flavors:
    - name: "gb200-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 2304 # 16 nodes * 144 vCPU per node
      - name: "memory"
        nominalQuota: 15000000000Ki # 16 nodes * 960 GB per node = 15.36 TB
      - name: "nvidia.com/gpu"
        nominalQuota: 64 # 16 nodes * 4 GPUs per node
      - name: "rdma/ib"
        nominalQuota: 12800
---
# LocalQueue for infiniband workloads in the default namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "infiniband-local"
spec:
  clusterQueue: "infiniband-queue"
---
# LocalQueue for GB200 workloads in the default namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "gb200-local"
spec:
  clusterQueue: "gb200-queue"
---
# WorkloadPriorityClass definitions (same as basic example)
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-priority
value: 1000
description: "Priority class for prod jobs"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: dev-priority
value: 100
description: "Priority class for development jobs"
```
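Before pointing workloads at these queues, it can help to verify that your Nodes actually carry the labels the flavors select on. The label keys below are the ones referenced in the manifests above:

```bash
# Show the topology labels used by the infiniband and multinode-nvlink-ib Topologies
kubectl get nodes -L backend.coreweave.cloud/fabric -L backend.coreweave.cloud/leafgroup -L ds.coreweave.com/nvlink.domain

# Show each Node's instance type (matched by the gb200-flavor nodeLabels selector)
kubectl get nodes -L node.kubernetes.io/instance-type
```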
Example Jobs with Topology Constraints
You can use the `kueue.x-k8s.io/podset-required-topology` annotation to ensure that all Pods in a job are scheduled within the same topology domain.
Example: Four Pods on One Leafgroup (Infiniband Queue)
This example schedules four Pods within a single leafgroup:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: infiniband-local
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "backend.coreweave.cloud/leafgroup"
    spec:
      containers:
      - name: training
        image: busybox
        command: ["sleep", "30s"]
        resources:
          requests:
            cpu: "32"
            memory: "256Gi"
            nvidia.com/gpu: "8"
            rdma/ib: "1"
          limits:
            cpu: "32"
            memory: "256Gi"
            nvidia.com/gpu: "8"
            rdma/ib: "1"
      restartPolicy: Never
```
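After submitting the Job, you can watch Kueue admit the corresponding Workload and then check that all four Pods landed on Nodes in the same leafgroup:

```bash
# The Workload created for the Job shows whether it was admitted to infiniband-queue
kubectl get workloads --namespace=default

# Once the Pods are running, compare the leafgroups of the Nodes they were scheduled on
kubectl get pods --namespace=default -l job-name=test-tas-job -o wide
kubectl get nodes -L backend.coreweave.cloud/leafgroup
```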
Example: Four Pods on One NVLINK Domain (GB200 Queue)
This example schedules four Pods within a single NVLINK domain for GB200 Nodes:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gb200-test-tas
  labels:
    kueue.x-k8s.io/queue-name: gb200-local
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "ds.coreweave.com/nvlink.domain"
    spec:
      containers:
      - name: training
        image: your-training-image:latest
        resources:
          requests:
            cpu: "32"
            memory: "256Gi"
            nvidia.com/gpu: "4"
            rdma/ib: "1"
          limits:
            cpu: "32"
            memory: "256Gi"
            nvidia.com/gpu: "4"
            rdma/ib: "1"
      restartPolicy: Never
```