# Kueue

> Install and configure Kueue for Kubernetes-native job queuing on CKS using CoreWeave's Helm chart

| Chart reference       | Description                                                |
| --------------------- | ---------------------------------------------------------- |
| `coreweave/cks-kueue` | CoreWeave's Helm chart for deploying Kueue on CKS clusters |

## About Kueue

[Kueue](https://kueue.sigs.k8s.io/docs/overview/) is a Kubernetes-native system that manages jobs using quotas. Kueue makes job decisions based on resource availability, job priorities, and the quota policies defined in your cluster queues. Kueue determines when a job waits for available resources, when a job starts (Pods created), and when a job is preempted (active Pods deleted).

Use Kueue on CKS to prioritize batch, AI, and ML workloads, share GPU capacity fairly across teams, and reduce idle time on expensive accelerators.

CKS supports Kueue by default. To simplify getting started, CoreWeave provides a Helm chart for installing Kueue. The `cks-kueue` chart also includes a `kueue` subchart, used to configure Kueue for deployment into your CKS cluster.

<Warning>
  Always install Kueue with CoreWeave's `cks-kueue` chart. Upstream Kueue registers admission webhooks that intercept system and infrastructure Pods, which can deadlock the cluster and prevent it from starting. The `cks-kueue` chart configures the webhooks to exclude these Pods, so the cluster can always start.
</Warning>

<Info>
  When you install Kueue through the CoreWeave Helm chart, Kueue metrics are automatically scraped and ingested into the [Kueue Metrics Dashboard in CoreWeave Grafana](/observability/managed-grafana/kubernetes/kueue-metrics).
</Info>

## Usage

Install the `cks-kueue` Helm chart to deploy Kueue into your CKS cluster. The chart installs the Kueue controller and CRDs so you can begin defining queues and submitting workloads.

Add the CoreWeave Helm repo so Helm can locate the `cks-kueue` chart.

```bash theme={"system"}
helm repo add coreweave https://charts.core-services.ingress.coreweave.com
```

Then, install Kueue on your CKS cluster.

```bash theme={"system"}
helm install kueue coreweave/cks-kueue --namespace=kueue-system --create-namespace
```

After the chart installs, the Kueue controller runs in the `kueue-system` namespace and the Kueue CRDs are available for use in your cluster.
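As a quick check, assuming `kubectl` is configured for your CKS cluster, the following commands list the controller Pods and the Kueue CRDs. The exact Pod names and the set of CRDs depend on the chart version, so treat this as a sketch rather than expected output.

```bash
# List the Kueue controller Pods in the namespace used by the install command.
kubectl get pods --namespace=kueue-system

# List the CRDs in the kueue.x-k8s.io API group.
kubectl get crds | grep kueue.x-k8s.io
```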

## Sample Kueue configuration

After installing the `cks-kueue` chart, use the following sample configuration to set up a basic Kueue environment for CKS. This configuration includes several key Kueue components:

* **`ResourceFlavor`**: Defines the characteristics of compute resources (CPU, memory, GPUs) available in your cluster.
* **`ClusterQueue`**: Establishes resource quotas and admission policies across your entire cluster.
* **`LocalQueue`**: Creates namespaced queues that reference a `ClusterQueue` for job submission.
* **`WorkloadPriorityClass`**: Defines priority levels for jobs to determine scheduling order and preemption behavior.

The configuration also defines two priority classes for different job types: production jobs with high priority and development jobs with lower priority.

```yaml theme={"system"}
# ResourceFlavor defines the compute resources available in your cluster
# This flavor represents the standard CKS Node configuration
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
# ClusterQueue establishes resource quotas and admission policies
# This queue allows jobs to consume up to the specified resource limits
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  # Enable preemption of lower priority jobs when higher priority jobs need resources
  preemption:
    withinClusterQueue: LowerPriority
  # Allow jobs from all namespaces to use this queue
  namespaceSelector: {} # Match all namespaces.
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
      flavors:
        - name: "default-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 254          # Total CPU cores available
            - name: "memory"
              nominalQuota: 2110335488Ki # Total memory available (~2TB)
            - name: "nvidia.com/gpu"
              nominalQuota: 16           # Total GPUs available
            - name: "rdma/ib"
              nominalQuota: 12800        # Total number of RDMA Nodes available
---
# LocalQueue creates a namespaced queue for job submission
# Jobs submitted to this queue will use the cluster-queue resources
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "default"
spec:
  clusterQueue: "cluster-queue"
---
# WorkloadPriorityClass defines priority levels for job scheduling
# Higher values = higher priority (jobs with higher priority can preempt lower priority jobs)
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-priority
value: 1000
description: "Priority class for prod jobs"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: dev-priority
value: 100
description: "Priority class for development jobs"
---
```
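To illustrate how a workload uses these resources, the following sketch writes a Job manifest that targets the `default` `LocalQueue` and the `prod-priority` class through Kueue's `kueue.x-k8s.io/queue-name` and `kueue.x-k8s.io/priority-class` labels. The Job name, file name, image, and resource requests are placeholders chosen for illustration, not values supplied by the chart.

```bash
# Write a hypothetical Job manifest; the name, image, and requests are illustrative.
cat > sample-prod-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-prod-job   # hypothetical name
  namespace: default
  labels:
    # LocalQueue from the sample configuration
    kueue.x-k8s.io/queue-name: default
    # WorkloadPriorityClass from the sample configuration
    kueue.x-k8s.io/priority-class: prod-priority
spec:
  parallelism: 1
  completions: 1
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sleep", "30"]
          resources:
            requests:
              cpu: "1"
              memory: "1Gi"
      restartPolicy: Never
EOF
```

Submit it with `kubectl apply -f sample-prod-job.yaml`, then run `kubectl get workloads --namespace=default` to see whether Kueue has admitted it.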

<Info>
  For more information, see [Using Kueue with Ray on CKS](/products/cks/clusters/frameworks/ray-with-kueue).
</Info>

## Observability

After the chart is installed, use the [Kueue Metrics Dashboard](/observability/managed-grafana/kubernetes/kueue-metrics) in CoreWeave Grafana to monitor Kueue activity in your cluster.

## Topology-aware scheduling

Topology-aware scheduling (TAS) lets Kueue improve scheduling decisions by considering the physical topology of your cluster's Nodes. This is important for HPC, AI, and ML workloads, where network latency between Nodes can be a performance bottleneck. TAS can co-locate a job's Pods to minimize communication overhead and maximize performance.

The `TopologyAwareScheduling` feature in the Kueue controller is enabled by default. However, to use it, you must adjust some of the Kueue resources so that Kueue references the Node labels that describe your cluster's topology.

After the Helm chart is installed and the Kueue CRDs exist, define the topologies that match your instance types. Each topology is built from CKS Node labels:

* The `infiniband` topology is for instance types that are part of InfiniBand fabrics. See [About GPU instances](../../../../platform/instances/gpu-instances) to find InfiniBand-connected instances.
* The `multinode-nvlink-ib` topology extends the `infiniband` topology to also include instance types with rack-scale NVLink. See [About GPU instances](../../../../platform/instances/gpu-instances) to find NVLink-connected rack instances.
* The `hostname` topology is for instance types without InfiniBand fabrics. It prevents Kueue from admitting workloads when total capacity is sufficient but fragmented across Nodes.

```bash theme={"system"}
helm upgrade kueue coreweave/cks-kueue --namespace=kueue-system --values - <<EOF
topologies:
  - name: infiniband
    levels:
      - backend.coreweave.cloud/fabric
      - backend.coreweave.cloud/superpod
      - backend.coreweave.cloud/leafgroup
      - kubernetes.io/hostname
  - name: multinode-nvlink-ib
    levels:
      - backend.coreweave.cloud/fabric
      - backend.coreweave.cloud/superpod
      - backend.coreweave.cloud/leafgroup
      - ds.coreweave.com/nvlink.domain
      - kubernetes.io/hostname
  - name: hostname
    levels:
      - kubernetes.io/hostname
EOF
```

After you upgrade the Helm chart, the new `Topology` CRs appear in the cluster.

The following example adapts the earlier sample Kueue configuration. Each `ResourceFlavor` references a `Topology` resource, each `ClusterQueue` uses one of those flavors, and each `LocalQueue` points at a `ClusterQueue`.

Each `ResourceFlavor` must select a disjoint set of Nodes. Use `nodeLabels` so that no Node matches more than one flavor. Overlapping selectors cause a Node to belong to multiple flavors, which leads to ambiguous quota accounting and unpredictable scheduling.

Match each `topologyName` to a Node pool whose hardware actually provides that topology. For example, pair the `infiniband` topology with an InfiniBand-connected pool like B200, the `multinode-nvlink-ib` topology with a rack-scale NVLink pool like GB200, and the `hostname` topology with pools that lack InfiniBand.

<Warning>
  Referencing a topology whose labels are not present on the selected Nodes causes Kueue to fail to admit workloads for that flavor. Verify that every label in the `Topology` `levels` list exists on the Nodes selected by the flavor's `nodeLabels`.
</Warning>
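One way to run this check, assuming `kubectl` access and a Node pool named `b200-nodepool` as in the example that follows, is to print the topology labels as columns. An empty column means the Node is missing that label.

```bash
# Show the infiniband topology labels for the Nodes in one pool.
# Replace b200-nodepool with your own Node pool name.
kubectl get nodes \
  --selector=compute.coreweave.com/node-pool=b200-nodepool \
  --label-columns=backend.coreweave.cloud/fabric,backend.coreweave.cloud/superpod,backend.coreweave.cloud/leafgroup
```

For the `multinode-nvlink-ib` topology, add `ds.coreweave.com/nvlink.domain` to the column list.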

```yaml theme={"system"}
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: b200-flavor
spec:
  topologyName: infiniband # References the infiniband Topology CR
  nodeLabels:
    compute.coreweave.com/node-pool: b200-nodepool
---
# This flavor enables topology-aware scheduling across NVLink domains
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gb200-flavor
spec:
  topologyName: multinode-nvlink-ib # References the multinode-nvlink-ib Topology CR
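  nodeLabels:
    # nodeLabels is a map, so each label key can appear only once.
    # To cover a second pool (for example, gb200-nodepool-2), define another ResourceFlavor.
    compute.coreweave.com/node-pool: gb200-nodepool-1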
---
# ClusterQueue for InfiniBand-connected workloads
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "b200-queue"
spec:
  preemption:
    withinClusterQueue: LowerPriority
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
      flavors:
        - name: "b200-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 2048 # 16 Nodes * 128 vCPU per Node
            - name: "memory"
              nominalQuota: 34359738368Ki # 16 Nodes * 2Ti per Node = 32Ti
            - name: "nvidia.com/gpu"
              nominalQuota: 128 # 16 Nodes * 8 GPUs per Node
            - name: "rdma/ib"
              nominalQuota: 12800
---
# ClusterQueue for GB200 workloads with multinode NVLink
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gb200-queue"
spec:
  preemption:
    withinClusterQueue: LowerPriority
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "rdma/ib"]
      flavors:
        - name: "gb200-flavor"
          resources:
            - name: "cpu"
              nominalQuota: 2304 # 16 Nodes * 144 vCPU per Node
            - name: "memory"
              nominalQuota: 15000000000Ki # 16 Nodes * 960 GB per Node = 15.36 TB
            - name: "nvidia.com/gpu"
              nominalQuota: 64 # 16 Nodes * 4 GPUs per Node
            - name: "rdma/ib"
              nominalQuota: 12800
---
# LocalQueue for InfiniBand workloads in the default namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "b200-local"
spec:
  clusterQueue: "b200-queue"
---
# LocalQueue for GB200 workloads in the default namespace
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "gb200-local"
spec:
  clusterQueue: "gb200-queue"
---
# WorkloadPriorityClass definitions (same as basic example)
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: prod-priority
value: 1000
description: "Priority class for prod jobs"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: dev-priority
value: 100
description: "Priority class for development jobs"
---
```
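The `nominalQuota` comments above multiply a Node count by per-Node capacity. As a sketch of that arithmetic for the `b200-queue` values, assuming the 16 Nodes with 128 vCPUs, 8 GPUs, and 2Ti of memory each that the comments state (the variable names are illustrative):

```bash
nodes=16
cpu_per_node=128        # vCPUs per Node
gpus_per_node=8         # GPUs per Node
mem_per_node_gi=2048    # 2Ti per Node, expressed in Gi

cpu_quota=$((nodes * cpu_per_node))
gpu_quota=$((nodes * gpus_per_node))
# 1Gi = 1024 * 1024 Ki
mem_quota_ki=$((nodes * mem_per_node_gi * 1024 * 1024))

echo "cpu=${cpu_quota} nvidia.com/gpu=${gpu_quota} memory=${mem_quota_ki}Ki"
```

Substitute your own Node count and instance specifications to size the quotas for your cluster.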

### Example jobs with topology constraints

Kueue provides several annotations for expressing topology constraints on a job. Two of the most common are:

* `kueue.x-k8s.io/podset-required-topology` requires that all Pods in a job are scheduled within the same topology domain. The job stays pending until a domain can fit it.
* `kueue.x-k8s.io/podset-preferred-topology` treats the topology domain as best effort. Kueue tries to fit all Pods within the same domain, but falls back to spreading across domains if needed.

For the full list of annotations and their semantics, see the [Kueue Topology Aware Scheduling user-facing APIs](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/#user-facing-apis).
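As a sketch of the best-effort variant, the following writes a Job manifest that uses `kueue.x-k8s.io/podset-preferred-topology` with the `b200-local` queue defined earlier. The Job name and file name are placeholders, and the resource requests simply mirror the other examples on this page.

```bash
# Write a hypothetical Job manifest that prefers, but does not require, one leafgroup.
cat > preferred-tas-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-preferred   # hypothetical name
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: b200-local
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        # Best effort: try to fit all Pods in one leafgroup, spread if needed.
        kueue.x-k8s.io/podset-preferred-topology: "backend.coreweave.cloud/leafgroup"
    spec:
      containers:
        - name: training
          image: busybox
          command: ["sleep", "30s"]
          resources:
            requests:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
              rdma/ib: "1"
            limits:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
              rdma/ib: "1"
      restartPolicy: Never
EOF
```

Submit it with `kubectl apply -f preferred-tas-job.yaml`.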

#### Four Pods on one leafgroup with the InfiniBand queue

This example schedules four Pods within a single leafgroup:

```yaml theme={"system"}
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: b200-local
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "backend.coreweave.cloud/leafgroup"
    spec:
      containers:
        - name: training
          image: busybox
          command: ["sleep", "30s"]
          resources:
            requests:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
              rdma/ib: "1"
            limits:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "8"
              rdma/ib: "1"
      restartPolicy: Never
```

#### Four Pods on one NVLink domain with the GB200 queue

This example schedules four Pods within a single NVLink domain for GB200 Nodes:

```yaml theme={"system"}
apiVersion: batch/v1
kind: Job
metadata:
  name: gb200-test-tas
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: gb200-local
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      annotations:
        kueue.x-k8s.io/podset-required-topology: "ds.coreweave.com/nvlink.domain"
    spec:
      containers:
        - name: training
          image: your-training-image:latest
          resources:
            requests:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "4"
              rdma/ib: "1"
            limits:
              cpu: "32"
              memory: "256Gi"
              nvidia.com/gpu: "4"
              rdma/ib: "1"
      restartPolicy: Never
```
