# Workload scheduling on CKS

> Control where CKS deploys Pods through namespaces, labels, and taints

In <Tooltip tip="CoreWeave Kubernetes Service (CKS) is CoreWeave's managed Kubernetes service." cta="Learn more" href="/glossary#coreweave-kubernetes-service-cks">CoreWeave Kubernetes Service (CKS)</Tooltip>, namespaces, labels, taints, and eviction policies organize Nodes and the workloads that run on them. These features control where CoreWeave schedules its core managed services and ensure that customer workloads always run on healthy production Nodes.

This page is a reference for cluster operators and workload owners who need to understand how CKS schedules Pods, which labels and taints CoreWeave reserves, and how to choose an eviction strategy that matches a workload's tolerance for interruption.

## CKS namespaces

CKS has two types of namespaces:

* **User namespaces** carry your Org ID label. You have full control over user namespaces. You can create, change, and delete them.
* **Control Plane namespaces** host critical services that run within the cluster. CoreWeave creates these namespaces. Don't alter or delete them. CoreWeave automates workloads within these managed namespaces and doesn't bill customers for jobs that run in Control Plane namespaces.

CKS applies the label `ns.coreweave.cloud/org=control-plane` to all Control Plane namespaces. To view these namespaces in a CKS cluster:

```bash theme={"system"}
kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane
```

## Node type selection labels

Every CoreWeave [GPU](/platform/instances/gpu-instances) and [CPU](/platform/instances/cpu-instances) Node has an Instance ID. For consistency, every Node within a Node Pool carries its Instance ID in the `node.kubernetes.io/instance-type` label. For example:

```yaml theme={"system"}
node.kubernetes.io/instance-type=instance-type-example
```

For a list of all Instance IDs, see the [GPU Instance](/platform/instances/gpu-instances) and [CPU Instance](/platform/instances/cpu-instances) details.

<Note>
  Some customers may still use the `node.kubernetes.io/type` label. CoreWeave updated the value of this label to reference the new Instance ID.
</Note>
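
As a minimal sketch, a Pod can target a specific Instance ID with a standard Kubernetes `nodeSelector` on this label. The Pod name and container image below are hypothetical placeholders, and `instance-type-example` stands in for a real Instance ID from the lists linked above.

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: instance-type-demo                    # hypothetical name
spec:
  nodeSelector:
    # Replace the value with a real Instance ID
    node.kubernetes.io/instance-type: instance-type-example
  containers:
  - name: main
    image: registry.example.com/app:latest    # hypothetical image
```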

## Node lifecycle labels

CKS uses labels to identify a Node's state in [the Node lifecycle](/platform/fleet-management/node-lifecycle). These labels ensure that CKS always schedules customer workloads on healthy production Nodes.

* `node.coreweave.cloud/state`: Identifies the [Node's lifecycle state](/platform/fleet-management/node-lifecycle), such as `production`, `zap`, or `triage`.
* `node.coreweave.cloud/reserved`: Identifies the workload type running on the Node:
  * If `/reserved` is a customer Org ID and `/state=production`, the Node is available for user workloads.
  * If `/state` is not `production`, then `/reserved` matches `/state`.
* `node.coreweave.cloud/reserved-desired`: Overrides `/reserved`. If its value doesn't match `/reserved`, the Node is pending a reservation change and transitions automatically.
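
For illustration only, the rules above imply label combinations like the following in a Node's metadata. `[ORG-ID]` is a placeholder for a customer Org ID, and the values are assumptions derived from the descriptions in this section, not output captured from a cluster.

```yaml theme={"system"}
# Production Node available for a customer's workloads
metadata:
  labels:
    node.coreweave.cloud/state: production
    node.coreweave.cloud/reserved: "[ORG-ID]"   # placeholder, not a literal value
---
# Node outside production: the reserved label mirrors the state label
metadata:
  labels:
    node.coreweave.cloud/state: triage
    node.coreweave.cloud/reserved: triage
```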

## User-provided labels

Customers may create custom Node labels for scheduling or organization, but never in the `*.coreweave.cloud` or `*.coreweave.com` namespaces. CKS rejects attempts to do so.

```yaml theme={"system"}
metadata:
  labels:
    foo.coreweave.cloud/bar: "true"   # Not allowed
    foo.coreweave.com/bar: "true"     # Not allowed
```
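
By contrast, a label under a prefix outside those reserved domains is acceptable. The sketch below uses the hypothetical `example.com` prefix with a made-up key and value.

```yaml theme={"system"}
metadata:
  labels:
    example.com/team: "ml-research"   # Allowed: prefix is outside the reserved CoreWeave domains
```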

## Pod interruption and eviction policies

CKS supports three eviction strategies for Pods: **non-interruptible**, **interruptible**, and **gracefully interruptible**. These strategies determine how CKS handles Pods during Node maintenance, reboots, or scale-down events.

### Summary of eviction strategies

| Strategy                     | Pod label                                  | Description                                                                                                                                                                                                                                                                                          | Note                                                                                          |
| ---------------------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| **Non-interruptible**        | None (default)                             | CKS doesn't proceed with maintenance or scale-down while these Pods are running. Use for critical training jobs and single-instance stateful apps.                                                                                                                                                   | Default behavior for Pods, ensuring stability and reliability.                                |
| **Interruptible**            | `qos.coreweave.cloud/interruptable`        | CKS terminates Pods and proceeds with the Node action without waiting for their full `terminationGracePeriodSeconds`. Use for stateless workloads that you can restart without data loss.                                                                                                            | Misspelled as `interruptable` for historical reasons. In the `qos.coreweave.cloud` namespace. |
| **Gracefully interruptible** | `qos.coreweave.com/graceful-interruptible` | CKS blocks the Node action, including reboots, until the Pod terminates or its `terminationGracePeriodSeconds` expires. Use *only* for stateful applications that block rebooting until the application terminates, such as databases and local storage. For all other workloads, use interruptible. | Spelled correctly. In the `qos.coreweave.com` namespace.                                      |

The following sections describe each strategy in detail.

### Non-interruptible

Workloads that you don't want interrupted, such as critical training jobs and single-instance stateful apps, shouldn't set either of the two labels in the preceding table. If neither label is set, the Pod is considered non-interruptible.

CKS treats Nodes with non-interruptible Pods as active. This means CKS doesn't proceed with Node maintenance, reboots, or scale-down while these Pods are running. CKS evicts non-interruptible Pods only when an extreme event occurs, such as complete Node failure or data center power loss. This is the default behavior for Pods.

### Long-running GPU Pods

A Pod that holds GPU resources (`nvidia.com/gpu`) indefinitely without an interruptible label has two consequences:

* **Deferred maintenance stalls**: CKS holds certain maintenance operations, such as removing a Node with elevated InfiniBand link flaps, until the Node becomes idle. If the Pod never stops, those operations never run. Critical hardware failures still trigger immediate action regardless of Pod state.
* **Prevents active health checks**: CoreWeave runs [hourly GPU health checks](/platform/fleet-management/node-lifecycle/day2#active-health-checks) on Nodes where all GPUs are available. A long-running Pod that holds GPUs prevents health checks from scheduling on that Node for the duration of the Pod's lifetime.

To allow deferred maintenance to proceed, apply the [`graceful-interruptible`](#gracefully-interruptible) or [`interruptable`](#interruptible) label to Pods that can tolerate interruption. CKS excludes Pods with these labels when determining Node idle status, so the Node can receive maintenance operations.

To allow health checks to run on a Node alongside a long-running GPU Pod, the Pod must not hold any GPUs. SUNK addresses this by introducing a separate `sunk.coreweave.com/accelerator` resource for Slurm Pods, leaving the Node's GPUs available for health checks. See [SUNK-specific scheduling](#sunk-specific-scheduling) for details.

### Interruptible

Workloads such as inference Pods or stateless applications that you can restart without data loss should use the interruptible strategy. CKS terminates these Pods and proceeds with the Node action (such as reboot or scale-down) without waiting for the Pod's full `terminationGracePeriodSeconds`.

<Warning>
  Never apply this label to distributed training jobs, single-instance databases, stateful services, or any workload that can't tolerate sudden termination. Evicting these workloads may cause data loss or require costly restarts across multiple Nodes.
</Warning>

To choose this strategy, apply the label `qos.coreweave.cloud/interruptable: "true"` to your Pods. This label is in the `qos.coreweave.cloud` namespace.

```yaml theme={"system"}
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"
```

Note that for historical reasons, the label is misspelled as `interruptable`.
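
Because the label must be on the Pod itself, workloads managed by a controller need it in the Pod template rather than only in the controller's own metadata. The following is a minimal sketch for a Deployment. The names, replica count, and image are hypothetical.

```yaml theme={"system"}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-demo                # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-demo
  template:
    metadata:
      labels:
        app: inference-demo
        # Pod-level label. Note the "interruptable" spelling.
        qos.coreweave.cloud/interruptable: "true"
    spec:
      containers:
      - name: server
        image: registry.example.com/inference:latest   # hypothetical image
```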

### Gracefully interruptible

Use the gracefully interruptible strategy only for stateful applications that block a reboot until the application terminates, such as databases and local storage. These workloads need to finish writing data or hand off state before a Node reboots or drains. For all other applications, including stateless workloads that can be restarted without data loss, use the [interruptible](#interruptible) strategy instead.

Unlike interruptible Pods, which CKS deletes immediately without waiting, gracefully interruptible Pods block the Node action (including force reboots) until the Pod terminates or its `terminationGracePeriodSeconds` expires. CKS sends a `DELETE` call to the Pod and waits for the full grace period before proceeding.

To choose this strategy, apply the label `qos.coreweave.com/graceful-interruptible: "true"` to your Pods. This label is in the `qos.coreweave.com` namespace.

```yaml theme={"system"}
metadata:
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
```
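
Because CKS waits for at most `terminationGracePeriodSeconds`, it helps to set that field explicitly alongside the label. Kubernetes defaults it to 30 seconds when unset. The sketch below assumes a database that needs up to 10 minutes to shut down cleanly. The Pod name, image, and grace period are hypothetical.

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: database-demo                        # hypothetical name
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
spec:
  # Assumed value: size this to the application's real shutdown time
  terminationGracePeriodSeconds: 600
  containers:
  - name: db
    image: registry.example.com/db:latest    # hypothetical image
```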

### Key behaviors and limitations

The following sections describe behaviors and limitations to consider when using `graceful-interruptible` Pods.

#### NodePool scale-down

When CKS selects Nodes for removal, it skips Nodes that host `graceful-interruptible` Pods. This means scale-down may stall if every Pod on candidate Nodes carries this label.

#### Tolerations that prevent graceful eviction

Pods that tolerate either `node.coreweave.cloud/evict=true:NoExecute` or `node.coreweave.cloud/reserved:NoExecute` don't go through the `graceful-interruptible` logic and may be evicted immediately. See [Eviction taints](#eviction-taints).

#### Drain time differences

* Reboots and maintenance use a default drain timeout of 3 minutes, and honor `terminationGracePeriodSeconds` for `graceful-interruptible` Pods.
* CKS-initiated scale-down doesn't wait for DaemonSets unless they carry `qos.coreweave.com/graceful-interruptible: "true"` and don't tolerate eviction taints.
* For services that require long termination phases, explicitly set `terminationGracePeriodSeconds` accordingly.

#### Potential for scale-down stalls

By design, CKS never removes a Node that contains only `graceful-interruptible` Pods. If every Pod on a candidate Node carries this label, CKS has nowhere to reclaim capacity and stalls while it waits for Nodes that can be safely drained. In practice, this can block automated scale-down workflows.

#### Risk of stuck Nodes

If you deploy workloads without accounting for `graceful-interruptible` semantics, Nodes can remain in a quasi-drained state indefinitely. For example, you may cordon a Node for maintenance, then find it never transitions to "Ready" again because every Pod refuses immediate eviction. Left unchecked, these Nodes consume capacity and can complicate rolling updates.

To mitigate these risks, follow these recommendations:

1. Plan deployment strategies to ensure some `interruptable` Pods exist as safe eviction candidates for CKS.
2. Monitor NodePool capacity and scheduling health. Set up alerts on stalled scale-down events or sustained high utilization to detect when `graceful-interruptible` Pods hold Nodes.
3. Establish maintenance procedures that include manual intervention steps (for example, draining and deleting problematic Nodes) as a fallback when automated processes can't reclaim resources.

Weigh the benefits of smoother in-place upgrades against these trade-offs to decide how and when to use `graceful-interruptible` without compromising cluster resilience or cost efficiency.

## Taints and tolerations

CKS uses taints to guard control-plane Nodes and enforce GPU and CPU scheduling. The following sections describe the eviction taints CKS applies to Nodes and the user-facing taints that route Pods to the correct hardware.

### Eviction taints

CKS applies the following eviction taints to Nodes:

* `node.coreweave.cloud/evict=true:NoExecute`
* `node.coreweave.cloud/reserved:NoExecute`

Pods that tolerate these taints bypass graceful eviction and may be evicted immediately.
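
When auditing manifests for this behavior, tolerations like the following are the ones to look for. This is a sketch based on standard Kubernetes toleration matching, not an exhaustive list. Each entry on its own would match at least one of the eviction taints.

```yaml theme={"system"}
spec:
  tolerations:
  # Explicitly tolerates the evict taint
  - key: node.coreweave.cloud/evict
    operator: Equal
    value: "true"
    effect: NoExecute
  # Tolerates the reserved taint regardless of its value
  - key: node.coreweave.cloud/reserved
    operator: Exists
    effect: NoExecute
  # Blanket toleration: an empty key with Exists matches every taint
  - operator: Exists
```

Blanket tolerations are easy to miss because they never name the CoreWeave taints directly.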

### User taints

Pods without GPU requests automatically tolerate the CPU taint (`is_cpu_compute:NoSchedule`).

```yaml title="CPU taint" theme={"system"}
  - effect: NoSchedule
    key: is_cpu_compute
```

The GPU taint (`is_gpu=true:PreferNoSchedule`) prevents CPU-only Pods from scheduling on GPU Nodes unless necessary. A CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

```yaml title="GPU taint" theme={"system"}
  - effect: PreferNoSchedule
    key: is_gpu
    value: "true"
```

<Danger>
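  Use caution when adding tolerations to your Pods. Tolerating taints that CKS manages can let workloads schedule on, or stay on, Nodes that CKS is trying to keep them away from, so add only the tolerations a workload needs to keep it on healthy Nodes.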
</Danger>

## SUNK-specific scheduling

The following sections describe scheduling behaviors specific to SUNK workloads on CKS.

### The SUNK `/lock` taint

To prevent contention with other Pods that request GPU access while long-running `slurmd` Pods are active, SUNK adds a new GPU resource to Kubernetes, `sunk.coreweave.com/accelerator`, in addition to the `nvidia.com/gpu` resource provided by NVIDIA's plugin.

Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which lets Slurm Pods request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.

SUNK manages the contention with a taint called `sunk.coreweave.com/lock`. SUNK applies this taint to Nodes through a call to `slurm-syncer` during the Prolog phase.

```yaml title="SUNK's lock taint" theme={"system"}
  - effect: NoExecute
    key: sunk.coreweave.com/lock
    value: "true"
```

<Warning>
  Prolog completion blocks until SUNK evicts every Pod that doesn't tolerate the taint.
</Warning>

### DaemonSets on SUNK Nodes

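[Kubernetes DaemonSets](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) that run on SUNK Nodes must tolerate the `sunk.coreweave.com/lock` taint, as well as `is_cpu_compute`, `is_gpu`, and `node.coreweave.cloud/reserved`. The following example shows tolerations for the lock, CPU, and GPU taints. Add a matching toleration for `node.coreweave.cloud/reserved` as well: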

```yaml title="Example toleration" lines highlight={3-11} theme={"system"}
spec:
  tolerations:
  - key: sunk.coreweave.com/lock
    value: "true"
    operator: Equal

  - key: is_cpu_compute
    operator: Exists

  - key: is_gpu
    operator: Exists
```

## Scale down workloads

To scale down Pods in a specific order, use the Cloud Console or CKS API to adjust the cluster specification. For more information, see the [Kubernetes guide about Pod deletion cost](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost).
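
The linked Kubernetes guide describes the `controller.kubernetes.io/pod-deletion-cost` annotation, which a ReplicaSet uses on a best-effort basis to prefer deleting lower-cost Pods first. A minimal sketch of the annotation on a Pod follows. The cost value is an arbitrary example, and this page doesn't state how CKS components treat the annotation.

```yaml theme={"system"}
metadata:
  annotations:
    # Integer passed as a string. Pods with a lower cost are preferred for deletion.
    controller.kubernetes.io/pod-deletion-cost: "-100"
```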

## Troubleshoot Pod scheduling

If a GPU Pod stays `Pending`, see [Why are my Pods not scheduling on GPU Nodes?](/support/cks/articles/why-are-my-pods-not-scheduling-on-gpu-nodes) for common causes and fixes.

### Resource requests for GPU Pods

Size CPU and memory `requests` to roughly the per-GPU share of the instance, leaving headroom for DaemonSets and system overhead. Specify GPUs under `limits` as `nvidia.com/gpu: [COUNT]`. Kubernetes mirrors the value to `requests` automatically. For a worked whole-Node example, see [Target specific GPUs or CPUs](/products/cks/nodes/manage#target-specific-gpus-or-cpus).

For long-running training and inference, set CPU and memory `requests` equal to `limits` so the Pod gets the Guaranteed Quality-of-Service (QoS) class and is the last to be evicted under resource pressure. See [How do Kubernetes QoS classes work?](/support/cks/articles/how-do-kubernetes-qos-classes-work).
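
The two recommendations above can be sketched together as follows. The container name, image, and CPU and memory figures are hypothetical; size them to the per-GPU share of your instance. CPU and memory `requests` equal `limits`, and the GPU count appears only under `limits`.

```yaml theme={"system"}
spec:
  containers:
  - name: trainer                                   # hypothetical name
    image: registry.example.com/train:latest        # hypothetical image
    resources:
      requests:
        cpu: "16"          # assumed figure
        memory: 128Gi      # assumed figure
      limits:
        cpu: "16"          # equal to requests, as Guaranteed QoS requires
        memory: 128Gi
        nvidia.com/gpu: 1  # Kubernetes mirrors this value to requests
```

Guaranteed QoS applies only if every container in the Pod, including sidecars, sets CPU and memory `requests` equal to `limits`.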

For recommended ratios and how requests behave on GPU Pods, see [What are the recommended resource requests and limits for GPU Pods?](/support/cks/articles/what-are-the-recommended-resource-requests-and-limits-for-gpu-pods) and [How do CPU and memory requests work with GPU Pods?](/support/cks/articles/how-do-cpu-and-memory-requests-work-with-gpu-pods).
