# Workload Scheduling on CKS

> Control where Pods are deployed through namespaces, labels, and taints

In <Tooltip tip="CoreWeave Kubernetes Service (CKS) is CoreWeave's managed Kubernetes service." cta="Learn more" href="/glossary#coreweave-kubernetes-service-cks">CoreWeave Kubernetes Service (CKS)</Tooltip>, Nodes are organized using namespaces, labels, taints, and eviction policies. These features control where CoreWeave's core managed services are scheduled and ensure that customer workloads always run on healthy production Nodes.

## CKS namespaces

CKS has two types of namespaces:

* **User namespaces** are labeled with your Org ID.
  * You have full control over user namespaces: you can create, change, and delete them.
* **Control Plane namespaces** are created by CoreWeave to host critical services that run within the cluster.
  * Do not alter or delete these namespaces. CoreWeave workloads are automated within these managed namespaces. Jobs that run in Control Plane namespaces are not billed to the customer.

CKS applies the label `ns.coreweave.cloud/org=control-plane` to all Control Plane namespaces. To view these namespaces in a CKS cluster:

```bash theme={"system"}
kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane
```

## Node type selection labels

All CoreWeave [GPU](/platform/instances/gpu-instances) and [CPU](/platform/instances/cpu-instances) Nodes feature **Instance IDs**. To ensure consistency, all Nodes within a Node Pool are tagged with an Instance ID using the `node.kubernetes.io/instance-type` label. For example:

```yaml theme={"system"}
node.kubernetes.io/instance-type=instance-type-example
```

For a list of all Instance IDs, see the [GPU Instance](/platform/instances/gpu-instances) and [CPU Instance](/platform/instances/cpu-instances) details.
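
As a minimal sketch, a Pod can target a specific Instance ID with a standard Kubernetes `nodeSelector` on this label. The Pod name, the image, and the `instance-type-example` value are placeholders; substitute a real Instance ID from the instance pages.

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: instance-type-demo                     # hypothetical name
spec:
  nodeSelector:
    # Placeholder Instance ID; replace with a real one
    node.kubernetes.io/instance-type: instance-type-example
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
```

A `nodeSelector` is a hard requirement: if no Node carries the requested Instance ID, the Pod stays `Pending`.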

<Note>
  Some customers may be using the `node.kubernetes.io/type` label. This label has been updated to reference the new Instance ID.
</Note>

## Node lifecycle labels

CKS uses labels to identify a Node's state in [the Node lifecycle](/platform/fleet-management/node-lifecycle). These labels ensure that customer workloads are always scheduled on healthy production Nodes.

* `node.coreweave.cloud/state`: Identifies the [Node's lifecycle state](/platform/fleet-management/node-lifecycle), such as `production`, `zap`, or `triage`.
* `node.coreweave.cloud/reserved`: Identifies the workload type running on the Node:
  * If `/reserved` is a customer Org ID and `/state=production`, it's for user workloads.
  * If `/state` **is not** `production`, then `/reserved` matches `/state`.
* `node.coreweave.cloud/reserved-desired`: Overrides `/reserved`. If it doesn't match `/reserved`, the Node is marked as pending and will transition reservations automatically.
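
To illustrate how these labels combine, the fragments below sketch the label sets that the rules above imply for two Nodes. The Org ID `example-org` is a placeholder, and real Nodes carry many additional labels.

```yaml theme={"system"}
# Node available for user workloads
metadata:
  labels:
    node.coreweave.cloud/state: production
    node.coreweave.cloud/reserved: example-org   # placeholder customer Org ID
---
# Node outside production: /reserved matches /state
metadata:
  labels:
    node.coreweave.cloud/state: triage
    node.coreweave.cloud/reserved: triage
```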

## User-provided labels

Customers may create custom Node labels for scheduling or organization, but never in the `*.coreweave.cloud` or `*.coreweave.com` namespaces. Attempts to do so are rejected by CKS.

```yaml theme={"system"}
metadata:
  labels:
    foo.coreweave.cloud/bar: "true"   # Not allowed
    foo.coreweave.com/bar: "true"     # Not allowed
```
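
By contrast, a label under a prefix that you own falls outside the reserved namespaces. The `example.com` prefix and the `team` key below are placeholders, not a CKS convention.

```yaml theme={"system"}
metadata:
  labels:
    example.com/team: "research"   # hypothetical customer-owned prefix
```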

## Pod interruption and eviction policies

CKS supports three eviction strategies for Pods: **non-interruptible**, **interruptible**, and **gracefully interruptible**. These strategies determine how CKS handles Pods during Node maintenance, reboots, or scale-down events.

### Summary of eviction strategies

| Strategy                     | Pod Label                                  | Description                                                                                                                                                                                | Note                                                                                          |
| ---------------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------- |
| **Non-interruptible**        | None (default)                             | CKS will not proceed with maintenance or scale-down while these Pods are running. Use for critical training jobs and single-instance stateful apps.                                        | Default behavior for Pods, ensuring stability and reliability.                                |
| **Interruptible**            | `qos.coreweave.cloud/interruptable`        | CKS terminates Pods and proceeds with the Node action without waiting for their full `terminationGracePeriodSeconds`. Use for stateless workloads that can be restarted without data loss. | Misspelled as `interruptable` for historical reasons; in the `qos.coreweave.cloud` namespace. |
| **Gracefully interruptible** | `qos.coreweave.com/graceful-interruptible` | CKS blocks the Node action until the Pod terminates or its `terminationGracePeriodSeconds` expires. Use for stateful applications that can handle graceful shutdowns.                      | Spelled correctly; in the `qos.coreweave.com` namespace.                                      |

See the sections below for details about each strategy.

### Non-interruptible

Workloads that should not be interrupted, such as critical training jobs and single-instance stateful apps, should not set either of the two labels in the table above. If neither label is set, the Pod is considered non-interruptible.

CKS treats Nodes with non-interruptible Pods as active. This means CKS will not proceed with Node maintenance, reboots, or scale-down while these Pods are running. Non-interruptible Pods are never evicted unless an extreme event occurs, such as complete Node failure or DC power loss. This is the default behavior for Pods.

### Interruptible

Workloads such as inference Pods or stateless applications that can be restarted without data loss should use the interruptible strategy. CKS terminates these Pods and proceeds with the Node action (such as reboot or scale-down) without waiting for the Pod's full `terminationGracePeriodSeconds`.

Never apply this label to distributed training jobs, single-instance databases, stateful services, or any workload that cannot tolerate sudden termination. Evicting these workloads may cause data loss or require costly restarts across multiple Nodes.

To choose this strategy, apply the label `qos.coreweave.cloud/interruptable: "true"` to your Pods. This label is in the `qos.coreweave.cloud` namespace.

```yaml theme={"system"}
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"
```

Note that for historical reasons, the label is misspelled as `interruptable`.
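
When Pods are created by a controller such as a Deployment, the label belongs in the Pod template, because Kubernetes does not copy labels from the Deployment's own `metadata` onto its Pods. The sketch below assumes a stateless inference service; the names and the image are hypothetical.

```yaml theme={"system"}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-demo                # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-demo
  template:
    metadata:
      labels:
        app: inference-demo
        # Set on the Pod template so every replica carries the label
        qos.coreweave.cloud/interruptable: "true"
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest   # hypothetical image
```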

### Gracefully interruptible

Workloads that can exit cleanly before Node maintenance begins should consider using the gracefully interruptible strategy. This is particularly useful for stateful applications that can handle graceful shutdowns, such as replicated databases or stateful services.

Unlike interruptible Pods, which are deleted immediately without waiting, gracefully interruptible Pods **block the Node action** (including force reboots) until the Pod terminates or its `terminationGracePeriodSeconds` expires. CKS sends a `DELETE` call to the Pod and waits for the full grace period before proceeding.

To choose this strategy, apply the label `qos.coreweave.com/graceful-interruptible: "true"` to your Pods. This label is in the `qos.coreweave.com` namespace.

```yaml theme={"system"}
metadata:
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
```

### Key behaviors and limitations

#### NodePool scale-down

CKS skips Nodes that host `graceful-interruptible` Pods when it selects candidates for removal. This means scale-down may stall if all Pods on candidate Nodes carry this label.

#### Tolerations that prevent graceful eviction

Pods that tolerate either `node.coreweave.cloud/evict=true:NoExecute` or `node.coreweave.cloud/reserved:NoExecute` will not be processed via the `graceful-interruptible` logic and may be evicted immediately. See [Eviction taints](#eviction-taints) below.

#### Drain time differences

* Reboots and maintenance use a default drain timeout of three minutes, and honor `terminationGracePeriodSeconds` for `graceful-interruptible` Pods.
* CKS-initiated scale-down does not wait for DaemonSets unless they carry `qos.coreweave.com/graceful-interruptible: "true"` and do not tolerate eviction taints.
* Services that need long termination phases should explicitly set `terminationGracePeriodSeconds` accordingly.
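
As a sketch of the last point, the Pod below pairs the `graceful-interruptible` label with an explicit grace period. The 300-second value, the Pod name, and the image are assumptions for illustration; size the grace period to your service's actual shutdown path. If the field is omitted, Kubernetes defaults to 30 seconds.

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: graceful-demo                          # hypothetical name
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
spec:
  terminationGracePeriodSeconds: 300           # assumed value, in seconds
  containers:
    - name: app
      image: registry.example.com/app:latest   # hypothetical image
```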

#### Potential for scale-down stalls

By design, CKS will never remove a Node containing only `graceful-interruptible` Pods. If every Pod on a candidate Node carries this label, CKS has nowhere to reclaim capacity and will stall waiting for Nodes that can be safely drained. In practice, this can block automated scale-down workflows.

#### Risk of stuck Nodes

If workloads are deployed without accounting for `graceful-interruptible` semantics, Nodes can remain in a quasi-drained state indefinitely. For example, you may cordon a Node for maintenance, then find it never transitions to "Ready" again because all Pods refuse immediate eviction. Left unchecked, these Nodes consume capacity and can complicate rolling updates.

To mitigate these risks:

1. Plan deployment strategies so that some Pods labeled `qos.coreweave.cloud/interruptable` exist, giving CKS safe eviction candidates.
2. Monitor NodePool capacity and scheduling health. Set up alerts on stalled scale-down events or sustained high utilization to detect when Nodes are being held due to `graceful-interruptible` Pods.
3. Establish maintenance procedures that include manual intervention steps (e.g., draining and deleting problematic Nodes) as a fallback when automated processes cannot reclaim resources.

By weighing the benefits of smoother in-place upgrades against these trade-offs, teams can decide how and when to use `graceful-interruptible` without compromising cluster resilience or cost efficiency.

## Taints and tolerations

CKS uses taints to guard control-plane Nodes and enforce GPU/CPU scheduling.

### Eviction taints

* `node.coreweave.cloud/evict=true:NoExecute`
* `node.coreweave.cloud/reserved:NoExecute`

Pods that tolerate these taints bypass graceful eviction and may be evicted immediately.
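
The fragment below sketches tolerations to look for when auditing manifests; it is not a recommended configuration. Under standard Kubernetes matching rules, each entry tolerates at least one of the eviction taints. Whether your workloads need any of them is an assumption to verify case by case.

```yaml theme={"system"}
spec:
  tolerations:
    # Explicitly tolerates the evict taint
    - key: node.coreweave.cloud/evict
      operator: Equal
      value: "true"
      effect: NoExecute
    # Tolerates the reserved taint regardless of its value
    - key: node.coreweave.cloud/reserved
      operator: Exists
      effect: NoExecute
    # Blanket toleration: an empty key with Exists matches every taint
    - operator: Exists
```

The blanket form is the easiest to miss, because it names neither taint yet still matches both.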

### User taints

The CPU taint (`is_cpu_compute:NoSchedule`) is automatically tolerated by Pods without GPU requests.

```yaml title="CPU taint" theme={"system"}
  - effect: NoSchedule
    key: is_cpu_compute
```

The GPU taint (`is_gpu=true:PreferNoSchedule`) prevents CPU-only Pods from scheduling on GPU Nodes unless necessary. A CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

```yaml title="GPU taint" theme={"system"}
  - effect: PreferNoSchedule
    key: is_gpu
    value: "true"
```
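
For contrast with a CPU-only Pod, the sketch below shows a Pod that requests a GPU through the `nvidia.com/gpu` resource. The Pod name and the image are hypothetical. Because `PreferNoSchedule` is a soft taint, the Pod needs no toleration in this sketch.

```yaml theme={"system"}
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                                   # hypothetical name
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1                        # omit for a CPU-only Pod
```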

<Danger>
  Customers should exercise caution before adding tolerations to their Pods. CKS taints help ensure that workloads always run on healthy Nodes, and a toleration can override that protection.
</Danger>

## SUNK-specific scheduling

### SUNK's `/lock` taint

To prevent contention with other Pods that request GPU access while long-running `slurmd` Pods are active, SUNK adds a new GPU resource to Kubernetes, `sunk.coreweave.com/accelerator`, in addition to the `nvidia.com/gpu` resource provided by NVIDIA's plugin.

Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which allows Slurm Pods to request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.

SUNK manages the contention with a taint called `sunk.coreweave.com/lock`. SUNK applies this taint to Nodes by making a call to `slurm-syncer` during the Prolog phase.

```yaml title="SUNK's lock taint" theme={"system"}
  - effect: NoExecute
    key: sunk.coreweave.com/lock
    value: "true"
```

<Warning>
  Prolog completion is blocked until all Pods that do not tolerate the taint have been evicted.
</Warning>

### DaemonSets on SUNK Nodes

[Kubernetes DaemonSets](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) that run on SUNK Nodes must **tolerate** the `sunk.coreweave.com/lock` taint, as well as `is_cpu_compute`, `is_gpu`, and `node.coreweave.cloud/reserved`:

```yaml title="Example toleration" lines highlight={3-11} theme={"system"}
spec:
  tolerations:
  - key: sunk.coreweave.com/lock
    value: "true"
    operator: Equal

  - key: is_cpu_compute
    operator: Exists

  - key: is_gpu
    operator: Exists
```

## Scaling down workloads

To scale down Pods in a specific order, use the Cloud Console or CKS API to adjust the cluster specification. For more information, see the [official Kubernetes guide on Pod deletion cost](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/#pod-deletion-cost).
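
As a sketch of the upstream mechanism, Pod deletion cost is set with the `controller.kubernetes.io/pod-deletion-cost` annotation on a Pod. When a ReplicaSet scales down, it prefers to remove Pods with a lower cost first. Kubernetes treats this as best effort, so it does not guarantee the order. The value below is illustrative.

```yaml theme={"system"}
metadata:
  annotations:
    # Integer written as a string; Pods without the annotation default to 0
    controller.kubernetes.io/pod-deletion-cost: "100"   # higher cost: removed later
```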
