Workload scheduling on CKS

In CoreWeave Kubernetes Service (CKS), namespaces, labels, taints, and eviction policies organize Nodes. These features control where CoreWeave schedules its core managed services and ensure that customer workloads always run on healthy production Nodes. This page is a reference for cluster operators and workload owners who need to understand how CKS schedules Pods, which labels and taints CoreWeave reserves, and how to choose an eviction strategy that matches a workload’s tolerance for interruption.

CKS namespaces

CKS has two types of namespaces:

User namespaces carry your Org ID label. You have full control over user namespaces. You can create, change, and delete them.
Control Plane namespaces host critical services that run within the cluster. CoreWeave creates these namespaces. Don’t alter or delete them. CoreWeave automates workloads within these managed namespaces and doesn’t bill customers for jobs that run in Control Plane namespaces.

CKS applies the label ns.coreweave.cloud/org=control-plane to all Control Plane namespaces. To view these namespaces in a CKS cluster:

kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane

Node type selection labels

All CoreWeave GPU and CPU Nodes feature Instance IDs. To ensure consistency, every Node within a Node Pool carries an Instance ID through the instance-type label. For example:

node.kubernetes.io/instance-type=instance-type-example

For a list of all Instance IDs, see the GPU Instance and CPU Instance details.
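
As a minimal sketch of how this label can steer scheduling, the Pod below uses a standard Kubernetes nodeSelector to target one Instance ID. The Pod name, the image, and the instance-type-example value are placeholders rather than real CKS identifiers.

```yaml
# Hypothetical Pod pinned to a single instance type through nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: instance-type-demo                  # placeholder name
spec:
  nodeSelector:
    # Replace with a real Instance ID from the Instance details pages.
    node.kubernetes.io/instance-type: instance-type-example
  containers:
  - name: app
    image: registry.example.com/app:latest  # placeholder image
```

A nodeSelector is a hard requirement: if no Node carries the requested Instance ID, the Pod stays Pending.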

Some customers may use the node.kubernetes.io/type label. CoreWeave updated this label to reference the new Instance ID.

Node lifecycle labels

CKS uses labels to identify a Node’s state in the Node lifecycle. These labels ensure that CKS always schedules customer workloads on healthy production Nodes.

node.coreweave.cloud/state: Identifies the Node’s lifecycle state, such as production, zap, or triage.
node.coreweave.cloud/reserved: Identifies the workload type running on the Node:
- If /reserved is a customer Org ID and /state=production, it’s for user workloads.
- If /state is not production, then /reserved matches /state.
node.coreweave.cloud/reserved-desired: Overrides /reserved. If it doesn’t match /reserved, the marked Node is pending and transitions Reservations automatically.
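
To make the relationship between these labels concrete, the fragments below sketch how they might appear on two Nodes. The Org ID value example-org is a placeholder; real values depend on your organization.

```yaml
# Node available for customer workloads: /state is production
# and /reserved holds the customer Org ID.
metadata:
  labels:
    node.coreweave.cloud/state: production
    node.coreweave.cloud/reserved: example-org   # placeholder Org ID
---
# Node outside production: /reserved matches /state.
metadata:
  labels:
    node.coreweave.cloud/state: triage
    node.coreweave.cloud/reserved: triage
```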

User-provided labels

Customers may create custom Node labels for scheduling or organization, but never in the *.coreweave.cloud or *.coreweave.com namespaces. CKS rejects attempts to do so.

metadata:
  labels:
    foo.coreweave.cloud/bar: "true"   # Not allowed
    foo.coreweave.com/bar: "true"     # Not allowed
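
By contrast, label keys outside the reserved namespaces should not trigger this rejection. The keys below are hypothetical examples, shown on the assumption that no other restrictions apply to your cluster.

```yaml
metadata:
  labels:
    example.com/team: "research"   # hypothetical key under a domain you control
    workload-tier: "batch"         # hypothetical unprefixed key
```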

Pod interruption and eviction policies

CKS supports three eviction strategies for Pods: non-interruptible, interruptible, and gracefully interruptible. These strategies determine how CKS handles Pods during Node maintenance, reboots, or scale-down events.

Summary of eviction strategies

Strategy	Pod label	Description	Note
Non-interruptible	None (default)	CKS doesn’t proceed with maintenance or scale-down while these Pods are running. Use for critical training jobs and single-instance stateful apps.	Default behavior for Pods, ensuring stability and reliability.
Interruptible	`qos.coreweave.cloud/interruptable`	CKS terminates Pods and proceeds with the Node action without waiting for their full `terminationGracePeriodSeconds`. Use for stateless workloads that you can restart without data loss.	Misspelled as `interruptable` for historical reasons. In the `qos.coreweave.cloud` namespace.
Gracefully interruptible	`qos.coreweave.com/graceful-interruptible`	CKS blocks the Node action, including reboots, until the Pod terminates or its `terminationGracePeriodSeconds` expires. Use only for stateful applications that block rebooting until the application terminates, such as databases and local storage. For all other workloads, use interruptible.	Spelled correctly. In the `qos.coreweave.com` namespace.

The following sections describe each strategy in detail.

Non-interruptible

Workloads that you don’t want interrupted, such as critical training jobs and single-instance stateful apps, shouldn’t set either of the two labels in the preceding table. If neither label is set, CKS considers the Pod non-interruptible. This is the default behavior for Pods. CKS treats Nodes with non-interruptible Pods as active, which means it doesn’t proceed with Node maintenance, reboots, or scale-down while these Pods are running. CKS never evicts non-interruptible Pods unless an extreme event occurs, such as complete Node failure or data center power loss.

Long-running GPU Pods

A Pod that holds GPU resources (nvidia.com/gpu) indefinitely without an interruptible label has two consequences:

- Deferred maintenance stalls: CKS holds certain maintenance operations, such as removing a Node with elevated InfiniBand link flaps, until the Node becomes idle. If the Pod never stops, those operations never run. Critical hardware failures still trigger immediate action regardless of Pod state.
- Active health checks can’t run: CoreWeave runs hourly GPU health checks on Nodes where all GPUs are available. A long-running Pod that holds GPUs prevents health checks from scheduling on that Node for the duration of the Pod’s lifetime.

To allow deferred maintenance to proceed, apply the graceful-interruptible or interruptible label to Pods that can tolerate interruption. CKS excludes Pods with these labels when determining Node idle status, so the Node can receive maintenance operations. To allow health checks to run on a Node alongside a long-running GPU Pod, the Pod must not hold any GPUs. SUNK addresses this by introducing a separate sunk.coreweave.com/accelerator resource for Slurm Pods, leaving the Node’s GPUs available for health checks. See SUNK-specific scheduling for details.

Interruptible

Workloads such as inference Pods or stateless applications that you can restart without data loss should use the interruptible strategy. CKS terminates these Pods and proceeds with the Node action (such as reboot or scale-down) without waiting for the Pod’s full terminationGracePeriodSeconds.

Never apply this label to distributed training jobs, single-instance databases, stateful services, or any workload that can’t tolerate sudden termination. Evicting these workloads may cause data loss or require costly restarts across multiple Nodes.

To choose this strategy, apply the label qos.coreweave.cloud/interruptable: "true" to your Pods. This label is in the qos.coreweave.cloud namespace.

metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"

Note that for historical reasons, the label is misspelled as interruptable.
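
Because the label applies to Pods, workloads managed by a controller need it on the Pod template rather than on the controller object itself. The following is a minimal sketch for a stateless Deployment; the names, replica count, and image are hypothetical.

```yaml
# Hypothetical stateless inference Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-demo
  template:
    metadata:
      labels:
        app: inference-demo
        # The label sits on the Pod template so every replica carries it.
        qos.coreweave.cloud/interruptable: "true"
    spec:
      containers:
      - name: server
        image: registry.example.com/inference:latest  # placeholder image
```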

Gracefully interruptible

Use the gracefully interruptible strategy only for stateful applications that block a reboot until the application terminates, such as databases and local storage. These workloads need to finish writing data or hand off state before a Node reboots or drains. For all other applications, including stateless workloads that can be restarted without data loss, use the interruptible strategy instead.

Unlike interruptible Pods, which CKS deletes immediately without waiting, gracefully interruptible Pods block the Node action (including force reboots) until the Pod terminates or its terminationGracePeriodSeconds expires. CKS sends a DELETE call to the Pod and waits up to the full grace period before proceeding.

To choose this strategy, apply the label qos.coreweave.com/graceful-interruptible: "true" to your Pods. This label is in the qos.coreweave.com namespace.

metadata:
  labels:
    qos.coreweave.com/graceful-interruptible: "true"

Key behaviors and limitations

The following sections describe behaviors and limitations to consider when using graceful-interruptible Pods.

NodePool scale-down

When CKS selects Nodes for removal, it skips Nodes that host graceful-interruptible Pods. This means scale-down may stall if every Pod on candidate Nodes carries this label.

Tolerations that prevent graceful eviction

Pods that tolerate either node.coreweave.cloud/evict=true:NoExecute or node.coreweave.cloud/reserved:NoExecute don’t go through the graceful-interruptible logic and may be evicted immediately. See Eviction taints.

Drain time differences

Reboots and maintenance use a default drain timeout of 3 minutes, and honor terminationGracePeriodSeconds for graceful-interruptible Pods.
CKS-initiated scale-down doesn’t wait for DaemonSets unless they carry qos.coreweave.com/graceful-interruptible: "true" and don’t tolerate eviction taints.
For services that require long termination phases, explicitly set terminationGracePeriodSeconds accordingly.
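
As an illustration of the last point, the Pod below combines the graceful-interruptible label with an explicit grace period. The 600-second value, the Pod name, and the image are assumptions chosen for the example; size the grace period to how long your application actually needs to shut down cleanly.

```yaml
# Hypothetical database Pod that needs time to flush data before exiting.
apiVersion: v1
kind: Pod
metadata:
  name: db-demo
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
spec:
  # Illustrative value: allow up to 10 minutes for a clean shutdown.
  terminationGracePeriodSeconds: 600
  containers:
  - name: db
    image: registry.example.com/db:latest  # placeholder image
```

If terminationGracePeriodSeconds is omitted, Kubernetes applies its default of 30 seconds.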

Potential for scale-down stalls

By design, CKS never removes a Node that contains only graceful-interruptible Pods. If every Pod on a candidate Node carries this label, CKS has nowhere to reclaim capacity and stalls while it waits for Nodes that can be safely drained. In practice, this can block automated scale-down workflows.

Risk of stuck Nodes

If you deploy workloads without accounting for graceful-interruptible semantics, Nodes can remain in a quasi-drained state indefinitely. For example, you may cordon a Node for maintenance, then find it never transitions to “Ready” again because every Pod refuses immediate eviction. Left unchecked, these Nodes consume capacity and can complicate rolling updates.

To mitigate these risks, follow these recommendations:

Plan deployment strategies to ensure some interruptible Pods exist as safe eviction candidates for CKS.
Monitor NodePool capacity and scheduling health. Set up alerts on stalled scale-down events or sustained high utilization to detect when graceful-interruptible Pods hold Nodes.
Establish maintenance procedures that include manual intervention steps (for example, draining and deleting problematic Nodes) as a fallback when automated processes can’t reclaim resources.

Weigh the benefits of smoother in-place upgrades against these trade-offs to decide how and when to use graceful-interruptible without compromising cluster resilience or cost efficiency.

Taints and tolerations

CKS uses taints to guard control-plane Nodes and enforce GPU and CPU scheduling. The following sections describe the eviction taints CKS applies to Nodes and the user-facing taints that route Pods to the correct hardware.

Eviction taints

CKS applies the following eviction taints to Nodes:

node.coreweave.cloud/evict=true:NoExecute
node.coreweave.cloud/reserved:NoExecute

Pods that tolerate these taints bypass graceful eviction and may be evicted immediately.
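
To help recognize these tolerations when reviewing a Pod spec, the fragment below sketches how they would be written in standard Kubernetes toleration syntax. It is shown for identification only, not as a recommendation to add them.

```yaml
# Tolerations that match the two eviction taints.
# A Pod carrying either one opts out of graceful eviction.
tolerations:
- key: node.coreweave.cloud/evict
  operator: Equal
  value: "true"
  effect: NoExecute
- key: node.coreweave.cloud/reserved
  operator: Exists
  effect: NoExecute
```

A toleration with operator: Exists and no key matches every taint, so a blanket toleration of that form also covers both eviction taints.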

User taints

Pods without GPU requests automatically tolerate the CPU taint (is_cpu_compute:NoSchedule).

CPU taint

  - effect: NoSchedule
    key: is_cpu_compute

The GPU taint (is_gpu=true:PreferNoSchedule) prevents CPU-only Pods from scheduling on GPU Nodes unless necessary. A CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

GPU taint

  - effect: PreferNoSchedule
    key: is_gpu
    value: "true"

Use caution before adding tolerations to your Pods, so workloads continue to run on healthy Nodes.

SUNK-specific scheduling

The following sections describe scheduling behaviors specific to SUNK workloads on CKS.

The SUNK `/lock` taint

To prevent contention with other Pods that request GPU access while long-running slurmd Pods are active, SUNK adds a new GPU resource to Kubernetes, sunk.coreweave.com/accelerator, in addition to the nvidia.com/gpu resource provided by NVIDIA’s plugin. Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which lets Slurm Pods request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler. SUNK manages the contention with a taint called sunk.coreweave.com/lock. SUNK applies this taint to Nodes through a call to slurm-syncer during the Prolog phase.

SUNK's lock taint

  - effect: NoExecute
    key: sunk.coreweave.com/lock
    value: "true"

Prolog completion blocks until SUNK evicts every Pod that doesn’t tolerate the taint.

DaemonSets on SUNK Nodes

Kubernetes DaemonSets that run on SUNK Nodes must tolerate the sunk.coreweave.com/lock taint, as well as is_cpu_compute, is_gpu, and node.coreweave.cloud/reserved:

Example toleration

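spec:
  tolerations:
  - key: sunk.coreweave.com/lock
    value: "true"
    operator: Equal
  - key: is_cpu_compute
    operator: Exists
  - key: is_gpu
    operator: Exists
  # Tolerates the reserved taint regardless of its value.
  - key: node.coreweave.cloud/reserved
    operator: Exists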

Scale down workloads

To scale down Pods in a specific order, use the Cloud Console or CKS API to adjust the cluster specification. For more information, see the Kubernetes guide about Pod deletion cost.
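
The Pod deletion cost mentioned above is a standard Kubernetes annotation that hints which Pods a ReplicaSet should remove first when it scales down. A minimal sketch, with an arbitrary illustrative value:

```yaml
# Pod metadata fragment. Pods with a lower deletion cost are preferred
# for removal; Pods without the annotation have an implicit cost of 0.
metadata:
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"  # illustrative value
```

Kubernetes honors this annotation on a best-effort basis, so it expresses a preference rather than a guaranteed ordering.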

Troubleshoot Pod scheduling

If a GPU Pod stays Pending, see Why are my Pods not scheduling on GPU Nodes? for common causes and fixes.

Resource requests for GPU Pods

Size CPU and memory requests to roughly the per-GPU share of the instance, leaving headroom for DaemonSets and system overhead. Specify GPUs under limits as nvidia.com/gpu: [COUNT]. Kubernetes mirrors the value to requests automatically. For a worked whole-Node example, see Target specific GPUs or CPUs.

For long-running training and inference, set CPU and memory requests equal to limits so the Pod gets the Guaranteed Quality-of-Service (QoS) class and is the last to be evicted under resource pressure. See How do Kubernetes QoS classes work?.

For recommended ratios and how requests behave on GPU Pods, see What are the recommended resource requests and limits for GPU Pods? and How do CPU and memory requests work with GPU Pods?.
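
The guidance above can be sketched as a single-GPU Pod. The CPU and memory figures here are placeholders, not recommended ratios for any particular instance; derive real values from the per-GPU share of your instance type.

```yaml
# Hypothetical single-GPU training Pod with Guaranteed QoS.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      requests:
        cpu: "8"          # placeholder: roughly the per-GPU CPU share
        memory: 64Gi      # placeholder: roughly the per-GPU memory share
      limits:
        cpu: "8"          # equal to requests for Guaranteed QoS
        memory: 64Gi      # equal to requests for Guaranteed QoS
        nvidia.com/gpu: 1 # GPUs go under limits; requests default to match
```

Kubernetes assigns the Guaranteed class only when every container in the Pod sets CPU and memory requests equal to its limits.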
