
Workload Scheduling on CKS

Control where Pods are deployed through namespaces, labels, and taints

In CoreWeave Kubernetes Service (CKS), Nodes are organized using namespaces, labels, and taints. These features control where CoreWeave's core managed services are scheduled and ensure that customer workloads always run on healthy production Nodes.

CKS namespaces

CKS has two types of namespaces:

  • User namespaces are created by customers and labeled with the customer's Org ID.
    • Customers have full control over user namespaces - they can create, change, and delete them.
  • Control plane namespaces are created by CoreWeave to host critical services that run within the cluster.
    • Customers should not alter or delete these namespaces. CoreWeave manages the workloads in these namespaces through automation. Jobs that run in control plane namespaces are not billed to the customer.

CKS applies the label ns.coreweave.cloud/org: control-plane to all control plane namespaces. To view these namespaces in a CKS cluster, select them using kubectl with the --selector option:

Example
$ kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane

Output:

Example
NAME                    STATUS   AGE
hpc-verification        Active   19d
kube-system             Active   22d
node-problem-detector   Active   19d

Node type selection labels

Every CoreWeave Node has an Instance ID. To ensure consistency, all Nodes within a Node Pool are tagged with their Instance ID using the instance-type label. For example, in a Node Pool composed of the Node type example, every Node carries the label node.kubernetes.io/instance-type=instance-type-example. For a list of all Instance IDs, see: Instances.

Note

Some customers may be using the node.kubernetes.io/type label. This label has also been updated to reference the new Node Instance ID. For example, a label that previously read node.kubernetes.io/type=H100_NVLINK_80GB.8.xeon.128 now returns node.kubernetes.io/type=gd-8xh100ib-i128.
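
To schedule a workload on a specific Node type, the instance-type label can be used as a nodeSelector. This is a minimal sketch; the Instance ID shown (gd-8xh100ib-i128, taken from the note above) should be replaced with the relevant type:

Example
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: gd-8xh100ib-i128

Nodes of a given type can also be listed with kubectl get nodes --selector=node.kubernetes.io/instance-type=<INSTANCE_ID>.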

Node lifecycle labels

CKS uses labels to identify a Node's state in the Node lifecycle. These labels ensure that customer workloads are always scheduled on healthy production Nodes, and that CoreWeave's critical core services are always scheduled on control plane Nodes.

  • node.coreweave.cloud/state: Identifies the Node's state in the lifecycle, such as production, zap, or triage.
  • node.coreweave.cloud/reserved: Identifies the type of workloads that can run on the Node:
    • If the Node is reserved for user workloads and /state is production, the value is the customer's Org ID.
    • If the Node is reserved for control plane workloads and /state is production, the value is control-plane.
    • If /state is not production, the value is the same as /state.
  • node.coreweave.cloud/reserved-desired: Overrides a Node's reserved value. If /reserved-desired does not match the /reserved label, the Node is pending. CKS uses this mechanism to move Nodes from the organization's reservation to the control-plane reservation, which prevents workloads from being scheduled on pending Nodes.
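
To inspect these labels on the Nodes in a cluster, pass the label keys to the -L (label-columns) flag of kubectl get nodes, which prints each label as its own column. A minimal sketch:

Example
$ kubectl get nodes -L node.coreweave.cloud/state,node.coreweave.cloud/reserved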

User-provided labels

Customers can create custom Node labels for their own purposes, such as scheduling or organization.

However, new labels cannot be created in the *.coreweave.cloud or *.coreweave.com namespaces. Attempts to create labels in these namespaces are rejected by CKS. For example, the following labels are not allowed:

Example
metadata:
  labels:
    foo.coreweave.cloud/bar: "true"
    foo.coreweave.com/bar: "true"

Labels in the `*.coreweave.cloud` or `*.coreweave.com` namespaces are not allowed, so CKS rejects this manifest.
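
By contrast, labels under a customer-owned prefix are accepted. A minimal sketch, assuming a hypothetical prefix example.com and a hypothetical Node named node-001:

Example
$ kubectl label node node-001 example.com/team=ml-research

The label can then be used in nodeSelector or affinity rules like any other Node label.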

Interruptable Pods

Pods with jobs that can be interrupted, such as inference jobs, should set the label qos.coreweave.cloud/interruptable=true:

Example
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"

This label allows CKS to evict Pods to ensure that critical workloads, such as training jobs, run without interruption.

The label signals to CKS that the Pod can be evicted when needed, such as when the Node needs to apply upgrades or fails a health check. It is particularly useful for inference Pods, which are stateless and operate independently. When an inference Pod is evicted, it can finish the request in progress before shutting down, without impacting the end user.
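
As an illustration, the label belongs in the Pod template of the workload. A minimal sketch of a stateless inference Deployment, using a hypothetical name and image:

Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
        qos.coreweave.cloud/interruptable: "true"   # marks these Pods as safe to evict
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1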

However, this label should not be set for distributed training use cases: a premature eviction terminates the job, losing all progress since the last checkpoint and forcing a job that potentially spans thousands of Nodes to be restarted. Single-instance databases are another example of workloads that should not set this label, because they cannot be restarted without disruption or potential data loss.

Taints

Control plane taints

CKS applies taints to control plane Nodes to control where Pods are scheduled. The taint value is a hash of the /reserved label value, such as control-plane or triage.

Info

Nodes reserved for the control plane are unavailable for customer workloads and are not billable to the customer.

User taints

CKS applies taints to all Nodes reserved for user workloads, including production Nodes.

Danger

Customers should exercise caution before adding tolerations to their Pods, to ensure that workloads always run on healthy Nodes.

CPU taints

The CKS admission policy automatically adds the is_cpu_compute toleration to any Pod that does not request a GPU resource, so these Pods can be scheduled on CPU-only Nodes.

CPU taint
- effect: NoSchedule
  key: is_cpu_compute
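
On the Pod side, the injected toleration would look like the following sketch; the exact form CKS injects may differ:

Example
tolerations:
  - key: is_cpu_compute
    operator: Exists
    effect: NoSchedule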

GPU taints

GPU Nodes have the is_gpu=true:PreferNoSchedule taint, which discourages Pods that request only CPU resources from scheduling onto GPU Nodes. Because the effect is PreferNoSchedule, a CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

GPU taint
- effect: PreferNoSchedule
  key: is_gpu
  value: "true"

SUNK-specific scheduling

SUNK's /lock taint

To prevent contention with other Pods that request GPU access while long-running slurmd Pods are active, SUNK adds a new GPU resource to Kubernetes, sunk.coreweave.com/accelerator, in addition to the nvidia.com/gpu resource provided by NVIDIA's plugin.

Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which allows Slurm Pods to request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.
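
For illustration, the same physical GPUs are accounted for under two resource names. The counts below are hypothetical, and the exact requests SUNK generates for slurmd Pods may differ:

Example
# A slurmd Pod requests the SUNK-managed resource name
resources:
  limits:
    sunk.coreweave.com/accelerator: 8

# A regular Kubernetes Pod requests the resource name from NVIDIA's device plugin
resources:
  limits:
    nvidia.com/gpu: 8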

SUNK manages the contention with a taint called sunk.coreweave.com/lock. SUNK applies this taint to Nodes by making a call to slurm-syncer during the prolog phase.

SUNK's lock taint
- effect: NoExecute
  key: sunk.coreweave.com/lock
  value: "true"

Important

Prolog completion is blocked until all Pods that do not tolerate the taint have been evicted.

DaemonSets on SUNK Nodes

Kubernetes DaemonSets that run on SUNK Nodes must tolerate the sunk.coreweave.com/lock taint, as well as is_cpu_compute, is_gpu, and node.coreweave.cloud/reserved:

Example toleration
spec:
  tolerations:
    - key: sunk.coreweave.com/lock
      value: "true"
      operator: Equal
    - key: is_cpu_compute
      operator: Exists
    - key: is_gpu
      operator: Exists
    - key: node.coreweave.cloud/reserved
      value: <ORG_ID_HASH_VALUE>
      operator: Equal

Best practice: scaling down

Customers wishing to scale down Pods in a specific order can do so by annotating their Cluster manifest. For more information on scaling down resources in general, see the official Kubernetes documentation.
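
One standard Kubernetes mechanism for influencing scale-down order is the controller.kubernetes.io/pod-deletion-cost annotation: when a ReplicaSet or Deployment scales down, Pods with a lower deletion cost are removed first. A minimal sketch, assuming hypothetical Pod names:

Example
$ kubectl annotate pod inference-server-abc12 controller.kubernetes.io/pod-deletion-cost="-100"   # removed earlier
$ kubectl annotate pod inference-server-def34 controller.kubernetes.io/pod-deletion-cost="100"    # removed later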