Workload Scheduling on CKS
Control where Pods are deployed through namespaces, labels, and taints
In CoreWeave Kubernetes Service (CKS), workload scheduling is controlled with namespaces, labels, taints, and eviction policies. These features control where CoreWeave's core managed services are scheduled and ensure that customer workloads always run on healthy production Nodes.
CKS namespaces
CKS has two types of namespaces:
- User namespaces are created by customers and labeled with the customer's Org ID.
  - Customers have full control over user namespaces: they can create, change, and delete them.
- Control Plane namespaces are created by CoreWeave to host critical services that run within the cluster.
  - Customers should not alter or delete these namespaces. CoreWeave workloads are automated within these managed namespaces. Jobs that run in Control Plane namespaces are not billed to the customer.
CKS applies the label `ns.coreweave.cloud/org=control-plane` to all Control Plane namespaces. To view these namespaces in a CKS cluster:

```
$ kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane
```
Node type selection labels
All CoreWeave Nodes feature Instance IDs. To ensure consistency, all Nodes within a Node Pool are tagged with an Instance ID using the `node.kubernetes.io/instance-type` label. For example:

```
node.kubernetes.io/instance-type=instance-type-example
```
For a list of all Instance IDs, see: Instances.
Some customers may be using the `node.kubernetes.io/type` label. This label has been updated to reference the new Instance ID.
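To pin a workload to a specific instance type, the `instance-type` label can be used as a `nodeSelector`. A minimal sketch, using the placeholder Instance ID from the example above (the Pod name and image are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeSelector:
    # Placeholder Instance ID; substitute a real one from the Instances list
    node.kubernetes.io/instance-type: instance-type-example
  containers:
    - name: app
      image: nginx
```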
Node lifecycle labels
CKS uses labels to identify a Node's state in the Node lifecycle. These labels ensure that customer workloads are always scheduled on healthy production Nodes and CoreWeave's critical core services are scheduled on Control Plane Nodes.
- `node.coreweave.cloud/state`: Identifies the Node's lifecycle state, such as `production`, `zap`, or `triage`.
- `node.coreweave.cloud/reserved`: Identifies the workload type running on the Node:
  - If `/reserved` is a customer Org ID and `/state=production`, the Node is for user workloads.
  - If `/reserved=control-plane` and `/state=production`, the Node is for Control Plane workloads.
  - If `/state` is not `production`, then `/reserved` matches `/state`.
- `node.coreweave.cloud/reserved-desired`: Overrides `/reserved`. If it doesn't match `/reserved`, the Node is marked pending and will transition reservations automatically.
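To inspect these labels across a cluster, a standard kubectl label-column selector can be used (a sketch; the output depends on your cluster):

```shell
kubectl get nodes -L node.coreweave.cloud/state,node.coreweave.cloud/reserved
```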
User-provided labels
Customers may create custom Node labels for scheduling or organization, but never in the `*.coreweave.cloud` or `*.coreweave.com` namespaces. Attempts to do so are rejected by CKS.

```yaml
metadata:
  labels:
    foo.coreweave.cloud/bar: "true" # Not allowed
    foo.coreweave.com/bar: "true"   # Not allowed
```
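Labels outside these reserved namespaces are accepted. For example, a hypothetical team-owned prefix:

```yaml
metadata:
  labels:
    example.mycompany.com/team: "research" # Allowed: not a coreweave.cloud or coreweave.com namespace
```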
Pod interruption and eviction policies
CKS supports three eviction strategies for Pods: non-interruptible, interruptible, and gracefully interruptible. These strategies help manage resource allocation during Node maintenance or scale-down events.
Summary of eviction strategies
| Strategy | Pod label | Description | Note |
|---|---|---|---|
| Non-interruptible | None (default) | Pods are never evicted during maintenance or scale-down. Use for critical training jobs and single-instance stateful apps. | Default behavior for Pods, ensuring stability and reliability. |
| Interruptible | `qos.coreweave.cloud/interruptable` | Triggers immediate eviction during maintenance or scale-down. Use for stateless workloads that can be restarted without data loss. | Misspelled as `interruptable` for historical reasons; in the `qos.coreweave.cloud` namespace. |
| Gracefully interruptible | `qos.coreweave.com/graceful-interruptible` | Enables graceful termination on maintenance and reboots. Use for stateful applications that can handle graceful shutdowns. | Spelled correctly; in the `qos.coreweave.com` namespace. |
See the sections below for details about each strategy.
Non-interruptible
Workloads that should not be interrupted, such as critical training jobs and single-instance stateful apps, should not set either of the two labels in the table above. If neither label is set, the Pod is considered non-interruptible.
CKS will not evict these Pods during Node maintenance or scale-down events. They are treated with the highest level of care and will never be evicted unless an extreme event occurs, such as complete Node failure or DC power loss. This is the default behavior for Pods, ensuring they remain stable and unaffected by routine maintenance or scale-down events.
Interruptible
Workloads such as inference Pods or stateless applications that can be restarted without data loss should use the interruptible strategy. This allows CKS to reclaim resources quickly during Node maintenance or scale-down events.
Never apply this label to distributed training jobs, single-instance databases, stateful services, or any workload that cannot tolerate sudden termination. Evicting these workloads may cause data loss or require costly restarts across multiple Nodes.
To choose this strategy, apply the label `qos.coreweave.cloud/interruptable: "true"` to your Pods. This label is in the `qos.coreweave.cloud` namespace.

```yaml
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"
```

Note that for historical reasons, the label is misspelled as `interruptable`.
Gracefully interruptible
Workloads that can exit cleanly before Node maintenance begins should consider using the gracefully interruptible strategy. This is particularly useful for stateful applications that can handle graceful shutdowns, such as replicated databases or stateful services. In this strategy, CKS sends a `DELETE` call to the Pod, which respects the Pod's `terminationGracePeriodSeconds` setting.

To choose this strategy, apply the label `qos.coreweave.com/graceful-interruptible: "true"` to your Pods. This label is in the `qos.coreweave.com` namespace.

```yaml
metadata:
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
```
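Because CKS honors `terminationGracePeriodSeconds` for these Pods, the label is typically paired with an explicit grace period. A minimal sketch (the Pod name, image, and 600-second value are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-pod
  labels:
    qos.coreweave.com/graceful-interruptible: "true"
spec:
  # Allow up to 10 minutes for a clean shutdown; tune for your workload
  terminationGracePeriodSeconds: 600
  containers:
    - name: app
      image: my-stateful-app:latest # hypothetical image
```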
Key behaviors and limitations
NodePool scale-down
Nodes hosting `graceful-interruptible` Pods are skipped when CKS determines candidates for removal. This means scale-down may stall if all Pods on candidate Nodes carry this label.
Tolerations that prevent graceful eviction
Pods that tolerate either `node.coreweave.cloud/evict=true:NoExecute` or `node.coreweave.cloud/reserved:NoExecute` will not be processed via the `graceful-interruptible` logic and may be evicted immediately. See Eviction taints below.
Drain time differences
- Reboots and maintenance use a default drain timeout of three minutes, and honor `terminationGracePeriodSeconds` for `graceful-interruptible` Pods.
- CKS-initiated scale-down does not wait for DaemonSets unless they carry `qos.coreweave.com/graceful-interruptible: "true"` and do not tolerate eviction taints.
- Services that need long termination phases should explicitly set `terminationGracePeriodSeconds` accordingly.
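For services that need a long termination phase, one common pattern is to pair a generous grace period with a preStop hook that flushes state before shutdown. A sketch only; the image, script path, and timing are assumptions:

```yaml
spec:
  # Give the container up to 5 minutes to finish cleanup
  terminationGracePeriodSeconds: 300
  containers:
    - name: app
      image: my-service:latest # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Hypothetical script that flushes state and drains connections
            command: ["/bin/sh", "-c", "/app/flush-and-drain.sh"]
```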
Potential for scale-down stalls
By design, CKS will never remove a Node containing only `graceful-interruptible` Pods. If every Pod on a candidate Node carries this label, CKS has nowhere to reclaim capacity and will stall waiting for Nodes that can be safely drained. In practice, this can block automated scale-down workflows.
Risk of stuck Nodes
If workloads are deployed without accounting for `graceful-interruptible` semantics, Nodes can remain in a quasi-drained state indefinitely. For example, you may cordon a Node for maintenance, then find it never transitions to "Ready" again because all Pods refuse immediate eviction. Left unchecked, these Nodes consume capacity and can complicate rolling updates.
To mitigate these risks:
- Plan deployment strategies so that some `interruptable` Pods exist to give CKS safe eviction candidates.
- Monitor Node Pool capacity and scheduling health. Set up alerts on stalled scale-down events or sustained high utilization to detect when Nodes are being held due to `graceful-interruptible` Pods.
- Establish maintenance procedures that include manual intervention steps (for example, draining and deleting problematic Nodes) as a fallback when automated processes cannot reclaim resources.
By weighing the benefits of smoother in-place upgrades against these trade-offs, teams can decide how and when to use `graceful-interruptible` without compromising cluster resilience or cost efficiency.
Taints and tolerations
CKS uses taints to guard control-plane Nodes and enforce GPU/CPU scheduling.
Eviction taints
Pods tolerating either of the following taints bypass graceful eviction and may be evicted immediately:

```
node.coreweave.cloud/evict=true:NoExecute
node.coreweave.cloud/reserved:NoExecute
```
Control Plane taints
Control Plane Nodes are tainted with a hash of `node.coreweave.cloud/reserved`. Pods without matching tolerations are prevented from scheduling.

Nodes in the `control-plane` Node Pool are unavailable for customer workloads and are not billable.

When the label value is the customer's Org ID, the `/reserved` taint is not applied.
User taints
The CPU taint (`is_cpu_compute:NoSchedule`) is automatically tolerated by Pods without GPU requests.

```yaml
- effect: NoSchedule
  key: is_cpu_compute
```
The GPU taint (`is_gpu=true:PreferNoSchedule`) prevents CPU-only Pods from scheduling on GPU Nodes unless necessary. A CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

```yaml
- effect: PreferNoSchedule
  key: is_gpu
  value: "true"
```
Customers should exercise caution before adding tolerations to their Pods, to ensure workloads always run on healthy Nodes.
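Rather than adding tolerations manually, the usual pattern is to let resource requests drive placement: a Pod that requests a GPU is steered to GPU Nodes, while a Pod with no GPU request automatically tolerates the CPU taint. A minimal sketch (the Pod name and image are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: trainer
      image: my-training-image:latest # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1 # the GPU request steers this Pod to GPU Nodes
```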
SUNK-specific scheduling
SUNK's `/lock` taint
To prevent contention with other Pods that request GPU access while long-running `slurmd` Pods are active, SUNK adds a new GPU resource to Kubernetes, `sunk.coreweave.com/accelerator`, in addition to the `nvidia.com/gpu` resource provided by NVIDIA's plugin.
Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which allows Slurm Pods to request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.
SUNK manages the contention with a taint called `sunk.coreweave.com/lock`. SUNK applies this taint to Nodes by making a call to `slurm-syncer` during the Prolog phase.

```yaml
- effect: NoExecute
  key: sunk.coreweave.com/lock
  value: "true"
```
Prolog completion is blocked until all Pods that do not tolerate the taint have been evicted.
DaemonSets on SUNK Nodes
Kubernetes DaemonSets that run on SUNK Nodes must tolerate the `sunk.coreweave.com/lock` taint, as well as `is_cpu_compute`, `is_gpu`, and `node.coreweave.cloud/reserved`:

```yaml
spec:
  tolerations:
    - key: sunk.coreweave.com/lock
      value: "true"
      operator: Equal

    - key: is_cpu_compute
      operator: Exists

    - key: is_gpu
      operator: Exists

    - key: node.coreweave.cloud/reserved
      operator: Exists
```
Scaling down workloads
To scale down Pods in a specific order, use the Cloud Console or CKS API to adjust the cluster specification. For more, see the official Kubernetes guide on Pod deletion cost.
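Pod deletion cost is set with the `controller.kubernetes.io/pod-deletion-cost` annotation; within a ReplicaSet, Pods with lower values are removed first during scale-down. For example:

```yaml
metadata:
  annotations:
    # Lower values are deleted first when the ReplicaSet scales down
    controller.kubernetes.io/pod-deletion-cost: "-100"
```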