
Workload Scheduling on CKS

Control where Pods are deployed through namespaces, labels, and taints

In CoreWeave Kubernetes Service (CKS), Nodes are organized using namespaces, labels, and taints. These features control where CoreWeave's core managed services are scheduled and ensure that customer workloads always run on healthy production Nodes.

CKS namespaces

CKS has two types of namespaces:

  • User namespaces are created by customers and labeled with the customer's Org ID.
    • Customers have full control over user namespaces - they can create, change, and delete them.
  • Control plane namespaces are created by CoreWeave to host critical services that run within the cluster.
    • Customers should not alter or delete these namespaces. CoreWeave manages the workloads in these namespaces through automation. Jobs that run in control plane namespaces are not billed to the customer.

CKS applies the label ns.coreweave.cloud/org: control-plane to all control plane namespaces. To view these namespaces in a CKS cluster, select them using kubectl with the --selector option:

Example
$ kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane

Output:

Example
NAME                    STATUS   AGE
hpc-verification        Active   19d
kube-system             Active   22d
node-problem-detector   Active   19d

Node type selection labels

Every CoreWeave Node has an Instance ID. To ensure consistency, all Nodes within a Node Pool are tagged with their Instance ID using the instance-type label. For example, in a Node Pool composed of the Node type example, every Node carries the label node.kubernetes.io/instance-type=instance-type-example. For a list of all Instance IDs, see: Instances.

Note

Some customers may be using the node.kubernetes.io/type label. This label has also been updated to reference the new Node Instance ID. For example, a label that previously read node.kubernetes.io/type=H100_NVLINK_80GB.8.xeon.128 now returns node.kubernetes.io/type=gd-8xh100ib-i128.
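
To schedule a workload on a specific Node type, the instance-type label can be used as a nodeSelector. This is a minimal sketch; the Instance ID shown (gd-8xh100ib-i128, taken from the note above) should be replaced with the relevant type:

Example
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: gd-8xh100ib-i128

Nodes of a given type can also be listed with kubectl get nodes --selector=node.kubernetes.io/instance-type=<INSTANCE_ID>.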

Node lifecycle labels

CKS uses labels to identify a Node's state in the Node lifecycle. These labels ensure that customer workloads are always scheduled on healthy production Nodes, and that CoreWeave's critical core services are always scheduled on control plane Nodes.

  • node.coreweave.cloud/state: Identifies the Node's state in the lifecycle, such as production, zap, or triage.
  • node.coreweave.cloud/reserved: Identifies the type of workloads that can run on the Node:
    • If the Node is reserved for user workloads and /state is production, the value is the customer's Org ID.
    • If the Node is reserved for control plane workloads and /state is production, the value is control-plane.
    • If /state is not production, the value is the same as /state.
  • node.coreweave.cloud/reserved-desired: Overrides a Node's reserved value. If /reserved-desired does not match the /reserved label, the Node is pending. CKS uses this mechanism to move Nodes from the organization's reservation to the control-plane reservation, which prevents workloads from being scheduled on pending Nodes.
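
To inspect these labels on the Nodes in a cluster, pass the label keys to the -L (label-columns) flag of kubectl get nodes, which prints each label as its own column. A minimal sketch:

Example
$ kubectl get nodes -L node.coreweave.cloud/state,node.coreweave.cloud/reserved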

User-provided labels

Customers can create custom Node labels for their own purposes, such as scheduling or organization.

However, new labels cannot be created in the *.coreweave.cloud or *.coreweave.com namespaces. Attempts to create labels in these namespaces are rejected by CKS. For example, the following labels are not allowed:

Example
metadata:
  labels:
    foo.coreweave.cloud/bar: "true"
    foo.coreweave.com/bar: "true"

Labels in the `*.coreweave.cloud` or `*.coreweave.com` namespaces are not allowed, so CKS rejects this manifest.
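
By contrast, labels under a customer-owned prefix are accepted. A minimal sketch, assuming a hypothetical prefix example.com and a hypothetical Node named node-001:

Example
$ kubectl label node node-001 example.com/team=ml-research

The label can then be used in nodeSelector or affinity rules like any other Node label.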

Interruptable Pods

Pods with jobs that can be interrupted, such as inference jobs, should set the label qos.coreweave.cloud/interruptable=true:

Example
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"

This label allows CKS to evict Pods to ensure that critical workloads, such as training jobs, run without interruption.

The label signals to CKS that the Pod can be evicted when needed, such as when the Node needs to apply upgrades or fails a health check. It is particularly useful for inference Pods, which are stateless and operate independently. When an inference Pod is evicted, it can finish the request in progress before shutting down, without impacting the end user.
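
As an illustration, the label belongs in the Pod template of the workload. A minimal sketch of a stateless inference Deployment, using a hypothetical name and image:

Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
        qos.coreweave.cloud/interruptable: "true"   # marks these Pods as safe to evict
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1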

However, this label should not be set for distributed training use cases: a premature eviction terminates the job, losing all progress since the last checkpoint and forcing a job that potentially spans thousands of Nodes to be restarted. Single-instance databases are another example of workloads that should not set this label, because they cannot be restarted without disruption or potential data loss.

Taints

Control plane taints

CKS applies taints to control plane Nodes to control where Pods are scheduled. The taint value is a hash of the /reserved label value, such as control-plane or triage.

Info

Nodes reserved for the control plane are unavailable for customer workloads and are not billable to the customer.

User taints

CKS applies taints to all Nodes reserved for user workloads, including production Nodes.

Danger

Customers should exercise caution before adding tolerations to their Pods, to ensure that workloads always run on healthy Nodes.

CPU taints

The CKS admission policy automatically adds the is_cpu_compute toleration to any Pod that does not request a GPU resource, so these Pods can be scheduled on CPU-only Nodes.

CPU taint
- effect: NoSchedule
  key: is_cpu_compute
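
On the Pod side, the injected toleration would look like the following sketch; the exact form CKS injects may differ:

Example
tolerations:
  - key: is_cpu_compute
    operator: Exists
    effect: NoSchedule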

GPU taints

GPU Nodes have the is_gpu=true:PreferNoSchedule taint, which discourages Pods that request only CPU resources from scheduling onto GPU Nodes. Because the effect is PreferNoSchedule, a CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.

GPU taint
- effect: PreferNoSchedule
  key: is_gpu
  value: "true"

SUNK-specific scheduling

SUNK's /lock taint

To prevent contention with other Pods that request GPU access while long-running slurmd Pods are active, SUNK adds a new GPU resource to Kubernetes, sunk.coreweave.com/accelerator, in addition to the nvidia.com/gpu resource provided by NVIDIA's plugin.

Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which allows Slurm Pods to request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.
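
For illustration, the same physical GPUs are accounted for under two resource names. The counts below are hypothetical, and the exact requests SUNK generates for slurmd Pods may differ:

Example
# A slurmd Pod requests the SUNK-managed resource name
resources:
  limits:
    sunk.coreweave.com/accelerator: 8

# A regular Kubernetes Pod requests the resource name from NVIDIA's device plugin
resources:
  limits:
    nvidia.com/gpu: 8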

SUNK manages the contention with a taint called sunk.coreweave.com/lock. SUNK applies this taint to Nodes by making a call to slurm-syncer during the prolog phase.

SUNK's lock taint
- effect: NoExecute
  key: sunk.coreweave.com/lock
  value: "true"

Important

Prolog completion is blocked until all Pods that do not tolerate the taint have been evicted.

DaemonSets on SUNK Nodes

Kubernetes DaemonSets that run on SUNK Nodes must tolerate the sunk.coreweave.com/lock taint, as well as is_cpu_compute, is_gpu, and node.coreweave.cloud/reserved:

Example toleration
spec:
  tolerations:
    - key: sunk.coreweave.com/lock
      value: "true"
      operator: Equal
    - key: is_cpu_compute
      operator: Exists
    - key: is_gpu
      operator: Exists
    - key: node.coreweave.cloud/reserved
      value: <ORG_ID_HASH_VALUE>
      operator: Equal

Best practice: scaling down

Customers wishing to scale down Pods in a specific order can do so by annotating their Cluster manifest. For more information on scaling down resources in general, see the official Kubernetes documentation.
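
One standard Kubernetes mechanism for influencing scale-down order is the controller.kubernetes.io/pod-deletion-cost annotation: when a ReplicaSet or Deployment scales down, Pods with a lower deletion cost are removed first. A minimal sketch, assuming hypothetical Pod names:

Example
$ kubectl annotate pod inference-server-abc12 controller.kubernetes.io/pod-deletion-cost="-100"   # removed earlier
$ kubectl annotate pod inference-server-def34 controller.kubernetes.io/pod-deletion-cost="100"    # removed later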