Workload Scheduling on CKS
Control where Pods are deployed through namespaces, labels, and taints
In CoreWeave Kubernetes Service (CKS), Nodes are organized using namespaces, labels, and taints. These features control where CoreWeave's core managed services are scheduled and ensure that customer workloads always run on healthy production Nodes.
CKS namespaces
CKS has two types of namespaces:
- User namespaces are created by customers and labeled with the customer's Org ID. Customers have full control over user namespaces: they can create, change, and delete them.
- Control plane namespaces are created by CoreWeave to host critical services that run within the cluster. Customers should not alter or delete these namespaces; workloads in them are managed by CoreWeave via automation. Jobs that run in control plane namespaces are not billed to the customer.
CKS applies the label `ns.coreweave.cloud/org: control-plane` to all control plane namespaces. To view these namespaces in a CKS cluster, select them using `kubectl` with the `--selector` option:
```bash
$ kubectl get namespaces --selector=ns.coreweave.cloud/org=control-plane
```

Output:

```
NAME                    STATUS   AGE
hpc-verification        Active   19d
kube-system             Active   22d
node-problem-detector   Active   19d
```
Node type selection labels
All CoreWeave Nodes feature Instance IDs. To ensure consistency, all Nodes within a Node Pool are tagged with their Instance ID using the `node.kubernetes.io/instance-type` label. For example, given a Node Pool comprising Nodes of type `example`, all Nodes in that Node Pool feature the label `node.kubernetes.io/instance-type=example`. For a list of all Instance IDs, see Instances.
Some customers may be using the `node.kubernetes.io/type` label. This label has also been updated to reference the new Node Instance ID. For example, `node.kubernetes.io/type=H100_NVLINK_80GB.8.xeon.128` will now return the value `node.kubernetes.io/type=gd-8xh100ib-i128`.
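To pin a workload to a specific Node type, a Pod can select on the instance-type label. A minimal sketch, reusing the `gd-8xh100ib-i128` Instance ID from the example above (the Pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: instance-type-example    # hypothetical name
spec:
  nodeSelector:
    # Schedule only on Nodes with this Instance ID
    node.kubernetes.io/instance-type: gd-8xh100ib-i128
  containers:
    - name: main
      image: ubuntu:22.04        # placeholder image
```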
Node lifecycle labels
CKS uses labels to identify a Node's state in the Node lifecycle. These labels ensure that customer workloads are always scheduled on healthy production Nodes, and that CoreWeave's critical core services are always scheduled on control plane Nodes.
- `node.coreweave.cloud/state`: Identifies the Node's state in the lifecycle, such as `production`, `zap`, or `triage`.
- `node.coreweave.cloud/reserved`: Identifies the type of workloads that can run on the Node:
  - If the Node is reserved for user workloads and `/state` is `production`, the value is the customer's Org ID.
  - If the Node is reserved for control plane workloads and `/state` is `production`, the value is `control-plane`.
  - If `/state` is not `production`, the value is the same as `/state`.
- `node.coreweave.cloud/reserved-desired`: Overrides a Node's `reserved` value. If `/reserved-desired` does not match the `/reserved` label, the Node is pending. This procedure moves Nodes from the organization's reservation to the `control-plane` reservation to prevent workloads from being scheduled on pending Nodes.
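To inspect these labels on the Nodes in a cluster, one option is kubectl's `--label-columns` (`-L`) flag, which adds the label values as output columns; a minimal sketch:

```bash
$ kubectl get nodes -L node.coreweave.cloud/state,node.coreweave.cloud/reserved
```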
User-provided labels
Customers can create custom Node labels for their own purposes, such as scheduling or organization.
However, new labels cannot be created in the `*.coreweave.cloud` or `*.coreweave.com` namespaces. Attempts to create labels in these namespaces are rejected by CKS. For example, the following labels are not allowed:
```yaml
metadata:
  labels:
    foo.coreweave.cloud/bar: "true"
    foo.coreweave.com/bar: "true"
```

Namespaces `*.coreweave.cloud` or `*.coreweave.com` are not allowed.
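By contrast, labels under a customer-owned prefix are accepted. A hypothetical example (the prefix and key here are placeholders, not a CKS convention):

```yaml
metadata:
  labels:
    example.com/team: "ml-infra"   # hypothetical customer-owned prefix
```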
Interruptable Pods
Pods with jobs that can be interrupted, such as inference jobs, should set the label `qos.coreweave.cloud/interruptable=true`:
```yaml
metadata:
  labels:
    qos.coreweave.cloud/interruptable: "true"
```
This label allows CKS to evict Pods to ensure that critical workloads, such as training jobs, run without interruption.
The label signals to CKS that the Pod can be evicted when needed, such as when the Node needs to apply upgrades or fails a health check. It is especially useful for inference Pods, which operate independently and are stateless: when evicted, an inference Pod can let any request in progress finish before shutting down, without impacting the end user.
However, this label should not be set for distributed training use cases: it can cause a job to be terminated prematurely, losing the training run's progress since the last checkpoint and forcing a job that potentially spans thousands of Nodes to restart. Single-instance databases are another example of workloads that should not set this label, because they cannot be restarted without disruption or potential data loss.
Taints
Control plane taints
CKS uses taints in control plane namespaces to control where Pods are scheduled. The taint value is a hash of the `/reserved` label, such as `control-plane` or `triage`.
Nodes inside the `control-plane` namespace are unavailable for customer workloads and are not billed to the customer.
User taints
CKS applies taints to all Nodes, including production Nodes, in user namespaces.
Customers should exercise caution when adding tolerations to their Pods, to ensure workloads always run on healthy Nodes.
CPU taints
The CKS admission policy automatically adds the `is_cpu_compute` toleration to any Pod that does not request a GPU resource, allowing those Pods to be scheduled on CPU-only Nodes. CPU-only Nodes carry the following taint:
```yaml
- effect: NoSchedule
  key: is_cpu_compute
```
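A matching Pod-side toleration has the following shape; this is a sketch of what tolerating the taint looks like, not necessarily the exact form the admission policy injects:

```yaml
tolerations:
  - key: is_cpu_compute    # tolerate the CPU-only Node taint by key
    operator: Exists
    effect: NoSchedule
```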
GPU taints
GPU Nodes have the `is_gpu=true:PreferNoSchedule` taint to discourage Pods that request only CPU resources from scheduling onto GPU Nodes unless necessary. Because the effect is `PreferNoSchedule`, a CPU-only Pod can still schedule on a GPU Node if no CPU Nodes are available.
```yaml
- effect: PreferNoSchedule
  key: is_gpu
  value: "true"
```
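A Pod that requests GPUs needs no toleration for this taint, since `PreferNoSchedule` is only a soft preference; and because it requests a GPU resource, the admission policy does not add the `is_cpu_compute` toleration, so it cannot land on tainted CPU-only Nodes. A minimal sketch (the name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example      # hypothetical name
spec:
  containers:
    - name: main
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1    # requesting a GPU resource
```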
SUNK-specific scheduling
SUNK's `/lock` taint
To prevent contention with other Pods that request GPU access while long-running `slurmd` Pods are active, SUNK adds a new GPU resource to Kubernetes, `sunk.coreweave.com/accelerator`, in addition to the `nvidia.com/gpu` resource provided by NVIDIA's plugin.
Because the GPU has two different resource names, Kubernetes tracks the consumption separately, which allows Slurm Pods to request the same underlying GPU as other Kubernetes Pods. However, this requires SUNK to manage GPU contention instead of the Kubernetes scheduler.
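For illustration only (SUNK creates these Pods itself, and the quantity here is a placeholder): a Slurm-managed Pod counts against the SUNK resource name rather than `nvidia.com/gpu`, so the two consumers draw from separate quotas for the same physical GPUs. A sketch of the resource request:

```yaml
resources:
  limits:
    # SUNK-managed slurmd Pods count against this resource name,
    # not nvidia.com/gpu, so Kubernetes tracks the two independently.
    sunk.coreweave.com/accelerator: 8
```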
SUNK manages the contention with a taint called `sunk.coreweave.com/lock`. SUNK applies this taint to Nodes by making a call to `slurm-syncer` during the prolog phase.
```yaml
- effect: NoExecute
  key: sunk.coreweave.com/lock
  value: "true"
```
Prolog completion is blocked until all Pods that do not tolerate the taint have been evicted.
DaemonSets on SUNK Nodes
Kubernetes DaemonSets that run on SUNK Nodes must tolerate the `sunk.coreweave.com/lock` taint, as well as `is_cpu_compute`, `is_gpu`, and `node.coreweave.cloud/reserved`:
```yaml
spec:
  tolerations:
    - key: sunk.coreweave.com/lock
      value: "true"
      operator: Equal

    - key: is_cpu_compute
      operator: Exists

    - key: is_gpu
      operator: Exists

    - key: node.coreweave.cloud/reserved
      value: <ORG_ID_HASH_VALUE>
      operator: Equal
```
Best practice: scaling down
Customers wishing to scale down Pods in a specific order can do so by annotating their Cluster manifest. For more information on scaling down resources in general, see the official Kubernetes documentation.
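One upstream Kubernetes mechanism for influencing scale-down order is the `controller.kubernetes.io/pod-deletion-cost` annotation, which ReplicaSet-based controllers consult when choosing which Pods to remove first. A minimal sketch (this is a general Kubernetes feature, not a CKS-specific API):

```yaml
metadata:
  annotations:
    # Pods with lower deletion cost are scaled down before
    # Pods with higher cost in the same ReplicaSet.
    controller.kubernetes.io/pod-deletion-cost: "-10"
```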