Static CPU allocation and the SUNK Pod Scheduler

This page is for cluster operators and workload owners who run SUNK on CoreWeave CKS. It explains how to identify Pods that trigger static CPU allocation, why the issue causes SUNK Nodes to drain, and how to configure Pod resources to prevent it. Follow this guidance to avoid silent Slurm job failures and unexpected Node drains caused by CPU pinning.

Kubernetes can pin specific CPU cores to a Pod so that no other process on the Node shares those cores. This feature is called static CPU allocation, and the CPU Manager static policy in the kubelet controls it. CoreWeave CKS Nodes enable this policy by default. Static CPU allocation is useful for latency-sensitive workloads that benefit from dedicated cores. However, it’s incompatible with the SUNK Pod Scheduler because Slurm can’t account for CPU cores that the kubelet has pinned to other Pods.

Do not schedule Pods with Guaranteed QoS and integer CPU requests on Kubernetes Nodes used for SUNK. These Pods trigger static CPU allocation, which causes resource contention in Slurm. As a result, Slurm Nodes drain with misleading error messages such as batch job complete failure.

Static CPU allocation

A Pod triggers static CPU allocation when both of these conditions are true:

The Pod has Guaranteed QoS class, meaning every container, including init containers, sets CPU and memory requests equal to its limits.
The CPU request is a whole integer (for example, cpu: 4), not a fractional value (for example, cpu: 3.5 or cpu: 750m).

For example, this resource specification triggers static CPU allocation because requests and limits are equal and the CPU value is an integer:

resources:
  requests:
    cpu: "4"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 8Gi
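
The two conditions above can be sketched as a small check. This is an illustrative approximation, not the kubelet's actual logic: the function names are hypothetical, memory quantities are compared as plain text, and the integer test is applied per container, which is an assumption about how the CPU Manager evaluates requests.

```python
from decimal import Decimal


def cpu_cores(quantity) -> Decimal:
    """Convert a Kubernetes CPU quantity ("4", "3.5", "750m", or 4) to cores."""
    text = str(quantity).strip()
    if text.endswith("m"):  # millicores, for example "750m" is 0.75 cores
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


def is_guaranteed(containers) -> bool:
    """True when every container has CPU and memory limits with equal requests."""
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            return False
        # Kubernetes defaults a missing request to the limit.
        if cpu_cores(requests.get("cpu", limits["cpu"])) != cpu_cores(limits["cpu"]):
            return False
        # Simplification: memory is compared as text, so "8Gi" != "8192Mi" here.
        if str(requests.get("memory", limits["memory"])) != str(limits["memory"]):
            return False
    return True


def triggers_static_cpu_allocation(containers) -> bool:
    """True when the Pod is Guaranteed and some container requests whole CPUs."""
    if not is_guaranteed(containers):
        return False
    for container in containers:
        limits = container["resources"]["limits"]
        requests = container["resources"].get("requests", {})
        cores = cpu_cores(requests.get("cpu", limits["cpu"]))
        if cores >= 1 and cores == cores.to_integral_value():
            return True
    return False


def pod(cpu_request, cpu_limit, memory="8Gi"):
    """Build a hypothetical single-container Pod for the examples below."""
    return [{"resources": {"requests": {"cpu": cpu_request, "memory": memory},
                           "limits": {"cpu": cpu_limit, "memory": memory}}}]


print(triggers_static_cpu_allocation(pod("4", "4")))        # True
print(triggers_static_cpu_allocation(pod("3", "4")))        # False (Burstable)
print(triggers_static_cpu_allocation(pod("750m", "750m")))  # False (fractional)
```

Passing every container of a Pod, including init containers, mirrors the Guaranteed QoS condition described above.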

CPU allocation with SUNK

Slurm is configured with a fixed CPU count for each Node. It doesn’t account for CPUs that the kubelet has pinned to Guaranteed QoS Pods. When static CPU allocation removes cores from the shared pool, the CPUs available to Slurm shrink to the configured total minus whatever the kubelet has pinned. Slurm continues to schedule jobs against the original count, which leads to resource contention. The following sequence shows how the issue occurs:

Kubernetes schedules a Pod with Guaranteed QoS and integer CPU requests onto a SUNK Node.
The CPU Manager in the kubelet pins specific CPU cores to the Pod using cpuset cgroups.
Slurm, unaware of the pinned cores, schedules a job onto the Node against its full configured CPU count.
The job encounters resource contention and fails.
Slurm drains the Node because of the job failure.
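
The CPU accounting gap behind this sequence can be worked through with a short calculation. The numbers are hypothetical (a Node that Slurm believes has 64 CPUs and one Pod pinned to 4 cores), and the sketch ignores any cores the kubelet reserves for system use.

```python
def shared_pool_after_pinning(slurm_configured_cpus: int, kubelet_pinned_cpus: int) -> int:
    """CPUs left in the shared pool once the kubelet pins cores to Guaranteed Pods."""
    return slurm_configured_cpus - kubelet_pinned_cpus


configured = 64  # hypothetical CPU count in the Slurm Node definition
pinned = 4       # hypothetical Pod with cpu: "4" for both requests and limits

available = shared_pool_after_pinning(configured, pinned)
shortfall = configured - available

print(f"Slurm schedules against {configured} CPUs, but only {available} are shareable")
print(f"Shortfall: {shortfall} CPUs")
```

In this sketch, a job that asks for all 64 CPUs would have to fit into 60, which is the contention the sequence describes.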

Common drain reasons

When static CPU allocation causes a job to fail from resource contention, Slurm drains the Node with the reason batch job complete failure. This message can be misleading because it doesn’t point to the underlying CPU pinning by the kubelet.

Prevent Node drains from static CPU allocation

To prevent static CPU allocation from triggering Node drains, ensure that Pods on SUNK Nodes use Burstable QoS instead of Guaranteed QoS. Burstable QoS prevents the kubelet from pinning CPU cores. The following sections describe how to switch Pods to Burstable QoS and verify the configuration.

Switch Pods to Burstable QoS

Set CPU requests lower than CPU limits to change the Pod’s QoS class from Guaranteed to Burstable:

resources:
  requests:
    cpu: "3"
    memory: 8Gi
  limits:
    cpu: "4"
    memory: 8Gi

You can also omit the CPU limits field altogether to produce Burstable QoS. With Burstable QoS in place, the kubelet no longer pins CPU cores, and Slurm can account for the full CPU capacity of the Node.

Verify a Pod’s QoS class

To check whether a Pod has Guaranteed or Burstable QoS, query its status. Replace [POD-NAME] and [NAMESPACE] with your Pod’s name and namespace:

kubectl get pod [POD-NAME] -n [NAMESPACE] -o jsonpath='{.status.qosClass}'

The output should be Burstable, not Guaranteed. To check all Pods on a specific Node, replace [NODE-NAME] with the Node’s name:

kubectl get pods -A --field-selector spec.nodeName=[NODE-NAME] -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'

Look for any Pods with Guaranteed QoS class. If you find any, update their resource specifications as described in Switch Pods to Burstable QoS.

Recover drained Nodes

If this issue has already drained Nodes, resolve the underlying CPU pinning before you return the Nodes to service. Update the Pod resource specifications so that CPU requests are lower than CPU limits, or omit the CPU limit. Once the Pod reschedules with Burstable QoS, the Node should recover. For instructions on undraining Nodes, see Drain and undrain Slurm Nodes. To identify Nodes drained by this issue, look for the drain reason listed in Common drain reasons:

sinfo -t drain -NO "NodeList:45,Reason:130" | grep "batch job complete failure"
