SUNK Pod Scheduler - CoreWeave Docs

This page explains how the SUNK Pod Scheduler integrates Slurm’s scheduling mechanism with Kubernetes so that Pods and Slurm jobs share the same Nodes and lifecycle. Read it to understand how scheduling, preemption, and termination flow between the two systems before you configure workloads that depend on this behavior.

The SUNK Pod Scheduler uses the Slurm cluster’s scheduling mechanism to schedule Kubernetes Pods. The scheduler places Pods on the same Kubernetes Nodes where the Slurm jobs run. Using the same logic, it can preempt Kubernetes workloads in favor of Slurm jobs on the same Node, or preempt Slurm jobs in favor of Pods. The SUNK Pod Scheduler also synchronizes Pod and Slurm job creation and deletion between Slurm and Kubernetes.

The SUNK Pod Scheduler operates as a conventional reconciler by watching Pod objects and events from the Slurm Controller. It updates the state of the Pod in Kubernetes and the associated Slurm job.

For general usage information, see Using the SUNK Pod Scheduler.

Schedule flow

The following description explains how a Pod moves from submission to a running state under the SUNK Pod Scheduler.

When a Pod is marked to schedule through the SUNK Pod Scheduler (.spec.schedulerName), the SUNK Pod Scheduler processes the Pod and attempts to schedule it.

The SUNK Pod Scheduler first validates the Pod to confirm that its annotations can be passed to the Slurm Controller through RPC. It doesn’t validate that the values are correct, only that they can be passed. It also verifies that the resource requests are non-zero. If validation fails, the scheduler doesn’t retry scheduling and records the reason as an event on the Pod.
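The validation step can be illustrated with a minimal sketch. This is not the scheduler’s actual implementation: the PodSpec class, its field names, and the "is a string" check used as a stand-in for "can be passed through RPC" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PodSpec:
    """Hypothetical, simplified stand-in for the Pod fields the text mentions."""
    annotations: dict = field(default_factory=dict)
    cpu_request: float = 0.0        # cores
    memory_request_bytes: int = 0


def validate_pod(pod: PodSpec) -> list:
    """Return a list of validation errors; an empty list means the Pod may proceed."""
    errors = []
    # Assumption: "can be passed" is modeled as "the value is a string".
    # Whether the value is meaningful to Slurm is deliberately not checked.
    for key, value in pod.annotations.items():
        if not isinstance(value, str):
            errors.append(f"annotation {key!r} cannot be passed: not a string")
    # Resource requests must be non-zero.
    if pod.cpu_request <= 0:
        errors.append("cpu request must be non-zero")
    if pod.memory_request_bytes <= 0:
        errors.append("memory request must be non-zero")
    return errors
```

In this sketch, a non-empty result corresponds to the case where scheduling isn’t retried and each message would be surfaced as an event on the Pod.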

A Slurm job isn’t considered running until it completes the prolog stage and the placeholder job script starts. The script appends : started to SLURM_JOB_EXTRA when it starts. The SUNK Pod Scheduler uses Node locking to ensure Nodes are ready for the placeholder jobs. This process is similar to normal Slurm jobs, but with strict checking.

To fully schedule a Pod, the reconciler processes the Pod object at least twice. The first pass creates the Slurm job and propagates the job ID back to the Pod. On the second pass, if the job is in the running state, the reconciler schedules the Pod.
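The two-pass behavior can be sketched as a small reconcile function. Everything here is hypothetical: the Pod and FakeSlurm classes, the state strings, and the return values are illustrative stand-ins and don’t reflect the real SUNK or Slurm APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pod:
    name: str
    job_id: Optional[int] = None  # propagated back to the Pod on the first pass
    node: Optional[str] = None    # set once the Pod is scheduled


class FakeSlurm:
    """Hypothetical in-memory stand-in for the Slurm Controller."""

    def __init__(self):
        self._next_id = 1
        self.jobs = {}  # job_id -> {"state": str, "node": Optional[str]}

    def submit_placeholder(self, pod_name: str) -> int:
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {"state": "PENDING", "node": None}
        return job_id

    def start(self, job_id: int, node: str) -> None:
        # Models the job finishing its prolog and the placeholder script starting.
        self.jobs[job_id] = {"state": "RUNNING", "node": node}


def reconcile(pod: Pod, slurm: FakeSlurm) -> str:
    """Process the Pod once; the caller invokes this again on later events."""
    if pod.node is not None:
        return "scheduled"
    if pod.job_id is None:
        # First pass: create the placeholder job and record its ID on the Pod.
        pod.job_id = slurm.submit_placeholder(pod.name)
        return "job-created"
    job = slurm.jobs[pod.job_id]
    if job["state"] == "RUNNING":
        # Later pass: the job is running, so place the Pod on the job's Node.
        pod.node = job["node"]
        return "scheduled"
    return "waiting"
```

Calling reconcile repeatedly on the same Pod yields "job-created" first, then "waiting" until the fake job is started, then "scheduled". This is why at least two passes are needed.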

Pods can become stuck after binding due to other Kubernetes constraints, such as taints or required affinities on the Kubernetes Node, that the SUNK Pod Scheduler isn’t aware of.

Unschedule flow

The following description explains how the SUNK Pod Scheduler tears down a workload when either side initiates termination. The SUNK Pod Scheduler handles both Pod deletion and Slurm job cancellation flows, so you can stop the workload from either Kubernetes or Slurm.

When the placeholder job receives a termination signal, it contacts the SUNK Pod Scheduler’s hook API and blocks until the Pod is deleted. This prevents Slurm from running another job before the Kubernetes Pod is fully deleted.

If the Pod is still running just before KillWait is reached, the script places the Node into Drain. This prevents scheduling further workloads on the Node until you resolve the issue. After the KillWait timeout value is reached, Slurm forcibly terminates the job.

Because the SUNK Pod Scheduler validates terminationGracePeriodSeconds when scheduling Pods, the Node is unlikely to be drained as a result of the Pod taking too long to delete.
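The blocking behavior of the placeholder job can be sketched as a polling loop with a deadline. This is only an illustration under stated assumptions: the function name, the callbacks pod_deleted and drain_node, the drain_margin_seconds parameter, and the polling approach are hypothetical and aren’t taken from the actual placeholder script or hook API.

```python
import time
from typing import Callable


def wait_for_pod_deletion(
    pod_deleted: Callable[[], bool],   # hypothetical: asks whether the Pod is gone
    drain_node: Callable[[], None],    # hypothetical: places the Node into Drain
    kill_wait_seconds: float,
    drain_margin_seconds: float = 1.0,  # assumed meaning of "just before KillWait"
    poll_interval_seconds: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the Pod is deleted, draining the Node shortly before KillWait.

    Returns True if the Pod was deleted in time, False if the Node was drained.
    """
    deadline = clock() + kill_wait_seconds - drain_margin_seconds
    while clock() < deadline:
        if pod_deleted():
            return True
        sleep(poll_interval_seconds)
    # One last check before giving up on a clean teardown.
    if pod_deleted():
        return True
    # The Pod is still running close to KillWait: keep new work off the Node.
    drain_node()
    return False
```

The clock and sleep parameters are injectable only so that the sketch can be exercised without real waiting.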
