The SUNK Pod Scheduler uses the Slurm cluster’s scheduling mechanism to schedule Kubernetes Pods. Using the same logic, it can preempt Kubernetes workloads in favor of Slurm jobs running on the same Node, or vice versa. Pods are scheduled on the same Kubernetes Nodes where the Slurm jobs run. The SUNK Pod Scheduler also facilitates synchronization of Pod and Slurm job creation and deletion between Slurm and Kubernetes. It operates as a conventional reconciler: it watches Pod objects and events from the Slurm Controller, then updates the state of the Pod in Kubernetes and the associated Slurm job. For general usage information, see Using the SUNK Pod Scheduler.
Schedule flow
When a Pod is marked to schedule via the SUNK Pod Scheduler (.spec.schedulerName), the SUNK Pod Scheduler will process the Pod and attempt to schedule it. The SUNK Pod Scheduler will validate the Pod to make sure the annotations are valid to pass to the Slurm Controller via RPC. It does not validate that the values are correct, only that they are valid to pass. It also verifies that the resource requests are non-zero.
If validation fails, the SUNK Pod Scheduler does not retry scheduling the Pod, and an event stating the reason is created on the Pod.
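The two checks described above can be illustrated with a minimal sketch. It is not the SUNK implementation: the function names are hypothetical, and the interpretation of "valid to pass" (string values without NUL bytes) is an assumption made only for illustration.

```python
import re


def is_zero_quantity(quantity) -> bool:
    """True when a Kubernetes-style quantity ("0", "0m", "") has no positive numeric part."""
    match = re.match(r"\s*([0-9]*\.?[0-9]+)", str(quantity))
    return match is None or float(match.group(1)) == 0.0


def validate_pod(pod: dict) -> list:
    """Hypothetical pre-scheduling check; an empty list means the Pod may proceed."""
    problems = []

    # Assumed meaning of "valid to pass": string values with no NUL bytes.
    # The values themselves are not checked for correctness.
    annotations = pod.get("metadata", {}).get("annotations", {})
    for key, value in annotations.items():
        if not isinstance(value, str) or "\x00" in value:
            problems.append(f"annotation {key!r} cannot be passed over RPC")

    # Resource requests must be non-zero.
    for container in pod.get("spec", {}).get("containers", []):
        requests = container.get("resources", {}).get("requests", {})
        for resource, quantity in requests.items():
            if is_zero_quantity(quantity):
                problems.append(
                    f"container {container.get('name')!r} requests zero {resource}"
                )

    return problems
```

In this sketch, a non-empty result corresponds to the case where scheduling is not retried and the reasons would be surfaced as an event on the Pod.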
A Slurm job is not considered running until it has completed the prolog stage and the placeholder job script has started. The script appends : started to SLURM_JOB_EXTRA when it starts. The SUNK Pod Scheduler uses node locking to ensure nodes are ready for the placeholder jobs. This process is similar to that of normal Slurm jobs, but with strict checking.
Pods may become stuck after binding because of Kubernetes constraints that the SUNK Pod Scheduler is not aware of, such as taints or required affinities on the Kubernetes Node.
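The running condition described above (prolog complete and placeholder script started) could be expressed roughly as follows. The function name is hypothetical, and the sketch assumes the marker appears at the end of the job's extra field exactly as written in the text; the real field format may differ.

```python
STARTED_MARKER = ": started"  # marker text as quoted in the documentation


def placeholder_started(job_state: str, job_extra: str) -> bool:
    """Treat the job as running only when Slurm reports RUNNING and the marker is present.

    Assumption: the placeholder script appends the marker to the end of the
    extra field, so a suffix check is sufficient.
    """
    return job_state == "RUNNING" and job_extra.rstrip().endswith(STARTED_MARKER)
```

A job that Slurm reports as RUNNING but whose extra field lacks the marker would still be treated as not yet running under this sketch.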
Unschedule flow
The SUNK Pod Scheduler handles both Pod deletion and Slurm job cancellation flows, so the workload can be stopped from either Kubernetes or Slurm. When a termination signal is received, the placeholder job contacts the SUNK Pod Scheduler’s hook API and blocks until the Pod is deleted. This prevents Slurm from running another job before the Kubernetes Pod is fully deleted. If the Pod is still running just before KillWait is reached, the script places the node into Drain. This prevents further workloads from being scheduled on the node until the issue has been resolved. After the KillWait timeout is reached, Slurm forcibly kills the job.
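The timing of this flow can be sketched as a small loop. This is an illustration only: the real placeholder script blocks on the hook API, whereas this sketch polls a caller-supplied `pod_deleted` callback, and the function name, the `drain_margin_seconds` parameter, and its default are hypothetical.

```python
import time
from typing import Callable


def wait_for_pod_deletion(
    pod_deleted: Callable[[], bool],
    drain_node: Callable[[], None],
    kill_wait_seconds: float,
    drain_margin_seconds: float = 5.0,   # assumed meaning of "just before KillWait"
    poll_interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the Pod is gone; drain the node if KillWait is about to expire.

    Returns True if the Pod was deleted in time, False if the node was drained.
    """
    deadline = clock() + kill_wait_seconds - drain_margin_seconds
    while clock() < deadline:
        if pod_deleted():
            return True
        sleep(poll_interval_seconds)
    # Final check before giving up, so a Pod deleted during the last sleep is not missed.
    if pod_deleted():
        return True
    drain_node()
    return False
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting; in either outcome, Slurm still enforces KillWait itself.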
Because Pods scheduled via the SUNK Pod Scheduler have the terminationGracePeriodSeconds validated when scheduling, the Node is unlikely to be drained as a result of the Pod taking too long to delete.
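The text does not state what terminationGracePeriodSeconds is validated against. One plausible rule, assumed here purely for illustration, is that the grace period must fit within the Slurm KillWait window. The function name and the `margin_seconds` parameter are hypothetical.

```python
def grace_period_fits(pod_spec: dict, kill_wait_seconds: int, margin_seconds: int = 0) -> bool:
    """Assumed rule: the Pod's grace period plus a margin must not exceed KillWait."""
    # Kubernetes applies a default of 30 seconds when the field is unset.
    grace = pod_spec.get("terminationGracePeriodSeconds", 30)
    return grace + margin_seconds <= kill_wait_seconds
```

Under this assumed rule, a Pod with a 120-second grace period would be rejected on a cluster where KillWait is 60 seconds, which is the situation that would otherwise risk draining the Node.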