Operation overview
SUNK contains a built-in prolog script that verifies the Kubernetes Nodes are ready for Slurm Jobs, and locks the Nodes from accepting additional Kubernetes workloads before the job starts. The prolog script calls the Syncer’s hook API, which then triggers a set of state updates through the Pod Controller and Node Controller to complete the process. Both normal Slurm jobs and placeholder jobs created by the SUNK Pod Scheduler use this prolog script.
Node locking taints and annotations
In this process, SUNK uses Kubernetes taints on Nodes and annotations on Pods to implement the locking. The taint key used on the Node is sunk.coreweave.com/lock.
The Pod annotation uses the same key as the Node taint, with the following values:
- false
- pending
- locked
- locked_strict
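To make the key and its values concrete, here is a minimal Python sketch that reads the lock taint from a Node and the lock annotation from a Pod. It works on plain dictionaries shaped like Kubernetes API objects (`spec.taints` on a Node, `metadata.annotations` on a Pod). The names `LockValue`, `node_has_lock_taint`, and `pod_lock_value` are hypothetical helpers for illustration, not part of SUNK. The taint's value and effect are not stated here, so the sketch matches on the key only.

```python
from enum import Enum
from typing import Optional

# Key shared by the Node taint and the Pod annotation.
LOCK_KEY = "sunk.coreweave.com/lock"


class LockValue(str, Enum):
    """The four annotation values listed above (hypothetical enum name)."""
    FALSE = "false"
    PENDING = "pending"
    LOCKED = "locked"
    LOCKED_STRICT = "locked_strict"


def node_has_lock_taint(node: dict) -> bool:
    """Return True if the Node object carries a taint with the lock key."""
    taints = node.get("spec", {}).get("taints") or []
    return any(taint.get("key") == LOCK_KEY for taint in taints)


def pod_lock_value(pod: dict) -> Optional[LockValue]:
    """Return the Pod's lock annotation, or None if it is not set.

    Raises ValueError if the annotation holds a value outside the four
    listed above.
    """
    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(LOCK_KEY)
    return None if raw is None else LockValue(raw)
```

With objects fetched from the API server and converted to dictionaries, `node_has_lock_taint(node)` reports whether the lock taint is present, and `pod_lock_value(pod)` returns the parsed annotation.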
Locking
During locking, two different operations proceed at the same time:
- SUNK propagates the Slurm node state to the Kubernetes Node and updates the lock, as described in the operation overview.
- The Syncer pre-hook performs a blocking check of the lock state.
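A blocking check of this kind can be pictured as a polling loop that waits for the concurrent state update to land. The sketch below is an illustration only, not the Syncer's implementation: `wait_for_lock` and `read_state` are hypothetical names, and both the timeout behavior and the assumption that `locked` and `locked_strict` are the states the check waits for are guesses, not documented behavior.

```python
import time
from typing import Callable, Iterable


def wait_for_lock(
    read_state: Callable[[], str],
    accepted: Iterable[str] = ("locked", "locked_strict"),  # assumption
    timeout_s: float = 30.0,   # assumption: the real timeout is not documented here
    interval_s: float = 0.5,   # assumption: polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Block until read_state() returns an accepted value or the timeout expires.

    read_state is a caller-supplied function that returns the current lock
    value, for example by reading the annotation from the API server.
    Returns True if an accepted state was observed, False on timeout.
    """
    accepted_set = set(accepted)
    deadline = clock() + timeout_s
    while True:
        if read_state() in accepted_set:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Because the lock update and the check run concurrently, the loop may observe intermediate values such as `pending` before the lock is in place. Polling until an accepted state appears, rather than reading once, is what makes the check tolerate that ordering.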