October 15, 2025 - SUNK v6.9.0 release
Overview
SUNK v6.9.0 introduces automatic job requeueing during rolling upgrades, new configuration options for `slurmrestd`, improved resource optimization, and a new command alias.
Automatic requeueing during rolling upgrades
When a node drains for an upgrade, you can now configure jobs to automatically requeue based on specific criteria, including:
- Time-based preemption (`time`): Requeue running jobs on a node with a pending upgrade after a specified amount of time.
- Partition-based preemption (`partition`): Requeue running jobs on a node with a pending upgrade that run on a specified partition.
- Quality of Service (QoS) based preemption (`qos`): Requeue running jobs on a node with a pending upgrade that belong to a specified QoS.
These options are available in the `config.syncer.nodesetUpdateJobPreemption` parameter of the `slurm` Helm chart. See the following sections for examples of how to configure each method.
Partition-based example
To enable partition-based requeueing, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `partition`.
- Add a `partitions` field and specify the relevant partitions.
The resulting configuration will resemble the following example:
```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: partition
    partitions: cpu-epyc,all
```
Time-based example
To enable time-based requeueing, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `time`.
- Add a `timeLimit` field and set it to your preferred amount of time, in seconds.
The resulting configuration will resemble the following example:
```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: time
    timeLimit: 30s
```
QoS-based example
To enable requeueing based on QoS, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `qos`.
- Add a `qos` field and specify the relevant QoS names.
The resulting configuration will resemble the following example:
```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: qos
    qos: service,normal
```
Additional customization for slurmrestd
New configuration options are available for `slurmrestd`, including options to specify service types, ports, deployment containers, and deployment volumes/volumeMounts.
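As a rough sketch, overrides for these options might sit under the `slurmrestd` section of your values file. The key names below (`service`, `extraVolumes`, `extraVolumeMounts`) are assumptions for illustration only and are not confirmed by this release note; check the chart's values for the actual paths.
```yaml
# Illustrative values layout -- key names are assumed, not confirmed by the chart.
slurmrestd:
  service:
    type: ClusterIP            # override the Service type
    port: 6820                 # 6820 is the conventional slurmrestd port
  extraVolumes:                # additional deployment volumes
    - name: jwt-key
      secret:
        secretName: slurm-jwt-key
  extraVolumeMounts:           # corresponding volumeMounts on the container
    - name: jwt-key
      mountPath: /etc/slurm/jwt
      readOnly: true
```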
Expose container ports for metrics scraping
The `compute.ports` configuration option has been added to the `slurm` Helm chart with a default empty list. This allows for configuration of additional container ports for Compute nodes to support metrics scraping.
Resource optimization
The default CPU and memory resource requests for the `slurmrestd` deployment have been reduced, and the number of default replicas has been lowered to one. This change can be overridden in the Helm chart if higher values are required.
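If you need more headroom, you can raise these values in your own values file. A minimal sketch is shown below; the key paths (`slurmrestd.replicas`, `slurmrestd.resources`) are assumptions for illustration, and the numbers are examples rather than recommendations.
```yaml
# Illustrative overrides -- key paths and values are assumptions.
slurmrestd:
  replicas: 2                  # new default is 1; raise if needed
  resources:
    requests:
      cpu: "1"                 # example values only
      memory: 1Gi
```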
New command alias
The `undrain-prolog` command alias is now available in the `bashrc` file. This command automatically finds and undrains nodes that were placed in a `drain` state by the prolog pre-hook mechanism, indicated by the drain reason `prolog pre-hook failed`.
Automatic restarts of syncer, scheduler, and SUNK operator deployments on ConfigMap changes
Updates made to the ConfigMap for the syncer, scheduler, or SUNK operator now trigger an automatic restart of the associated components. Manual restarts are no longer required to pick up changes made to the ConfigMap.
Other changes
- Improved retry logic of the `hooksapi`