October 15, 2025 - SUNK v6.9.0 release
Overview
SUNK v6.9.0 introduces automatic job requeueing during rolling upgrades, new configuration options for slurmrestd, improved resource optimization, and a new command alias.
Automatic requeueing during rolling upgrades
When a node drains for an upgrade, you can now configure jobs to automatically requeue based on specific criteria, including:
- Time-based preemption (`time`): Requeue running jobs on a node with a pending upgrade after a specified amount of time.
- Partition-based preemption (`partition`): Requeue running jobs on a node with a pending upgrade that run on a specified partition.
- Quality of Service (QoS)-based preemption (`qos`): Requeue running jobs on a node with a pending upgrade that belong to a specified QoS.
These options are available in the `config.syncer.nodesetUpdateJobPreemption` parameter of the slurm Helm chart. See the following sections for examples of how to configure each method.
Partition-based example
To enable partition-based requeueing, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `partition`.
- Add a `partitions` field and specify the relevant partitions.
The resulting configuration will resemble the following example:
```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: partition
    partitions: cpu-epyc,all
```
Time-based example
To enable time-based requeueing, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `time`.
- Add a `timeLimit` field and set it to your preferred amount of time, in seconds.
The resulting configuration will resemble the following example:
```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: time
    timeLimit: 30s
```
QoS-based example
To enable requeueing based on QoS, edit the following values in `syncer.nodesetUpdateJobPreemption`:
- Change the value of `enabled:` to `true`.
- Set `method:` to `qos`.
- Add a `qos` field and specify the relevant QoS names.

The resulting configuration will resemble the following example:

```yaml
syncer:
  nodesetUpdateJobPreemption:
    enabled: true
    method: qos
    qos: service,normal
```
Additional customization for slurmrestd
New configuration options are available for `slurmrestd`, including options to specify service types, ports, deployment containers, and deployment volumes and volumeMounts.
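As an illustration, a values override using these options might resemble the following sketch. All key names and values below are assumptions based on common Helm chart conventions, not confirmed fields of the slurm chart; consult the chart's values reference for the exact schema.

```yaml
# Illustrative sketch only: key names are assumptions, not confirmed chart fields.
slurmrestd:
  service:
    type: ClusterIP                  # example service type
    port: 6820                       # slurmrestd's conventional listen port
  containers:
    - name: example-sidecar          # hypothetical additional container
      image: registry.example.com/sidecar:1.0
  volumes:
    - name: extra-config
      configMap:
        name: my-slurmrestd-config   # hypothetical ConfigMap
  volumeMounts:
    - name: extra-config
      mountPath: /etc/slurmrestd.d   # hypothetical mount path
```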
Expose container ports for metrics scraping
The `compute.ports` configuration option has been added to the slurm Helm chart; it defaults to an empty list. It allows additional container ports to be configured on Compute nodes to support metrics scraping.
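For example, a values override exposing a metrics port might look like the following sketch. The port name and number are illustrative choices, not chart defaults:

```yaml
compute:
  ports:
    - name: metrics        # illustrative port name
      containerPort: 9100  # e.g. a node-exporter-style metrics endpoint
      protocol: TCP
```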
Resource optimization
The default CPU and memory resource requests for the `slurmrestd` deployment have been reduced, and the number of default replicas has been lowered to one. These defaults can be overridden in the Helm chart if higher values are required.
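If the lower defaults are insufficient, they can be raised in a values override. The key names below (`replicas`, `resources`) follow common chart conventions and are assumptions, as are the specific values shown:

```yaml
slurmrestd:
  replicas: 2        # default is now 1
  resources:
    requests:
      cpu: 500m      # illustrative values; raise as needed for your workload
      memory: 512Mi
```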
New command alias
The `undrain-prolog` command alias is now available in the bashrc file. This command automatically finds and undrains nodes that were placed in a drain state by the prolog pre-hook mechanism, indicated by the drain reason `prolog pre-hook failed`.
Automatic restarts of syncer, scheduler, and SUNK operator deployments on ConfigMap changes
Updates made to the ConfigMap for the syncer, scheduler, or SUNK operator now trigger an automatic restart of the associated components. Manual restarts are no longer required to pick up changes made to the ConfigMap.
New registry and image tag formats for Slurm images
Slurm images have moved from registry.gitlab.com/coreweave/sunk and are now hosted on our new registry at docker.artifacts.coreweave.com/slurm-containers-public. Starting with this release, the image tagging convention is more tightly aligned with Slurm versions. The new tag format is `<base Slurm version>-coreweave.<CoreWeave patch version>-<base OS version>`. For example, this release uses `24.11.5-coreweave.1-ubuntu22.04`. Components such as the login pod, Slurm control plane, and Slurm compute nodes use this registry and image format.
SUNK images will remain at registry.gitlab.com/coreweave/sunk, so components like the SUNK controller, syncer, and scheduler will continue to use that registry and image format.
Other changes
- Improved retry logic of the `hooks` API