
April 4, 2025 - SUNK v6.1.0 release

SUNK v6.1.0 released with enhanced monitoring capabilities and infrastructure improvements


Overview

SUNK v6.1.0 introduces significant enhancements to monitoring, infrastructure management, and scheduler functionality. This release focuses on improving observability, resource management, and operational efficiency.

Key changes

Charts and infrastructure improvements

The charts component adds enroot hooks for enhanced container support and improves priority class management. A new priority class is added specifically for the SUNK control plane, and the default priority class for the Slurm control plane is updated for better resource allocation. The GB200 GRES configuration is updated to address SUNK-643, improving GPU resource management.

Enhanced monitoring and metrics

The operator component gains custom metrics for Slurm login monitoring, providing better visibility into login node performance and health. The slurm_node_state metric now includes partition labels, enabling more granular monitoring and analysis of cluster resources. These improvements allow operators to better track and troubleshoot cluster performance issues.
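As a rough illustration of what the partition label makes possible, the sketch below groups slurm_node_state samples by partition and state. It is not taken from SUNK itself: the sample text, the label names `partition`, `state`, and `node`, and the helper `nodes_per_partition` are all assumptions made for the example, and the real metric may expose different labels or values.

```python
import re
from collections import Counter

# Hypothetical scrape output in Prometheus text format.
# The label names and values here are assumptions, not SUNK's documented schema.
SAMPLE = """\
# TYPE slurm_node_state gauge
slurm_node_state{node="n1",partition="gpu",state="idle"} 1
slurm_node_state{node="n2",partition="gpu",state="allocated"} 1
slurm_node_state{node="n3",partition="cpu",state="idle"} 1
"""

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def nodes_per_partition(text, metric="slurm_node_state"):
    """Sum sample values per (partition, state) pair for one metric."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        key = (labels.get("partition", ""), labels.get("state", ""))
        counts[key] += float(match.group(3))
    return dict(counts)


if __name__ == "__main__":
    for (partition, state), total in sorted(nodes_per_partition(SAMPLE).items()):
        print(partition, state, total)
```

In practice the same grouping would more likely be done in the monitoring system's own query language. The Python version is only meant to show the kind of per-partition breakdown the new label allows.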

Scheduler and login management

The scheduler receives a timeout annotation enhancement, improving reliability and preventing hanging operations. In the slurm-login component, the login name labels on individual pods are now unique, which makes the pods easier to identify and manage. Because of this change, individual login pods must be restarted before any changes or new pods can be applied. Version 6.4.0 updates the reconcile logic, and after upgrading to that version or higher, updates to login pods can be applied individually.

Slurm integration enhancements

The scheduler now exposes Slurm group ID annotations, providing better integration between the Kubernetes scheduler and Slurm workload manager. This enhancement improves the coordination between the two systems and enables more sophisticated scheduling decisions.

Migration notes

When upgrading to SUNK v6.1.0, note that the slurm-login component changes require a restart of individual login pods before any changes or new pods can be applied. For the best experience with login pod updates, consider upgrading to version 6.4.0 or higher, which includes updated reconcile logic for individual pod management.
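The version guidance above can be expressed as a small check, for example in an upgrade runbook script. This is only a sketch: the helper names `parse_version` and `login_pods_update_individually` are hypothetical, and it assumes plain `major.minor.patch` version strings with an optional leading `v`.

```python
def parse_version(version):
    """Turn 'v6.1.0' or '6.1.0' into a comparable tuple such as (6, 1, 0)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def login_pods_update_individually(version):
    """Return True for versions with the updated reconcile logic (6.4.0 or higher)."""
    return parse_version(version) >= (6, 4, 0)


if __name__ == "__main__":
    for candidate in ("v6.1.0", "v6.4.0"):
        if login_pods_update_individually(candidate):
            print(candidate, "login pod updates can be applied individually")
        else:
            print(candidate, "restart individual login pods before applying changes")
```

Comparing tuples of integers rather than raw strings avoids misordering versions such as 6.10.0 and 6.4.0.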
