April 4, 2025 - SUNK v6.1.0 release
SUNK v6.1.0 released with enhanced monitoring capabilities and infrastructure improvements
Overview
SUNK v6.1.0 introduces significant enhancements to monitoring, infrastructure management, and scheduler functionality. This release focuses on improving observability, resource management, and operational efficiency.
Key changes
Charts and infrastructure improvements
The charts component adds enroot hooks for enhanced container support and revises priority class management. A new priority class is added specifically for the SUNK control plane, and the default priority class for the Slurm control plane is updated for better resource allocation. The GB200 GRES configuration is also updated to address SUNK-643, improving GPU resource management.
Enhanced monitoring and metrics
The operator component gains custom metrics for Slurm login monitoring, providing better visibility into login node performance and health. The slurm_node_state metric now includes partition labels, enabling more granular monitoring and analysis of cluster resources. These improvements allow operators to better track and troubleshoot cluster performance issues.
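With partition labels on slurm_node_state, node states can be grouped per partition. The sketch below assumes the metric is exposed in Prometheus text exposition format and counts node states per partition from a sample. The label names (node, partition, state) and the sample values are assumptions for illustration only. These notes confirm only that partition labels exist, so check the actual metric output on your cluster before relying on these names.

```python
import re
from collections import defaultdict

# Hypothetical sample; label names and values are assumed, not taken from SUNK.
SAMPLE = """\
slurm_node_state{node="n1",partition="gpu",state="idle"} 1
slurm_node_state{node="n2",partition="gpu",state="allocated"} 1
slurm_node_state{node="n3",partition="cpu",state="idle"} 1
"""

# Matches one sample line of the slurm_node_state metric.
LINE_RE = re.compile(r'^slurm_node_state\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
# Matches a single key="value" label pair.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def count_by_partition(text):
    """Sum slurm_node_state values per partition and per state."""
    counts = defaultdict(lambda: defaultdict(float))
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        partition = labels.get("partition", "unknown")
        state = labels.get("state", "unknown")
        counts[partition][state] += float(match.group("value"))
    return {partition: dict(states) for partition, states in counts.items()}


if __name__ == "__main__":
    for partition, states in count_by_partition(SAMPLE).items():
        print(partition, states)
```

In a real deployment the same grouping would more likely be done in the monitoring system's query language. The Python version is only meant to show what the extra label makes possible.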
Scheduler and login management
The scheduler receives a timeout annotation enhancement, improving reliability and preventing hanging operations. The slurm-login component changes the login name labels on individual pods so that each label is unique, which makes the pods easier to identify and manage. This change requires a restart of individual login pods before any changes or new pods can be applied. Version 6.4.0 updates the reconcile logic, and upgrading to that version or higher allows updates to login pods to be applied individually.
Slurm integration enhancements
The scheduler now exposes Slurm group ID annotations, providing better integration between the Kubernetes scheduler and Slurm workload manager. This enhancement improves the coordination between the two systems and enables more sophisticated scheduling decisions.
Migration notes
When upgrading to SUNK v6.1.0, note that the slurm-login component changes require a restart of individual login pods before any changes or new pods can be applied. For the best experience with login pod updates, consider upgrading to version 6.4.0 or higher, which includes updated reconcile logic for individual pod management.
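An upgrade script could encode the version threshold above as a simple check. The sketch below assumes SUNK versions are plain MAJOR.MINOR.PATCH strings with an optional leading "v". The function names are hypothetical and are not part of SUNK.

```python
# Lowest version whose reconcile logic applies login pod updates individually,
# per the migration notes above.
INDIVIDUAL_UPDATE_MIN = (6, 4, 0)


def parse_version(version):
    """Turn 'v6.4.0' or '6.4.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def supports_individual_login_updates(version):
    """Return True if this version can update login pods individually."""
    return parse_version(version) >= INDIVIDUAL_UPDATE_MIN


if __name__ == "__main__":
    for candidate in ("v6.1.0", "v6.4.0"):
        print(candidate, supports_individual_login_updates(candidate))
```

Comparing integer tuples rather than raw strings keeps the ordering correct for versions such as 6.10.0, which a string comparison would sort before 6.4.0.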