June 13, 2025 - SUNK v6.5.0 release
SUNK v6.5.0 released with Slurm upgrades, CUDA image support, and enhanced container awareness features
SUNK v6.5.0 has been released with significant Slurm upgrades, new CUDA image support, and enhanced container awareness features for improved workload management and system stability.
Overview
SUNK v6.5.0 introduces major improvements to the Slurm workload manager, expands CUDA image support, and adds new container awareness capabilities. This release focuses on enhancing the user experience for both traditional HPC workloads and modern containerized applications while maintaining backward compatibility.
Key changes
Slurm workload manager upgrades
Slurm upgrade to 24.11.05: The core Slurm installation has been upgraded to version 24.11.05.
Backport Slurm 25.05 features: Several key features from Slurm 25.05 have been backported, including enhanced scheduling algorithms and improved resource management capabilities that provide better performance for complex workload scenarios.
CUDA image support expansion
New CUDA images for 12.8.1 and 12.9.0: SUNK v6.5.0 now supports CUDA images for versions 12.8.1 and 12.9.0, enabling users to leverage the latest NVIDIA CUDA capabilities for machine learning, scientific computing, and other GPU-accelerated workloads.
Enhanced container awareness
SlurmdSpecOverride and container awareness features: New container awareness features correctly configure CPUSpecList and MemSpecList, allowing static pod workloads to run without entering an invalid state after an scontrol reconfigure operation.
Improved pod management: The system now provides better integration between Slurm job scheduling and Kubernetes pod management, ensuring that containerized workloads receive appropriate resource allocations and scheduling priorities.
Directory services improvements
NSS cache as an option for SSSD: Added support for NSS cache as an optional component for SSSD, providing improved performance for user authentication and directory lookups in large-scale deployments.
Advanced node management
Timeout-based forced deletion: Added an option for timeout-based forced deletion of nodeset pods even when a job is running on the node. This feature is disabled by default and provides administrators with additional control over node lifecycle management in emergency situations.
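The decision this option describes can be sketched as follows. This is a conceptual illustration only, not SUNK's implementation: the function name, its parameters, and the assumption that a running job otherwise blocks deletion are all hypothetical, and the actual setting name is not given here.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_delete_pod(
    deletion_requested_at: datetime,
    now: datetime,
    force_timeout: Optional[timedelta],
    job_running: bool,
) -> bool:
    """Decide whether a nodeset pod pending deletion may be removed.

    force_timeout=None models the default, where forced deletion is disabled.
    All names here are hypothetical and used for illustration only.
    """
    if not job_running:
        # No job to protect, so deletion can proceed normally.
        return True
    if force_timeout is None:
        # Feature disabled (assumed default behavior): the running job
        # keeps blocking deletion indefinitely.
        return False
    # Feature enabled: the job blocks deletion only until the timeout expires.
    return now - deletion_requested_at >= force_timeout
```

With the feature enabled, a pod on a busy node is kept until the timeout elapses and is then removed regardless of the job, which is why the option is best reserved for emergencies.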
Configuration management enhancements
Multiple configmap support: The controller.etcConfigMap value can now be set to either a single string or a list of configmap names, so configuration can be split across several configmaps instead of being kept in one.
Custom segment visualization: A new custom script called segment-calc has been added for visualizing segments with block topology inside Slurm, helping administrators and users better understand resource allocation and scheduling decisions.
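For the controller.etcConfigMap change, tooling that reads your values may need to accept both shapes. A minimal sketch of treating them uniformly is shown below; the function name and the configmap names are hypothetical, and this is not the chart's own logic.

```python
from typing import List, Union


def normalize_config_maps(value: Union[str, List[str]]) -> List[str]:
    """Return configmap names as a list, whichever form was supplied."""
    if isinstance(value, str):
        # Single-string form: wrap it so callers always see a list.
        return [value]
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        # List form: return a copy so the caller's list is not mutated.
        return list(value)
    raise TypeError("expected a configmap name or a list of configmap names")


# Hypothetical configmap names, shown in both accepted shapes.
single = normalize_config_maps("slurm-etc")
multiple = normalize_config_maps(["slurm-etc-base", "slurm-etc-site"])
```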
Security and authentication improvements
SSH authentication hardening: The default value of PasswordAuthentication in sshd is now no, which improves security by requiring key-based authentication by default.
Namespace label management: Added management of the ns.coreweave.cloud/managed namespace label, providing better control over namespace lifecycle and resource management.
Monitoring and observability
VMPodscrape option: Added an option for VMPodscrape for metric gathering, enhancing the monitoring capabilities and providing better visibility into system performance and resource utilization.
Configuration changes
Default settings
- Enhanced Slurm configurations with 24.11.05 features
- New CUDA image support for versions 12.8.1 and 12.9.0
- Improved container awareness and pod management
- Enhanced security defaults for SSH authentication
- Better monitoring and metric collection capabilities
Migration notes
Existing SUNK v6.x deployments will continue to work, but you may want to:
- Review the new Slurm 24.11.05 features and configurations
- Test compatibility with existing CUDA workloads
- Verify container awareness features work with your current deployments
- Update any custom monitoring scripts to leverage new VMPodscrape capabilities
- Review SSH authentication configurations if you rely on password authentication
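For the SSH item above, one way to audit an sshd configuration you maintain is to check which PasswordAuthentication value it sets globally. The sketch below is a simplified check, not a full sshd_config parser: the function name is hypothetical, and it deliberately ignores Include directives and stops at the first Match block.

```python
from typing import Optional


def effective_password_auth(config_text: str) -> Optional[str]:
    """Return the first global PasswordAuthentication value, or None.

    None means the keyword is not set before any Match block, so the
    built-in or deployment default applies.
    """
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # sshd_config allows "Keyword value" or "Keyword=value".
        parts = line.replace("=", " ", 1).split()
        if not parts:
            continue
        keyword = parts[0].lower()  # keywords are case-insensitive
        if keyword == "match":
            # Conditional blocks are out of scope for this sketch.
            break
        if keyword == "passwordauthentication" and len(parts) > 1:
            # sshd uses the first value it obtains for each keyword.
            return parts[1]
    return None
```

If this returns None for your configuration, password logins now depend on the new default of no, so set the value explicitly if you still need them.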
Documentation
For detailed information about configuring and using SUNK v6.5.0, see: