July 12, 2025
SUNK v6.6.0 release notes
New features
Area | Update |
---|---|
Identity & Access | SCIM provisioning for SUNK is now available via nsscache. This enables automated, standards-based user and group management from your IdP to CoreWeave clusters. See SCIM setup. |
Job Monitoring | Slurm job and node outputs now include direct links to their corresponding Grafana dashboards, giving operators one-click visibility into live job metrics. |
Instance Types | Added two new compute definitions: • rtxp6000-8x: NVIDIA RTX Pro 6000 Blackwell Server Edition • gb300-4x: NVIDIA GB300 |
Observability | • Slurm metrics now carry the slurm_cluster label, simplifying multi-cluster dashboards. • MySQL exporter metrics are automatically scraped and ingested. • Enhanced segment-calc script to respect partition filters and job exclusivity, making block-scheduling heat-maps more accurate. |
Benchmarking | NCCL-test base image updated to nccl-tests/d5a135d, ensuring compatibility with the latest CUDA toolchain. |
Improvements
- Nodes that stay "busy" inside a reservation are automatically re-evaluated after 30 minutes, reducing orphaned allocations.
- CoreWeave IAM is now fully integrated with the Slurm Helm chart.
- Optional SSSD mounts are intelligently gated, reducing unnecessary container overhead.
Fixes
- Disabled NVIDIA device-plugin health checks that could cause false node drains.
- Segment-calc now skips nodes already in DRAIN state to prevent skewed capacity charts.
- PodMonitor and VMPodScrape templates now use consistent relabeling syntax.
- Removed the InfiniBand requirement for A100-based nodes where it is not present.
- Multiple operator dependencies updated (chi v5, viper v2, Go Slurm) to incorporate upstream security and stability patches.
Upgrade notes
SCIM setup for SUNK
Suggested SCIM settings are found in the Slurm chart's values-cw.yaml at nsscache.nsscacheConfig.
To set up SCIM provisioning for SUNK, provide your SCIM auth token in a Kubernetes Secret. This token is used to authenticate with your IdP.
- In the Kubernetes Secret, set the value of the nsscache-scim-auth-token key to your token.
- Set nsscache.existingSecret in the values-cw.yaml file of the slurm chart to the name of the Secret.
- Set nsscache.nsscacheConfig.default.base_url in the values-cw.yaml file of the slurm chart to the base URL of your SCIM server, such as https://api.coreweave.com/scim/<org>.
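The steps above can be sketched as two YAML fragments. This is a minimal illustration, not the chart's documented layout: the Secret name sunk-scim-token is hypothetical, the Secret is assumed to live in the same namespace as the slurm release, and the nested keys are simply the dotted paths from the steps expanded into YAML.

```yaml
# Kubernetes Secret holding the SCIM auth token.
# "sunk-scim-token" is a hypothetical name chosen for this example.
apiVersion: v1
kind: Secret
metadata:
  name: sunk-scim-token
type: Opaque
stringData:
  # Key name taken from the steps above; the value is a placeholder.
  nsscache-scim-auth-token: "<your-scim-auth-token>"
```

```yaml
# Fragment of values-cw.yaml for the slurm chart (all other keys omitted).
nsscache:
  # Must match metadata.name of the Secret above.
  existingSecret: sunk-scim-token
  nsscacheConfig:
    default:
      # Base URL of your SCIM server; <org> is a placeholder.
      base_url: "https://api.coreweave.com/scim/<org>"
```

Using stringData lets you supply the token as plain text; Kubernetes base64-encodes it into the Secret's data field when the object is created.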
NSSCache configuration
SCIM provisioning uses the nsscache component. When nsscache is enabled, it's advised to disable SSSD by adjusting the following settings:
- Set sssdContainer.enabled: false in the values.yaml of the slurm chart.
- Set directoryCache.source: nsscache in the values.yaml of the slurm-login chart.
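As a rough sketch, the two settings above expand to the following values.yaml fragments. Only the keys named in this section are shown; the nesting is an assumption derived from the dotted paths, and any setting that enables nsscache itself is left out because these notes do not name it.

```yaml
# values.yaml for the slurm chart (fragment)
sssdContainer:
  enabled: false
```

```yaml
# values.yaml for the slurm-login chart (fragment)
directoryCache:
  source: nsscache
```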
NVIDIA health checks
NVIDIA's device-plugin health reporting now defaults to false. If you rely on NVIDIA's device-plugin health reporting, re-enable these checks by setting device-plugin.healthCheck: true.
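For illustration, re-enabling the checks might look like the fragment below, assuming the dotted path maps directly onto nested YAML keys. These notes do not say which chart's values file holds the setting, so treat its location as something to confirm against your deployment.

```yaml
# Values fragment (assumed nesting of device-plugin.healthCheck)
device-plugin:
  healthCheck: true
```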