The changelog encompasses any and all changes to CoreWeave products and new products or features. This page surfaces all customer-facing product changes with links to relevant documentation and more detailed release notes, where applicable.
For changelog entries prior to December 2024, please see the CoreWeave Classic documentation.
The supported GPU drivers have changed. See Supported driver versions for the full compatibility table.
- GB200 NVL72-powered instances
  - Drivers removed: 570
- A100
  - Drivers removed: 570
- H100 (InfiniBand)
  - Drivers removed: 570
- H200 (InfiniBand)
  - Drivers removed: 570
- L40
  - Drivers removed: 570
- L40S
  - Drivers removed: 570
- RTX Pro 6000 Blackwell Server Edition
  - Drivers removed: 570
The supported GPU drivers have changed. See Supported driver versions for the full compatibility table.
- B200 (InfiniBand)
  - New default: 580→595
- B300 (InfiniBand)
  - Drivers added: 595
- GB200 NVL72-powered instances
  - New default: 580→595
- GH200
  - New default: 580→595
- H100 (InfiniBand)
  - New default: 580→595
- H200 (InfiniBand)
  - New default: 580→595
- L40
  - New default: 580→595
- L40S
  - New default: 580→595
  - Drivers added: 535, 570, 595
The Cabinet Wrangler and Cabinet Visualizer dashboards now display rack name as the primary label for filtering and identification, replacing NVLink domain. Both metrics remain available in the dashboards. See the Cabinet Wrangler release note for more information.
Version 1.18.0 of the CoreWeave cert-manager Helm chart switches the bundled Let’s Encrypt ClusterIssuers from HTTP01 to DNS01 challenges, resolved through a CoreWeave webhook at acme.coreweave.com. An ingress controller is no longer required for certificate issuance, and wildcard certificates are now supported. See the cert-manager DNS01 release notes for more information.
The Cloud Console now includes a native Billing insights page that shows billable usage, measured usage, and exclusions across your workloads without leaving the console. See the Billing insights release notes for more information.
CoreWeave Grafana now opens to a home page that gives you an immediate view of your environment without navigating to individual dashboards. The home page includes the latest platform announcement, an environment overview with GPU node counts and allocation, quick links to top dashboards, and a live feed from the CoreWeave status page. See the CoreWeave Grafana home page release notes for more information.
CoreWeave AI Object Storage’s Local Object Transport Accelerator (LOTA) now runs on CPU Nodes in addition to GPU Nodes. CPU-only CKS clusters can now use the LOTA endpoint for accelerated object storage access, and cache capacity scales with cluster size across all Node types. See the LOTA on CPU Nodes release notes for more information.
CoreWeave Alerts is now available, delivering real-time notifications about your clusters, deployments, and operations. Route alerts to Slack through an OAuth integration or incoming webhook, or to any HTTPS endpoint using a generic webhook with optional signature verification. Setting up these integrations requires the new Notifications Admin IAM role, while viewing the notifications in Cloud Console requires the new Notifications Viewer role. See the CoreWeave Alerts release notes for more information.
CoreWeave Inference is now available, providing multiple ways to deploy and serve AI models on CoreWeave GPU infrastructure. Serverless Inference lets you deploy models without managing infrastructure. Dedicated Inference lets you deploy custom model weights on dedicated GPU infrastructure with OpenAI-compatible API endpoints, using runtimes such as vLLM or SGLang. Inference on CKS gives you full control over your inference deployment stack using CoreWeave Kubernetes Service. See the CoreWeave Inference release notes for more information.
Two new tutorials are available for running GPU workloads in interactive marimo notebooks on CKS: a JAX training tutorial that streams live loss charts to the browser as training progresses, and a TensorRT-LLM inference tutorial with an interactive model picker supporting models including TinyLlama, Phi-3.5-mini, Mistral 7B, and Llama-3.1 8B FP8. Both tutorials use the kubectl-marimo CLI plugin and require the marimo operator installed on your cluster. See the marimo JAX and TensorRT-LLM release notes for more information.
Documentation is now available for Dedicated VAST Storage, CoreWeave’s single-tenant VAST clusters co-located with your GPU infrastructure. Each cluster is physically isolated to a single tenant with direct access to the VAST Management System (VMS), multi-protocol support (NFS, S3, and SQL), and advanced data services including VAST Catalog, DataBase, DataEngine, and cross-cluster replication. See the Dedicated VAST Storage release note for more information.
Support Access Management gives you visibility and control over CoreWeave employee access to your CKS environment. All CoreWeave support access is request-based, requiring approval from a member of your organization with the Access Request Approver role. Approved access automatically expires after 8 hours, and all sessions are fully auditable. Teleport audit logs and Kubernetes audit logs can be forwarded automatically through CoreWeave Telemetry Relay. See the Support Access Management release notes for more information.
CoreWeave AI Object Storage now supports conditional writes. Attach HTTP precondition headers (If-None-Match or If-Match) to PutObject, CompleteMultipartUpload, and CopyObject requests to make writes atomic and prevent accidental overwrites without client-side locking. See the conditional writes release notes for more information.
CoreWeave Omni is now available. CoreWeave Omni is a cloud-as-a-service model in which CoreWeave deploys and operates the full CoreWeave cloud stack inside your data center. You retain ownership of the facility and hardware while CoreWeave delivers a managed region, the CoreWeave Cloud Platform, and day-to-day operational management. For availability, sizing, and pricing, contact your CoreWeave account team. See the CoreWeave Omni release notes for more information.
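As a rough illustration of the conditional-write semantics described above, the sketch below models If-None-Match and If-Match against an in-memory dictionary. It is not the CoreWeave API: the InMemoryBucket class, the PreconditionFailed exception, and the SHA-256-based ETag are hypothetical stand-ins chosen so the example runs locally without a storage endpoint.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when a write precondition does not hold (S3-style APIs answer HTTP 412)."""


class InMemoryBucket:
    """Toy stand-in for a bucket (hypothetical); it only models the precondition logic."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def put_object(self, key, body, if_none_match=None, if_match=None):
        current = self._objects.get(key)
        # If-None-Match: * means "write only if the key does not exist yet".
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(f"{key} already exists")
        # If-Match: <etag> means "write only if the stored ETag is still this one".
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(f"{key} changed since it was read")
        etag = hashlib.sha256(body).hexdigest()  # assumption: any content-derived tag works here
        self._objects[key] = (etag, body)
        return etag


bucket = InMemoryBucket()
first_etag = bucket.put_object("ckpt/latest", b"step-100", if_none_match="*")

try:
    # A second writer racing to create the same key is rejected instead of overwriting.
    bucket.put_object("ckpt/latest", b"step-999", if_none_match="*")
except PreconditionFailed as err:
    print("rejected:", err)

# Compare-and-swap: the update succeeds only because the ETag is still current.
second_etag = bucket.put_object("ckpt/latest", b"step-200", if_match=first_etag)
```

With a real S3-compatible client the same two patterns would be expressed as request headers rather than keyword arguments; the release notes remain the reference for which operations and header values are accepted.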
Logs are now available from Super Regional data sources in US East, US West, and EU South, alongside the Global logs source. Super Regional Grafana data sources let you query application and platform logs in the region where they were generated, with separate Super Regional sources for CKS audit logs. See Metrics and logs data sources for endpoints, data source names, and when to use Global versus Super Regional queries.
A new tutorial is available for deploying Spegel, a stateless peer-to-peer OCI registry mirror, on CKS. Spegel speeds up container image pulls by sharing image layers across cluster nodes using a distributed hash table, reducing external registry traffic. CKS clusters are pre-configured with the required containerd settings. See the Spegel release note for more information.
B300 (InfiniBand) instances are now available. These instances are powered by eight NVIDIA B300 Blackwell GPUs and deliver higher performance per GPU, 50% more GPU memory, and double the InfiniBand speed compared to B200 systems. B300 instances are available in US-EAST-13A and US-WEST-01A. Contact us for pricing.
The ncore image and GPU driver compatibility table has been updated. B300 (InfiniBand) is now included with latest supported ncore image ncore-image-2.33.0 and compatible GPU drivers 580 and 590. See GPU driver management and Update GPU driver version for details.
CoreWeave AI Object Storage now supports pre-staging objects into the LOTA cache. A single HeadObject call triggers LOTA to fetch the complete object from backend storage and place it in the distributed NVMe cache, eliminating cold-start latency for training, inference, and checkpoint-restore workloads. See the pre-stage cache release note for more information.
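The pre-staging flow above can be sketched as a small helper that issues one HeadObject call per key. This is a minimal sketch under stated assumptions: `client` is any object with a boto3-style `head_object(Bucket=..., Key=...)` method that has already been pointed at the LOTA endpoint as described in the documentation, and the `prestage` function name and `max_workers` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def prestage(client, bucket, keys, max_workers=8):
    """Send one HeadObject request per key and return {key: ContentLength}.

    Per the changelog entry, the HEAD request itself is what prompts LOTA to pull
    the object into its cache; this helper only fans the requests out in parallel.
    """

    def head(key):
        response = client.head_object(Bucket=bucket, Key=key)
        return key, response["ContentLength"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(head, keys))
```

A caller would typically run this over the shard or checkpoint keys a job is about to read, before the job starts. Whether the cache fill has completed by the time the HEAD response returns is not stated in the entry, so treat the call as a hint rather than a barrier unless the release note says otherwise.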
SUNK v7.3.0 has been released. This release enables explicit naming of Slurm nodes through configurable Kubernetes labels. Additionally, nodes in drain with the duplicate job id reason will now be picked up by the automatic HPC verification workflow. This release also adds the ability to capture logs from slurmd and slurmstepd, and includes several bug fixes. For more information, see the SUNK v7.3.0 release note.
Spot Node Pools are now available in CKS, providing pay-as-you-go access to high-performance, preemptible compute resources without long-term commitments. A new Capacity plans overview page compares all four CKS capacity models side by side. See the Spot Node Pools release note for more information.
A new tutorial is available for deploying an OCI container registry on CKS using Zot, with image storage in CoreWeave AI Object Storage and LOTA for in-cluster performance. CoreWeave has validated Zot against the OCI Distribution Specification conformance suite. See the Deploy a container registry on CKS with Zot release notes for more information.
CKS Node Pools now support comprehensive configuration management with staged updates, rollback capabilities, and full visibility into configuration history. Node Pool configurations define the desired state for Nodes, including ncore image, GPU driver, and Kubernetes versions. The Node Pool Status now tracks the active configuration, staged pending configurations awaiting user approval, and a history of all applied configurations. Use the CoreWeave Intelligent CLI (cwic) to upgrade Node Pools to pending configurations or roll back to previous configurations. See the Node configuration visibility and management release notes for more information.
CoreWeave AI Object Storage now supports the RenameObject API for atomic, server-side object renaming. The RenameObject API provides atomic rename operations within the same bucket, completing in milliseconds regardless of object size. Unlike copy-and-delete workflows, RenameObject updates metadata only: no data is copied and no temporary storage duplication occurs. For more information, see the RenameObject release notes.
The Usage by Product and Zone dashboard now shows two usage layers for each resource type, making the difference between total metered usage and billing-basis usage visible. In CoreWeave Observe™, the dashboard now displays two sets of cards for each resource type (GPUs, CPUs, Storage, IP Addresses):
- Measured usage: All metered usage before any exclusions are applied.
- Net usage: Usage after CoreWeave-level exclusions and adjustments. This is the basis for billing, subject to your contract terms (rates, discounts, and credits).
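The relationship between the two layers can be illustrated with a small calculation. This is only a sketch: the `net_usage` function is hypothetical, it assumes exclusions are expressed in the same unit as measured usage, and it treats adjustments as simple subtractions, which may not match how every adjustment is applied in practice.

```python
def net_usage(measured, exclusions):
    """Return net (billing-basis) usage: measured usage minus excluded usage.

    Both arguments are assumed to use the same unit, for example GPU-hours.
    """
    if exclusions < 0 or exclusions > measured:
        raise ValueError("exclusions must be between 0 and measured usage")
    return measured - exclusions


# Hypothetical numbers: 1,000 GPU-hours metered, 40 of them excluded.
print(net_usage(1000, 40))  # 960
```

The net figure is the basis for billing, not the bill itself, since contract terms such as rates, discounts, and credits are applied afterwards.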
Distributed File Storage documentation now includes instructions for rebinding PVCs to new namespaces using the rebind-pvc.sh utility script. This automated tool simplifies making persistent volumes available across different namespaces. See the PVC namespace rebinding release notes for more information.
A new tutorial is now available for deploying self-hosted GitHub Actions runners on SUNK. The Run GitHub Actions Runners on SUNK guide walks through the process of using the Actions Runner Controller (ARC) to create and manage runners for both CPU and GPU workloads. See the release notes for more information.
CKS now supports Kubernetes v1.35. For all new CKS clusters, v1.35 is now the default version. Support for v1.32 for new clusters has been deprecated, but existing clusters running v1.32 will continue to work. See Cluster Components for the full list of supported versions.
CKS documentation now includes a tutorial on running marimo notebooks. The tutorial covers deploying and managing marimo notebooks on CKS and connecting them to AI Object Storage for data access. To get started, see Run marimo notebooks on CKS.
SUNK v7.2.0 has been released. This release adds a default timeout for MySQL probes and adds the ability to configure image registries. Images are now published to the new default registry at ghcr.io/coreweave/slurm-containers; images are no longer published to docker.artifacts.coreweave.com. See the SUNK v7.2.0 release notes for more information.
The Direct Connect documentation has been updated with new CoreWeave DX locations in North America and Europe. For more information, see CoreWeave DX locations.
SUNK v7.1.1 has been released. This release improves error handling when syncing users, enables reporting for nil plugin types, and fixes other bugs within the Slurm chart.
The Direct Connect documentation has been enhanced with more comprehensive information about Dedicated and Virtual DX options. Updates include:
- Detailed comparison table between Dedicated DX and Virtual DX connectivity options
- New sections on distinguishing features for each connectivity option
- Updated connection process information for Virtual DX through Equinix Fabric and Megaport
- Added information about on-demand provisioning and network expansion
- CoreWeave Observe™ now includes the Cluster Resource Overview dashboard. This dashboard provides a comprehensive overview of your cluster’s health and resource utilization. To see the documentation about this dashboard, go to the Cluster Resource Overview page.
- CoreWeave Observe™ now writes and stores metric data from US-WEST. For more information, see Data sources.
Non-admin users can now perform AI Object Storage actions in the Cloud Console when granted specific permissions via organization access policies. See the AI Object Storage Console Access release notes for more information, and the Console Permissions Reference for the list of permissions required.
SUNK v6.10.0 has been released. This release adds additional metrics, updates the Slurm container version, updates queue counts to include array jobs, and includes multiple bug fixes. For more information, see SUNK releases v6.10.0 and v7.1.0.
SUNK v7.1.0 has been released. This release adds additional metrics, updates queue counts to include array jobs, and includes multiple bug fixes. For more information, see SUNK releases v6.10.0 and v7.1.0.
CoreWeave introduces three new features:
- CoreWeave Mission Control Agent is in Private Preview.
- Telemetry Relay is now Generally Available.
- Mission Control’s GPU Straggler Detection is now in Private Preview.
CKS no longer maintains the allocatedNodes Node PoolStatus. See Node PoolStatus for the list of Node PoolStatus fields.
IAM Access Policies are now available. This feature allows you to control access to resources in the CoreWeave platform. For more information, see IAM Access Policies.
Automated User Provisioning (AUP) is now available. This feature allows you to synchronize users and groups from an Identity Provider (IdP) to the CoreWeave platform. For more information, see Automated User Provisioning.
SUNK User Provisioning (SUP) is now available. This feature allows you to synchronize users and groups from an Identity Provider (IdP) or directly from CoreWeave IAM to a SUNK cluster. For more information, see SUNK User Provisioning.
OIDC Workload Identity Federation is now available. This feature allows you to authenticate CKS workloads to external cloud services and AI Object Storage using OIDC tokens. For more information, see OIDC Workload Identity Federation.
Flexify Inc. has been added to CoreWeave’s Sub-processors list.
SUNK v7.0.0 has been released. This release introduces a major version upgrade to Slurm 25.05.3, more consistent node scaling behavior, new default Slurm chart configurations, improved memory management, and bug fixes. For more information, see SUNK v7.0.0 release notes.
CoreWeave Observe™ documentation now lists Node alerts in the Kubernetes Training Jobs and Slurm Job Metrics pages. You can also view the list of Node alerts in the Node Pool reference documentation.
CKS no longer maintains the following Node Pool conditions:
AcceptedAllocatedSufficientCapacity
CKS can enable workloads to use IMEX with Dynamic Resource Allocation (DRA). This is a limited availability feature. To learn more, go to Enabling IMEX Compute Domains with Dynamic Resource Allocation. See the IMEX with DRA release notes for more information.
SUNK v6.9.1 has been released. This is a patch release that fixes an issue related to excessive reconfigures triggered by watching the topology.conf file. This patch also improves handling of large numbers of jobs in the completing state, and increases the time a job is allowed to be in the completing state to prevent early termination of completing jobs. For more information, see SUNK v6.9.1 release notes.
CoreWeave AI Object Storage Usage-Based Billing is now available. This introduces a third tier of storage pricing for AI Object Storage: Hot, Warm, and Cold. Usage-Based Billing replaces the previous Automated Archive feature, and is enabled by default starting October 31, 2025.
CoreWeave AI Object Storage Inventory Reports are now available. This feature allows you to view and download reports on your AI Object Storage usage and inventory. Learn how to generate inventory reports for your AI Object Storage buckets.
SUNK documentation now contains a tutorial on running torchforge on SUNK. To view and complete the tutorial, go to Run torchforge on SUNK.
SUNK now includes a new tutorial on running Ray on SUNK. To access and complete the tutorial, go to Run Ray on SUNK.
General Access Region EU-SOUTH-04 in Alava, Spain is now available and supports CoreWeave AI Object Storage.
SUNK v6.9.0 has been released. This release introduces automatic job requeueing during rolling upgrades, new configuration options for slurmrestd, improved resource optimization, and a new command alias. For more information, see SUNK v6.9.0 release notes.
Dedicated Access Region US-EAST-11 in North Carolina, USA is now available.
CoreWeave AI Object Storage quota limits have been increased to 100 TiB per Availability Zone.
Dedicated Access Region US-CENTRAL-04 in Texas, USA is now available.
CoreWeave AI Object Storage Automated Archive is now available. This feature automatically archives inactive objects after 30 days.
SUNK v6.8.0 has been released. This release adds a cleanup script for jobs stuck in a completing state, introduces a cache-dropper sidecar for compute pods, updates SCIM parameters to filter inactive users, fixes a race condition in slurmd startup when using cgroupv2, and enhances the syncer to more appropriately issue a scontrol reconfigure when nodes are added. It also includes a number of bug fixes. For more information, see SUNK v6.8.0 release notes.
CoreWeave Security documentation has been added. See CoreWeave Security for more information.
Dedicated Access Region CA-EAST-01 in Ontario, Canada is now available and supports CoreWeave AI Object Storage.
CKS documentation now includes instructions for setting up and running Kubeflow on CKS.
CKS now supports Kubernetes v1.34, bringing the latest features and security updates to CKS clusters. v1.34 is now the default version for all new CKS clusters. Support for v1.31 for new clusters has been deprecated, but existing clusters running v1.31 will continue to work.
All newly created CKS clusters will have Cilium v1.18.1 as their default Container Network Interface (CNI).
CKS documentation now contains instructions for using third-party frameworks on CKS. For more information, go to Introduction to Third-Party Frameworks.
CoreWeave Observe™ includes two new dashboards: Slurm Block Topology and Kueue Metrics. See CoreWeave Observe™: Slurm Block Topology and Kueue Metrics for more information.
CKS now supports Kubernetes v1.33, bringing the latest features and security updates to your Kubernetes clusters. This release also includes cgroup v2 as the default control group version. See the August 26, 2025 release notes for detailed information.
The CKS External Hostname Controller has changed the way it reports DNS names for services running on CKS. See CKS External Hostname Controller changes for more information.
In preview: CKS now offers cluster autoscaling. For more information, see Autoscale Node Pools.
CoreWeave’s GB300 NVL72-powered cloud instances are now available in select Regions.
CKS now includes a new tutorial on scaling vLLM inference workloads. To access and complete the tutorial, go to Deploy vLLM for Inference.
GPU driver management features are now available in CKS Node Pools, allowing you to specify and target specific GPU driver versions for your workloads. This feature provides better control over driver compatibility and enables homogeneous driver environments across your clusters. See GPU driver management features release notes for detailed information.
SUNK v6.7.0 has been released, introducing support for CUDA 12.9, enhanced SCIM and nsscache functionality (such as filtering and home directory overrides), HDF5 plugin support, and various bug fixes for directory service integration, GPU detection, and Slurm task management. See SUNK v6.7.0 release notes for more information.
CoreWeave Observe™ includes three new dashboards and reorganized folders. See CoreWeave Observe™: New dashboards and reorganized folders for more information.
The CoreWeave Terraform provider now supports AI Object Storage, enabling infrastructure-as-code management of buckets, policies, lifecycle configurations, and versioning. See CoreWeave AI Object Storage Terraform Provider Support for more information.
CoreWeave AI Object Storage now supports server-side encryption with customer keys (SSE-C), providing enhanced data security and control for stored objects. See CoreWeave AI Object Storage SSE-C support for detailed information.
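To make the SSE-C request shape concrete, the sketch below builds the three request headers that the standard S3 SSE-C protocol expects from a 256-bit customer key. It assumes CoreWeave AI Object Storage accepts the S3-standard header names; the `sse_c_headers` helper is hypothetical, and the SSE-C support documentation is the reference for what the service actually requires.

```python
import base64
import hashlib
import os


def sse_c_headers(key: bytes) -> dict:
    """Build S3-style SSE-C request headers for a 32-byte (256-bit) customer key."""
    if len(key) != 32:
        raise ValueError("SSE-C expects a 256-bit (32-byte) key")
    # The protocol uses MD5 only as an integrity check on the key in transit,
    # not for security, hence usedforsecurity=False (Python 3.9+).
    key_md5 = hashlib.md5(key, usedforsecurity=False).digest()
    return {
        "x-amz-server-side-encryption-customer-algorithm": "AES256",
        "x-amz-server-side-encryption-customer-key": base64.b64encode(key).decode("ascii"),
        "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(key_md5).decode("ascii"),
    }


# Generate a random key for illustration; a real deployment must store the key
# safely, because the same key has to be supplied again to read the object.
headers = sse_c_headers(os.urandom(32))
```

S3 client libraries usually build these headers themselves when given the key, so a helper like this is mainly useful for raw HTTP clients or for checking what a client sends.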
CKS now supports Kubernetes upgrades to take advantage of the latest features and security updates. See CKS Kubernetes upgrade support for detailed information.
The default control group version has changed to v2 for CKS clusters targeting Kubernetes v1.33, aligning with upstream Kubernetes support policy. See CKS Kubernetes upgrade support for more information.
CoreWeave Telecaster™ is now available in CoreWeave Observe™, providing fully-managed log and metric forwarding to external destinations. See the Telecaster release notes for more information.
Added a default value for pool size in Helm charts to prevent configuration issues.
Segment-calc now skips Nodes already in DRAIN state to prevent skewed capacity charts.
SCIM provisioning for SUNK is now available via nsscache. This enables automated, standards-based user and group management from your IdP to CoreWeave clusters. See SUNK v6.6.0 release notes for detailed information.
Slurm job and Node outputs now include direct links to their corresponding Grafana dashboards, giving operators one-click visibility into live job metrics. See CoreWeave Grafana for more information.
Added two new compute definitions: rtxp8x (NVIDIA RTX Pro 6000 Blackwell Server Edition). See Instances and the RTX Pro 6000 release notes for detailed specifications.
Slurm metrics now carry the slurm_cluster label, simplifying multi-cluster dashboards. See CoreWeave Grafana for monitoring capabilities.
MySQL exporter metrics are automatically scraped and ingested. See CoreWeave Logs and Metrics for querying capabilities.
NCCL-test base image updated to nccl-tests/d5a135d, ensuring compatibility with the latest CUDA toolchain.
CoreWeave IAM is now fully integrated with the Slurm Helm chart. See SUNK for more information.
Optional SSSD mounts are intelligently gated, reducing unnecessary container overhead. See Directory Services for configuration details.
Nodes that stay “busy” inside a Reservation are automatically re-evaluated after 30 minutes, reducing orphaned allocations. See Node Lifecycle for more information.
Disabled NVIDIA device-plugin health checks that could cause false Node drains.
Multiple operator dependencies updated (chi v5, viper v2, Go Slurm) to incorporate upstream security and stability patches.
PodMonitor and VMPodScrape templates now use consistent relabeling syntax.
Removed the InfiniBand requirement for A100-based Nodes where it is not present. See Instances for A100 specifications.
SUNK v6.6.0 has been released with SCIM provisioning via nsscache, enhanced monitoring with dashboard links, improved node reconciliation, new GPU compute definitions (rtxp8x), metrics improvements, segment-calc script enhancements, and base image upgrades. This release also includes automatic scraping of MySQL metrics, fixes for metrics labeling, and improved segment-calc handling for DRAIN nodes. See SUNK v6.6.0 release notes for detailed information.
RTX Pro 6000 Blackwell Server Edition cloud instances are now available in select CoreWeave Availability Zones. These instances combine NVIDIA’s RTX Pro 6000 Blackwell Server Edition with CoreWeave’s managed services, observability, and high-performance networking. See RTX Pro 6000 Blackwell Server Edition release notes for detailed information.
Encryption at rest for Kubernetes Secrets is now enabled by default in all CoreWeave Kubernetes Service (CKS) clusters. This feature uses a KMS-backed integration to encrypt etcd data automatically. See the CKS encryption at rest release notes for more information.
A new Kubernetes API endpoint for unmanaged auth is now available in CKS, enabling custom authentication workflows. See the Unmanaged auth API release notes for more information.
Control Plane Node Pools are no longer provisioned in CKS clusters. These changes improve cluster provisioning speed and reliability while enabling custom authentication workflows. See the July 7, 2025 release notes for detailed information.
Node Pool condition transition improvements for better cluster management and monitoring. See Node Pool condition transition release notes for detailed information.
Support for NVSHMEM and GDRCopy is now available, enabling high-performance GPU-to-GPU communication. See NVSHMEM and GDRCopy release notes for detailed information.
CKS cluster management improvements with enhanced Node Pool management. See CKS Clusters for more information.
SUNK v6.5.0 has been released with major improvements to monitoring, system stability, and resource management. This release introduces enhanced dashboard integration for Slurm jobs and nodes, improved metrics labeling, automatic MySQL metrics scraping, and new compute definitions. It also includes fixes for NVIDIA device-plugin health checks, segment-calc handling for DRAIN nodes, and updates to operator dependencies. See SUNK v6.5.0 release notes for detailed information.
Slurm upgraded to 24.11.05, bringing in the latest upstream fixes and enhancements. See SUNK for more information.
NCCL bumped to 2.26.5, improving GPU communication performance.
Added new CUDA runtime images for 12.8.1 and 12.9.0.
Introduced nsscache as an alternative option to SSSD for user caching. See Directory Services for configuration details.
Enabled timeout-based forced deletion of compute pods (disabled by default), allowing cleanup even when jobs are still running.
Backported Slurm 25.05 SlurmdSpecOverride and container awareness features to correctly configure CPUSpecList and MemSpecList, so static pod workloads no longer enter an invalid state after scontrol reconfigure.
Enhanced controller.etcConfigMap to accept either a single string or a list of multiple ConfigMaps.
Added the segment-calc script for visualizing block-topology segment allocations. See Topology/Block Scheduling in Slurm for more information.
CKS now supports Kubernetes v1.32.
Defaulted PasswordAuthentication to no in sshd for improved security.
Charts now manage the ns.coreweave.cloud/managed namespace label.
Added support for VMPodScrape as an alternative to PodMonitor for metric gathering.
Cabinet Wrangler is now available for managing cabinet-level operations and monitoring. See Cabinet Wrangler release notes for detailed information.
SUNK v6.4.1 has been released as a patch release with critical memory parsing fixes, improved MOTD script handling, container runtime enhancements, and RDMA configuration cleanup. This release addresses important issues discovered in v6.4.0, including a critical memory parsing fix, improved login template configuration, and enhanced container runtime stability. All v6.4.0 deployments should upgrade to v6.4.1 to resolve these issues. See SUNK v6.4.1 release notes for detailed information.
NVIDIA HGX B200 instances are now Generally Available, providing next-generation AI compute capabilities. See NVIDIA HGX B200 instances GA release notes for detailed information.
SUNK v6.4.0 has been released with significant improvements to login pod management, configuration capabilities, and user experience. This release introduces external MySQL database configuration in the Slurm Helm chart, improved hostname resolution for login pods, customizable MOTD display, user-controlled pod reboot, enhanced error handling, and dashboard integration features. See SUNK v6.4.0 release notes for detailed information.
Internet Transit Dashboard is now available, providing real-time visibility into network traffic and performance. See Internet Transit Dashboard release notes for detailed information.
New Node Pool UI enhancements for improved cluster management experience. See Node Pool UI enhancements release notes for detailed information.
New features in SUNK v6.3.0 including enhanced Slurm functionality and performance improvements. See SUNK v6.3.0 release notes for detailed information.
Node ID Format Change implemented for improved system identification and management. See Node Lifecycle for more information.
SUNK v6.2.0 has been released with Device Plugin chart integration, Slurm upgrade to v24.11.4, AllowGaps patch for improved scheduling, and configurable operator log levels. See SUNK v6.2.0 release notes for detailed information.
Slurm Device Plugin Helm chart has been integrated as a subchart in SUNK, simplifying GPU resource provisioning within clusters managed by Slurm.
Slurm has been patched to support the AllowGaps setting in topology.conf, allowing for non-contiguous Node groupings in block topology mode.
The SUNK operator now includes configurable log levels, which can be set through Helm values for fine-grained control over log verbosity.
A new drain_time_seconds metric has been added for Slurm nodes, reporting how long a Node has been in the DRAIN or DRAINING state.
A new compute Node type for CPU-only Nodes has been defined in the Helm charts, enabling deployment scenarios that do not require GPU-specific configurations. See CPU Instances for available options.
“Explore” Now Available in CoreWeave Observe™, providing enhanced data exploration capabilities. See Grafana Explore release notes for detailed information.
New features in SUNK v6.1.0 including enhanced Slurm functionality and performance improvements. See SUNK release notes for more information.
CoreWeave AI Object Storage is now Generally Available, providing high-performance object storage optimized for AI workloads. See CoreWeave AI Object Storage GA release notes for detailed information.
AI Object Storage is now supported in an additional Availability Zone, US-EAST-01A. See CoreWeave AI Object Storage GA release notes for availability details.
Brand new Cloud Console UI for AI Object Storage with enhanced user experience and streamlined data management. See CoreWeave AI Object Storage GA release notes for detailed information.
Introducing CoreWeave AI Object Storage, a new high-performance object storage solution designed specifically for AI and machine learning workloads. See CoreWeave AI Object Storage for more information.
SUNK v6.0.0 has been released with significant new features and breaking changes. See SUNK v6.0.0 release notes for detailed information.
CoreWeave Kubernetes Service (CKS) API is now Generally Available, enabling programmatic deployment, management, and scaling of HPC applications using Kubernetes on CoreWeave’s high-performance infrastructure. See CKS API and Terraform provider release notes for detailed information.
CoreWeave Terraform provider is now available, allowing customers to deploy and manage VPCs and CKS clusters as code. See CKS API and Terraform provider release notes for detailed information.
Enhanced Cloud Console design and user experience with improved usability and creation flows for faster cluster deployment and better resource management. See CKS API and Terraform provider release notes for detailed information.
SUNK v5.7.0 has been released with a change to using direct RPCs to the Slurm controller instead of the REST API. The REST API is now an optional component and must be explicitly enabled if required. See SUNK v5.7.0 release notes for detailed information.
SUNK v5.6.0 released with enhanced Slurm login functionality and improved compute definitions. See SUNK for more information.
- Added individual Slurm login pods implementation with user cache controller for improved authentication management.
- Added GB200 compute definition to support the latest NVIDIA hardware.
- Added CUDA 12.8 image builds for enhanced GPU support.
- Upgraded Slurm to 24.05.05 with latest upstream fixes and improvements.
- Enhanced block topology configuration with automatic generation from labels for improved GPU scheduling.
- Added readiness probe to slurmd for better health monitoring.
- Fixed syncer cluster role binding name to prevent deployment issues.
- Removed default CPU limit for login pods to improve performance.
- Updated directory-cache image to include OS suffix for better compatibility.
- Fixed nvlink domain handling to skip domains labeled “0” (no domain).
- Improved resource usage calculation by ignoring completed pods.
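The NVLink domain fix in SUNK v5.6.0 can be pictured with a small Python sketch: nodes are grouped by a domain label, and nodes whose label value is "0" are treated as having no domain. The label key `example.com/nvlink-domain` and the `group_by_nvlink_domain` helper are hypothetical; this changelog does not state the actual label key or how the operator implements the check.

```python
from collections import defaultdict

# Hypothetical label key, used for illustration only.
DOMAIN_LABEL = "example.com/nvlink-domain"


def group_by_nvlink_domain(node_labels):
    """Group node names by NVLink domain.

    node_labels maps a node name to its dict of labels. Nodes without the
    label, or with the value "0" (meaning no domain), are skipped.
    """
    domains = defaultdict(list)
    for name, labels in sorted(node_labels.items()):
        domain = labels.get(DOMAIN_LABEL)
        if domain is None or domain == "0":
            continue
        domains[domain].append(name)
    return dict(domains)


if __name__ == "__main__":
    nodes = {
        "node-a": {DOMAIN_LABEL: "rack-1"},
        "node-b": {DOMAIN_LABEL: "0"},       # no domain, skipped
        "node-c": {DOMAIN_LABEL: "rack-1"},
        "node-d": {},                        # label missing, skipped
    }
    print(group_by_nvlink_domain(nodes))
```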
GB200 NVL72-powered cloud instances are now available in selected CoreWeave Regions, combining NVIDIA’s GB200 Superchips in a 72-GPU NVLink-connected fabric with CoreWeave’s managed services. See GB200 NVL72 instances release notes for detailed information.
H100 and H200 based instances now support NV HGX 1.5.0 firmware, delivering enhanced GPU stability and improved troubleshooting capabilities. See H100 and H200 firmware update release notes for detailed information.
SUNK v5.5.0 released with improved resource cleanup and Slurm login chart implementation. See SUNK for more information.
SUNK v5.4.0 released with enhanced Slurm login functionality and improved compute definitions. See SUNK for more information.
- Added single projected volume for SSSD to simplify configuration and improve security.
- Added dynamic feature prefixing for flexible feature configuration.
- Added H200 compute definition to support the latest NVIDIA hardware.
- Enhanced Slurm login chart with improved pod specification handling.
- Added cleanup of Slurm nodes following removal from NodeSlices for better resource management.
- Implemented LoginReconciler for improved login pod management.
- Updated NCCL base images to newer versions with HPC-X 2.21 for enhanced performance.
- Added OwnerReference for resource cleanups to improve resource management and prevent orphaned resources.
- Implemented slurm-login chart for better login node management.
- Fixed affinity configuration in compute base definitions.
- Added InfiniBand support to H200 compute definitions.
- Removed Ubuntu 20.04 image builds to focus on supported versions.
- Fixed lock annotation removal when nodes are removed from nodesets.
- Upgraded to Go 1.23.2 for improved performance and security.
- Fixed ignore_group_members configuration by renaming to ignoreGroupMembers for consistency.
- Corrected login pod template indentation to prevent deployment issues.
- Updated LDAP secret key defaults to use ldap-password.conf for better compatibility.
SUNK v5.3.0 released with enhanced Slurm functionality and improved monitoring capabilities. See SUNK for more information.
- Added GH200 compute definition to support the latest NVIDIA hardware.
- Enhanced login SSH daemon liveness probe for better health monitoring.
- Added scripts for deleting NVIDIA hooks on CPU nodes to prevent conflicts.
- Allowed list of prolog/epilog configmaps in Helm values for flexible configuration.
- Exposed all probes for all containers in Helm values for comprehensive monitoring.
- Moved Slurm secret manifests to secret job for improved security.
- Enhanced Node Extras handling to prevent overwriting of extra fields.
- Improved condition synchronization from pods to nodes for better state management.
- Added SSSD config reload capability for dynamic configuration changes.
- Upgraded Slurm to 24.05.4 with latest upstream fixes and improvements.
- Fixed Slurm probe indentation in Helm charts.
- Corrected MySQL resource defaults for better performance.
- Made MySQL secret immutable and persistent for improved security.
- Removed defunct CgroupAutomount option to prevent configuration errors.
- Enhanced persistent connections to slurmctld for improved stability.
- Fixed Slurm completion script permissions for proper execution.
- Updated Slurm image dependencies for better compatibility.
- Upgraded Ubuntu images to newer tags for security and performance.
- Fixed array job merge behavior for metrics collection.
- Corrected scheduler hook bug when pods are deleted before hook execution.
- Improved condition update handling on pods for better state management.
- Fixed termination grace error handling for improved reliability.
SUNK v5.2.0 released with enhanced Slurm PAM module support and improved monitoring. See SUNK for more information.
- Added packages to support Slurm PAM module for enhanced authentication capabilities.
- Added host aliases to Slurm chart for improved networking configuration.
- Added Slurm not responding condition for better health monitoring.
- Enhanced operator syncer and scheduler configuration for improved performance.
- Switched to cgroup process tracking as default in Helm charts for better resource management.
- Fixed leader election configuration to not force it by default for SUNK.
- Fixed missing volumes on REST deployment for proper functionality.
- Updated disk space check for MySQL init container to prevent deployment issues.
- Used templates for operator scheduler and syncer configs for consistency.
- Moved hooksapi out of syncer and scheduler configs for better separation of concerns.
- Reevaluated Slurm controller liveness probe for improved health checking.
- Reduced noisy messages from user-lookup container for cleaner logs.
SUNK v5.1.0 released with enhanced monitoring capabilities and improved configuration options. See SUNK for more information.
- Added additional slurmdbd.conf lines to Helm values for flexible database configuration.
- Allowed additional DNS config searches for improved name resolution.
- Added custom plugstack.conf entries support for enhanced Slurm configuration.
- Exposed compute liveness probe configuration for better health monitoring.
- Made field labels and metrics consistent across the platform.
- Added Slurm job uptime metrics for better job monitoring.
- Exposed Slurm RPC stats for Prometheus metrics collection.
- Renamed diagnostic metrics and fixed pointer checks for improved monitoring.
- Updated base images for all image builds to latest versions.
- Fixed user-lookup enablement to only activate when canary users are set.
- Added missing labels to resources for better organization.
- Adjusted slurmd default timeout to 60 seconds for better performance.
- Fixed scheduler script to prevent Slurm bug in job handling.
- Included topology.conf in watched files again for proper configuration monitoring.
- Properly added additional configuration to plugstack.conf for enhanced functionality.
- Set default max_rpc_cnt for SchedulerParameters to prevent issues.
- Unified approach to labels on SUNK chart for consistency.
- Patched default show flags in REST API for nodes to improve visibility.
- Removed unrecognized configure options from Slurm Dockerfile to prevent build issues.
- Implemented deduplication of Get requests in Slurm client for improved performance.
- Injected missing SLURM_CLUSTER_NAME environment variable in compute nodes.
- Corrected URLs to documentation for better user experience.
- Fixed nil features clearing issue in operator for better state management.
- Excluded pods not in Ready state from auto cleanup to prevent data loss.
- Started API health check after Slurm client initialization for proper startup sequence.
SUNK v5.0.0 released with major upgrade to Slurm 24.05.x and enhanced security features. See SUNK for more information.
- Upgraded to Slurm 24.05.0 with latest upstream features and improvements.
- Added leader election for operator to improve reliability in multi-instance deployments.
- Enabled pyxis and security capabilities by default for enhanced container security.
- Upgraded default resources for Slurm components to improve performance.
- Enhanced plugstack.conf customization in Helm for flexible configuration.
- Upgraded enroot and pyxis to latest versions for improved container management.
- Updated exported node metrics from Slurm for better monitoring.
- Added orphaned pod checking for better resource cleanup.
- Updated controller-runtime to 0.18.3 for improved Kubernetes integration.
- Changed default Munged UID/GID and allowed configuration for better security.
- Bumped scrape timeout for syncer to prevent monitoring issues.
- Added temporary fix for inode locking issue to prevent file system problems.
- Dropped CUDA version in values-cw.yaml back to 12.2 for compatibility.
- Updated JWT secret to use infinite lifespan for better security.
- Corrected license dates in documentation header templates.
- Added condition delay check for nodes in tests to improve reliability.
- Added replica check for e2e tests to ensure proper deployment.
- Corrected scaleDeployment bug for checking incorrect pods.
- Upgraded Slurm to 24.05.1 with latest patch fixes.
- Made micromamba executable for proper package management.
- Properly handled node info updates in nodeslice for better state management.
- Changed array job merge behavior for improved job handling.
- Corrected minor behavior in nodeset scaling for better resource management.
- Fixed pod node assignment handling to prevent errors when nodes are not assigned.
- Improved clarity of scheduler errors for better troubleshooting.
- Moved kubectl installation from script to Dockerfile for better build process.
- Stopped prestop lifecycle hook from overriding existing reasons for better job management.
- Moved test setup into BeforeAll block for improved test organization.
- Added --load-images flag to skaffold for better development workflow.