March 2023

New this month on CoreWeave Cloud...

🎉 HGX H100 nodes are now online!

Big news! We are proud to announce that CoreWeave has become the first cloud provider in the world to bring NVIDIA HGX H100 nodes online!

Compared to the NVIDIA HGX A100, the NVIDIA HGX H100 makes high-performance computing (HPC) applications up to seven times more efficient, AI training on large models up to nine times faster, and AI inference up to thirty times faster.

This speed, combined with the lowest NVIDIA GPUDirect network latency in the market, delivered by the NVIDIA Quantum-2 InfiniBand platform, reduces the training time of AI models to "days or hours, instead of months." With AI permeating nearly every industry today, this speed and efficiency have never been more vital for HPC applications.

⚓ Introducing SUNK: Slurm on Kubernetes

Slurm is the de facto scheduler for large HPC jobs in supercomputing centers around the world. CoreWeave's Slurm implementation, SUNK ("SlUrm oN Kubernetes"), integrates Slurm with Kubernetes, allowing compute to transition between distributed training in Slurm and applications such as online inference in Kubernetes.

As an implementation of Slurm on Kubernetes deployed on CoreWeave Cloud, SUNK comes complete with options for:

external Directory Services such as Active Directory
Slurm Accounting, backed by a MySQL database
dynamic Slurm node scaling to match your Workload requirements

In SUNK, Slurm images are derived from OCI container images, which execute on bare metal, and compute node resources are allocated using Kubernetes.

Note

CoreWeave maintains several base images for different CUDA versions, including all dependencies for InfiniBand and SHARP. If you'd like to implement SUNK in your cluster, please contact CoreWeave support for engineering assistance with cluster design and deployment.

⚡ Nydus is now on CoreWeave!

Embedding machine learning models directly into images has become a popular ease-of-use technique, but the larger container images that result take longer to pull. Pulling images is therefore often the most time-consuming part of spinning up new containers, and for those who rely on fast autoscaling to respond to changes in demand, the time it takes to create new containers can pose a major hurdle.

For this reason, CoreWeave Cloud now supports Nydus, an external plugin for containerd, for shorter container image pull times.

Leveraging its own container image service, Nydus implements a content-addressable filesystem on top of the RAFS format for container images. This format improves on the current OCI image specification in terms of container launch speed, image space, network bandwidth efficiency, and data integrity. The result: significantly faster container image pull times.
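To make the idea of content addressing concrete, here is a minimal sketch in Python. It is not Nydus code and does not reflect the RAFS on-disk format; the `ChunkStore` class, the fixed chunk size, and the use of SHA-256 are all assumptions chosen for illustration. It shows only the general principle: when data is stored under the hash of its content, identical chunks shared between images are kept once, and every chunk can be verified when it is read back.

```python
import hashlib

# Illustrative only: a tiny fixed chunk size so the example stays readable.
CHUNK_SIZE = 4


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical content-addressable store: chunks are keyed by their SHA-256 digest."""

    def __init__(self):
        self.chunks = {}  # digest (hex string) -> chunk bytes

    def put(self, data: bytes) -> list:
        """Store data and return the ordered list of chunk digests that describes it."""
        digests = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present is not stored a second time.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        return digests

    def get(self, digests: list) -> bytes:
        """Reassemble data from its digests, verifying each chunk's integrity."""
        parts = []
        for digest in digests:
            chunk = self.chunks[digest]
            if hashlib.sha256(chunk).hexdigest() != digest:
                raise ValueError("chunk content does not match digest " + digest)
            parts.append(chunk)
        return b"".join(parts)


# Two "images" that share their first two chunks.
store = ChunkStore()
image_a = store.put(b"AAAABBBBCCCC")
image_b = store.put(b"AAAABBBBDDDD")
print("unique chunks stored:", len(store.chunks))
print("image_b round-trips:", store.get(image_b) == b"AAAABBBBDDDD")
```

In this sketch the two images reference six chunks in total, but the shared ones are stored and transferred only once, which is the intuition behind the space and bandwidth savings described above.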

Important

Nydus on CoreWeave is currently an alpha offering, with a limited, node-specific release.

💪 Distributed training using Kubeflow operators

The Kubeflow project is dedicated to making deployments of Machine Learning (ML) workflows on Kubernetes simple, portable, and scalable. The goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.

CoreWeave is pleased to present new tutorials on using Kubeflow training operators for distributed training on CoreWeave Cloud! Follow along with these walkthroughs to train ResNet-50 with ImageNet, or fine-tune GPT-NeoX-20B with Argo Workflows!

💽 Import disk images using CoreWeave Object Storage

Disk images may be imported from external URLs to be used as source images for root or additional disks for Virtual Servers. In addition to qcow2, images in raw and iso formats are supported, and may be compressed with either gz or xz.
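Before starting an import, it can be handy to check locally that an image URL points at one of the formats listed above. The following Python sketch is one way to do that. The `classify_image_url` helper and the example URLs are hypothetical, and the check assumes the format and compression are visible in the file name's extensions, which is a convention of this sketch and not a documented rule of the import service.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple
from urllib.parse import urlparse

# Formats and compression types named in the text above.
SUPPORTED_FORMATS = {"qcow2", "raw", "iso"}
SUPPORTED_COMPRESSION = {"gz", "xz"}


def classify_image_url(url: str) -> Tuple[str, Optional[str]]:
    """Return (image_format, compression) inferred from the URL's file name.

    Hypothetical helper: it only inspects file extensions and raises
    ValueError if the name does not end in a supported format.
    """
    name = PurePosixPath(urlparse(url).path).name
    suffixes = [s.lstrip(".").lower() for s in PurePosixPath(name).suffixes]

    compression = None
    if suffixes and suffixes[-1] in SUPPORTED_COMPRESSION:
        compression = suffixes.pop()

    if not suffixes or suffixes[-1] not in SUPPORTED_FORMATS:
        raise ValueError("unsupported or unrecognized image format: " + repr(name))

    return suffixes[-1], compression


# Example with a made-up URL:
print(classify_image_url("https://example.com/images/ubuntu-22.04.qcow2.xz"))
```

A check like this only catches obvious mistakes, such as a URL ending in an unsupported extension; it says nothing about whether the file's contents are valid.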

Following our newly published guide, an image stored locally can easily be uploaded to CoreWeave Object Storage, then imported into a DataVolume.

🚢 Deploy custom containers on CoreWeave Cloud

Hosting your own containerized applications on CoreWeave Cloud is simple! With our new guide for deploying custom containers, you can have your applications running in CoreWeave Cloud in minutes!
