Model Training and Fine-tuning

Welcome to Machine Learning on CoreWeave Cloud

Training Machine Learning models, especially modern Deep Neural Networks, is at the center of CoreWeave Cloud's architecture. The entire CoreWeave Cloud stack is purpose-built to enable highly scalable, cost-efficient model training.

In addition to its core tech stack, CoreWeave has a history of supporting our customers with cutting-edge Machine Learning research through in-house expertise, industry partnerships, and contributions to research organizations. Our team has extensive experience in training large transformer architectures, optimizing Hugging Face code, and selecting the right hardware for any given job.

Did You Know?

CoreWeave partnered with the open-source research collective EleutherAI to develop and train the world's largest open-source NLP model, GPT-NeoX-20B!

There are many ways to run Machine Learning tasks on CoreWeave, and, as is typical of the space, there are many methods to achieve the same result.

Frameworks and tools

Determined AI is an experiment-oriented MLOps platform featuring hyperparameter search. Experiments are run in containers on CoreWeave Kubernetes, abstracting the DevOps portions via a simple CLI and UI, and can be deployed with a single click from the CoreWeave Application Catalog. Determined AI is a great way to run large distributed training jobs, with support for most popular frameworks.
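As a sketch of what a Determined AI experiment looks like, the configuration below defines a small hyperparameter search. The names (experiment name, Trial class, metric) are hypothetical placeholders for your own project; consult the Determined documentation for the full set of fields.

```yaml
# Hypothetical Determined experiment config (experiment.yaml).
# Names and values are placeholders -- adapt to your own Trial code.
name: finetune-hparam-search
entrypoint: model_def:MyTrial        # your Trial class, defined in model_def.py
resources:
  slots_per_trial: 2                 # GPUs per trial
hyperparameters:
  learning_rate:
    type: double
    minval: 1.0e-5
    maxval: 1.0e-3
searcher:
  name: adaptive_asha                # Determined's adaptive hyperparameter search
  metric: validation_loss
  smaller_is_better: true
  max_trials: 16
```

You would then launch the experiment with the Determined CLI, e.g. `det experiment create experiment.yaml .`, and monitor trials from the web UI.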

Our favorite workflow runner, Argo Workflows, can easily be tasked to train or fine-tune a model and automatically deploy an Inference endpoint when finished.
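A train-then-deploy pipeline of this kind can be expressed as a two-step Argo Workflow. The sketch below shows the shape of such a manifest; the container images, commands, and the inference manifest path are placeholders, not CoreWeave-provided artifacts.

```yaml
# Sketch of an Argo Workflow: fine-tune a model, then deploy an inference endpoint.
# Images, commands, and file paths are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: finetune-then-deploy-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      steps:
        - - name: finetune          # step 1: run the training job
            template: finetune
        - - name: deploy            # step 2: runs only after finetune succeeds
            template: deploy
    - name: finetune
      container:
        image: ghcr.io/example/trainer:latest   # placeholder image
        command: [python, finetune.py]
        resources:
          limits:
            nvidia.com/gpu: 1
    - name: deploy
      container:
        image: bitnami/kubectl:latest
        command: [kubectl, apply, -f, /manifests/inference-service.yaml]
```

Because each step is just a container, the deploy step can apply any inference manifest (for example, a Knative or KServe InferenceService) once training completes.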

Training Operators offer a Kubernetes-native way to run distributed training jobs. They support TensorFlow, PyTorch Distributed, and any MPI-style framework such as DeepSpeed.
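For instance, a distributed PyTorch job can be declared as a Kubeflow `PyTorchJob` resource. The manifest below is a minimal sketch assuming the Training Operator is installed in the cluster; the image, command, and GPU counts are placeholders.

```yaml
# Sketch of a Kubeflow PyTorchJob for two-pod distributed training.
# Image and command are hypothetical placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-finetune
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                          # container must be named "pytorch"
              image: ghcr.io/example/trainer:latest  # placeholder
              command: [torchrun, train.py]
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: ghcr.io/example/trainer:latest  # placeholder
              command: [torchrun, train.py]
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The operator injects the rendezvous environment (master address, world size) into each pod, so the same training script runs unmodified across replicas.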
