Model Training and Fine-tuning
Welcome to Machine Learning on CoreWeave Cloud
Training Machine Learning models, especially modern deep neural networks, is at the center of CoreWeave Cloud's architecture. The entire CoreWeave Cloud stack is purpose-built to enable highly scalable, cost-efficient model training.
In addition to its core tech stack, CoreWeave has a history of supporting our customers with cutting-edge Machine Learning research through in-house expertise, industry partnerships, and contributions to research organizations. Our team has extensive experience in training large transformer architectures, optimizing Hugging Face code, and selecting the right hardware for any given job.
There are many ways to run Machine Learning tasks on CoreWeave and, as is typical in this space, often several methods that achieve the same result.
Determined AI is an experiment-oriented MLOps platform featuring hyperparameter search. It runs experiments in containers on CoreWeave Kubernetes and abstracts the DevOps portions behind a simple CLI and UI. Determined AI can be deployed with a single click from the CoreWeave Application Catalog, and is a great way to run large distributed training jobs, with support for most popular frameworks.
Training Operators offer a Kubernetes-native way to run distributed training jobs. They support TensorFlow, PyTorch Distributed, and any MPI-style framework such as DeepSpeed.
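To make "Kubernetes-native" concrete, here is a minimal sketch of what a distributed PyTorch job could look like as a manifest. It assumes the Training Operators in question are the Kubeflow Training Operator, whose PyTorchJob resource uses the kubeflow.org/v1 API group. The helper build_pytorch_job, the job name, and the container image are hypothetical and exist only for illustration. GPU types, node selectors, and storage are left out.

```python
import json


def build_pytorch_job(name, image, workers, gpus_per_replica=1):
    """Return a PyTorchJob manifest as a plain dict.

    Assumes the Kubeflow Training Operator (kubeflow.org/v1) is installed
    in the cluster. The image is a placeholder supplied by the caller.
    """

    def replica_spec(replicas):
        # Master and Worker replicas share the same pod template here.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # The operator expects the training container
                            # to be named "pytorch".
                            "name": "pytorch",
                            "image": image,
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_replica}
                            },
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, for illustration only.
    manifest = build_pytorch_job(
        "example-finetune", "registry.example.com/trainer:latest", workers=3
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with kubectl apply -f, since kubectl accepts JSON as well as YAML manifests. The operator then creates one master pod and the requested number of worker pods for the job.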