Install Determined AI

How to install via the Applications Catalog on CoreWeave Cloud

Determined AI is an open-source deep learning training platform that makes building models fast and easy. It can be deployed directly onto CoreWeave Cloud from the Applications Catalog.

To install Determined AI onto CoreWeave Cloud, perform the following steps.

Prerequisites

Before installing and setting up Determined AI, it is recommended to complete the following steps:

  1. Create a shared filesystem volume for weights and training data

  2. Install FileBrowser for filesystem navigation via a Web interface

  3. Create an Object Storage bucket for model checkpoint storage

Once these steps are complete, the Determined AI application may be installed.

Configure your CoreWeave credentials

This guide presumes that you have an active CoreWeave Cloud account, and have obtained and locally configured your CoreWeave access credentials.

Create a shared filesystem volume

Create a shared filesystem volume by navigating to Storage Volumes in the Cloud UI. This volume will be used to store weights and training data for fine-tuning.

CoreWeave's shared filesystem volumes can be accessed by many Nodes simultaneously, allowing large amounts of compute power to work on the same dataset.

For this tutorial, the following values are used when creating the storage volume. If needed, it is easy to increase the size of a storage volume later.

| Field name | Demo value |
| --- | --- |
| Volume Name | finetune-gpt-neox |
| Region | LAS1 |
| Disk Class | HDD |
| Storage Type | Shared Filesystem |
| Size (Gi) | 1,000 |
| Labels | None |
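For readers who prefer to think of the volume in Kubernetes terms, the demo values above can be sketched as a PersistentVolumeClaim manifest. This is only an illustration: the helper name `build_pvc_manifest` is hypothetical, and the storage class naming pattern (`shared-<disk class>-<region>`) is an assumption that should be checked against the storage classes actually available in your namespace.

```python
import json


def build_pvc_manifest(name: str, region: str, disk_class: str, size_gi: int) -> dict:
    """Build a PersistentVolumeClaim manifest for a shared filesystem volume."""
    # ASSUMPTION: the storage class follows the pattern shared-<disk class>-<region>.
    storage_class = f"shared-{disk_class.lower()}-{region.lower()}"
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            # ReadWriteMany lets many Nodes mount the volume at the same time.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


# Demo values from the table above.
manifest = build_pvc_manifest("finetune-gpt-neox", "LAS1", "HDD", 1000)
print(json.dumps(manifest, indent=2))
```

Creating the volume through the Cloud UI, as described above, remains the path this tutorial follows.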

Install FileBrowser

The FileBrowser application allows files to be transferred to and from shared filesystem volumes through a Web interface. Installing FileBrowser is optional, but it is recommended because it makes navigating the filesystem easier. Alternatively, a Virtual Server or Kubernetes Pod can be used to interact with the shared filesystem volume through SSH or another mechanism, but such configuration is beyond the scope of this tutorial.

Additional Resources

For complete instructions on installing and configuring FileBrowser, see the FileBrowser installation guide.

While configuring the FileBrowser application, ensure the new filesystem storage volume has been attached to the FileBrowser application, then click the Deploy button.

Create an Object Storage bucket

Most Determined AI applications require an Object Storage bucket to store model checkpoints, while a few, such as Jupyter Notebooks, can run without one.

Unless you are sure your application will not require one, it is recommended to create an Object Storage bucket. Make note of the Access Key and Secret Key values provided in the generated configuration file.

Install Determined AI

Once the previous steps have been completed, navigate to the Applications Catalog, then search for "determined" to locate the Determined AI (determined) application. Click its card to configure the installation, then click Deploy.

The configuration screen will prompt for a name. Give the application a memorable name.

Resource Pools

On CoreWeave Cloud, Resource Pools are groups of hardware selections plus memory requests and limits that make it simple to select resource groups for Determined AI deployments.

The Resource Pool configuration field corresponds to the resource_pool field in the Determined AI Kubernetes Deployment.

This saves users from having to patch each experiment with a spec or request resources on CoreWeave infrastructure individually. The setting may be overridden within the Deployment if needed.

Overriding it requires explicitly setting the following fields in your experiment configuration:

resources:
  resource_pool: <GPU_RESOURCE_POOL>

Example:

name: fashion_mnist_tf_keras_const
resources:
  resource_pool: A40
hyperparameters:
  global_batch_size: 32
  dense1: 128
records_per_epoch: 60000
searcher:
  name: single
  metric: val_accuracy
  smaller_is_better: false
  max_length:
    epochs: 5
entrypoint: model_def:FashionMNISTTrial
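To make the override explicit, the following sketch builds the same experiment configuration as a Python dictionary and sets `resources.resource_pool` in one place. The helper name `with_resource_pool` is hypothetical; the field names and values are taken from the example above.

```python
import copy
import json

# Experiment configuration from the example above, without a resource pool.
BASE_CONFIG = {
    "name": "fashion_mnist_tf_keras_const",
    "hyperparameters": {"global_batch_size": 32, "dense1": 128},
    "records_per_epoch": 60000,
    "searcher": {
        "name": "single",
        "metric": "val_accuracy",
        "smaller_is_better": False,
        "max_length": {"epochs": 5},
    },
    "entrypoint": "model_def:FashionMNISTTrial",
}


def with_resource_pool(config: dict, pool: str) -> dict:
    """Return a copy of config with resources.resource_pool set to pool."""
    updated = copy.deepcopy(config)
    updated.setdefault("resources", {})["resource_pool"] = pool
    return updated


config = with_resource_pool(BASE_CONFIG, "A40")
# The JSON printed here is also valid YAML, so it can be saved as a config file.
print(json.dumps(config, indent=2))
```

Assuming the Determined CLI is installed and configured, a file produced this way could then be submitted with `det experiment create <config file> <model directory>`.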

At this time, the following Resource Pools are available. Each corresponds to a GPU type, and each has 8 GPUs per node.

Note

For more information on Node Types, see Node Types.

| Resource Pool name | Hardware type | CPU amount | Memory |
| --- | --- | --- | --- |
| A40 | NVIDIA A40 | 64 | 512 Gi |
| RTX_A5000 | RTX A5000 | 32 | 200 Gi |
| RTX_A6000 | RTX A6000 | 32 | 200 Gi |
| A100_NVLINK | A100 HGX | 96 | 768 Gi |
| A100_NVLINK_80GB | A100 HGX | 96 | 768 Gi |
| H100_NVLINK_80GB | H100 HGX | 96 | 768 Gi |

This tutorial uses the Resource Pool A40.
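When choosing a pool programmatically, the table above can be treated as a small lookup structure. The sketch below copies the table values into Python and filters pools by minimum CPU and memory; the `ResourcePool` class and the `pools_with_at_least` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ResourcePool:
    name: str
    hardware: str
    cpus: int
    memory_gi: int


# Values copied from the Resource Pool table above.
RESOURCE_POOLS = [
    ResourcePool("A40", "NVIDIA A40", 64, 512),
    ResourcePool("RTX_A5000", "RTX A5000", 32, 200),
    ResourcePool("RTX_A6000", "RTX A6000", 32, 200),
    ResourcePool("A100_NVLINK", "A100 HGX", 96, 768),
    ResourcePool("A100_NVLINK_80GB", "A100 HGX", 96, 768),
    ResourcePool("H100_NVLINK_80GB", "H100 HGX", 96, 768),
]


def pools_with_at_least(cpus: int, memory_gi: int) -> List[str]:
    """Return the names of pools that meet the given CPU and memory minimums."""
    return [
        pool.name
        for pool in RESOURCE_POOLS
        if pool.cpus >= cpus and pool.memory_gi >= memory_gi
    ]


# Pools offering at least 64 CPUs and 512 Gi of memory per node.
print(pools_with_at_least(cpus=64, memory_gi=512))
```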

Configure Determined AI

In the Object Storage Configuration section, set your Object Storage bucket values, including the ACCESS_KEY and SECRET_KEY obtained above. Object storage is required if your experiment will store model checkpoints.
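For reference, Determined AI experiment configurations can also describe S3-compatible checkpoint storage in a `checkpoint_storage` section. The sketch below assembles such a section from the two credentials named above. It is an illustration only: the helper name `build_checkpoint_storage`, the bucket name, and the endpoint URL are placeholders, and reading the credentials from environment variables named `ACCESS_KEY` and `SECRET_KEY` is an assumption, not something the catalog form requires.

```python
import os


def build_checkpoint_storage(bucket: str, endpoint_url: str, env=os.environ) -> dict:
    """Build an S3-style checkpoint_storage section from stored credentials."""
    missing = [key for key in ("ACCESS_KEY", "SECRET_KEY") if not env.get(key)]
    if missing:
        raise ValueError("missing credentials: " + ", ".join(missing))
    return {
        "type": "s3",
        "bucket": bucket,
        "endpoint_url": endpoint_url,
        "access_key": env["ACCESS_KEY"],
        "secret_key": env["SECRET_KEY"],
    }


# Placeholder values for illustration only; substitute your own bucket details.
demo_env = {"ACCESS_KEY": "demo-access-key", "SECRET_KEY": "demo-secret-key"}
storage = build_checkpoint_storage(
    "my-checkpoints", "https://object-storage.example.com", demo_env
)
# Mask the credentials before printing so they never end up in logs.
print({k: ("***" if k.endswith("_key") else v) for k, v in storage.items()})
```

Failing early on missing credentials avoids discovering the problem only when the first checkpoint is written.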

Some values, such as a link to the cluster, may be important for certain applications. Those details can be found in the post-deployment notes after the application is running. If you need to access these notes again, navigate to the Applications tab, then click the Determined application tile.

At the bottom of the configuration screen, ensure that the newly created filesystem volume is attached.

Important

It is highly recommended to set the Data Center Region of the application to the same region in which the shared filesystem volume was deployed.

Finally, click Deploy to launch the application.

Access Determined AI

After the application is in a Ready state, navigate to the Ingress URL provided in the post-launch notes and use the login information provided.

Note

The client is configured to communicate with the server via the environment variable $DET_MASTER.
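Scripts that wrap the client can read the same variable. The following is a minimal sketch, assuming the address is supplied through `DET_MASTER`; the helper name `resolve_master` and the example address are hypothetical.

```python
import os


def resolve_master(env=os.environ) -> str:
    """Return the master address from DET_MASTER, or raise if it is unset."""
    master = env.get("DET_MASTER", "").strip()
    if not master:
        raise RuntimeError(
            "DET_MASTER is not set; use the Ingress URL from the post-launch notes"
        )
    # Drop a trailing slash so the address can be joined with API paths cleanly.
    return master.rstrip("/")


# Hypothetical address for illustration; use the Ingress URL of your deployment.
print(resolve_master({"DET_MASTER": "https://determined.example.com/"}))
```

Raising an error when the variable is unset, rather than falling back to a default address, keeps a script from silently talking to the wrong cluster.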

At the Determined AI home screen, you can launch a JupyterLab and subsequent Jupyter Notebooks, or perform model fine-tuning with GPT DeepSpeed, GPT-NeoX, or Hugging Face.

Important

The default username for the Determined application is admin, and there is no default password set. Make sure to add a password after logging into the application for the first time.

Additional reading

For more information about Determined AI, see:
