SkyPilot is an open-source framework designed to simplify, optimize, and unify the execution of AI, LLM, and batch workloads across a variety of compute infrastructure, including CoreWeave Kubernetes Service (CKS). SkyPilot abstracts the complexities of provisioning, scheduling, and managing underlying resources, allowing users to define their jobs once and then run them easily. SkyPilot provides useful features, such as the ability to SSH into containers, FUSE-mount and configure access to object storage, and easily monitor execution.
Overview
This guide shows how to install and configure SkyPilot for use with CKS. It covers the initial setup and walks through several common use cases, such as creating a development cluster, testing network performance, and deploying an inference service.

Prerequisites
Before you start, you'll need the following:

- A working CKS cluster with GPU Nodes
- Conda installed on your local machine
- Optional:
  - The ability to create CoreWeave AI Object Storage buckets and configure access permissions
  - The AWS CLI installed and configured on your local machine
Install SkyPilot
1. To install SkyPilot, run the following commands:

   It's recommended to use a new conda environment to avoid package conflicts.

   - CKS requires SkyPilot version 0.10.1 or later.
   - SkyPilot requires Python 3.7 to 3.13.
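The install step above can be sketched as follows; the environment name `sky` and the Python version are illustrative choices, not requirements beyond the ranges stated above:

```shell
# Create and activate a fresh conda environment to avoid package conflicts
conda create -y -n sky python=3.10
conda activate sky

# Install SkyPilot with Kubernetes support; CKS requires 0.10.1 or later
pip install "skypilot[kubernetes]>=0.10.1"
```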
2. To test whether your environment is configured correctly, run the following command:
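A likely candidate for this check, assuming the standard SkyPilot CLI, is `sky check`, which verifies that SkyPilot can reach your Kubernetes context:

```shell
# Verify that SkyPilot can access the CKS Kubernetes context
sky check kubernetes
```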
   You should see output similar to the following, where the allowed context is the name of your CoreWeave CKS cluster. For example, here it is `cks-usw04`:
Use SkyPilot to create a development cluster
To create a cluster with a specific container, we will use the simple-devpod.yaml configuration YAML. Full YAML specifications are described in the SkyPilot documentation. This example uses a development container from the CoreWeave ML team.
Launch a devpod
To launch a development cluster using the simple-devpod.yaml configuration YAML, complete the following steps:

1. Download the file and complete the following tasks:

   - Run `sky show-gpus` and note the GPU type in your context.
   - Update `accelerators` to your GPU type. The `accelerators` value must match the GPU shown in the output of `sky show-gpus`; SkyPilot won't schedule the job if the `accelerators` value and the `sky show-gpus` output do not match.
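The GPU check described above looks like this; the output will vary with your cluster's hardware:

```shell
# List the GPU types available in the allowed Kubernetes context
sky show-gpus
```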
2. Deploy the file by running the following command:
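Assuming the file was saved locally, the deploy step is a standard `sky launch`; the cluster name `devpod` is illustrative:

```shell
# Launch the devpod cluster from the downloaded YAML
sky launch -c devpod simple-devpod.yaml
```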
3. Monitor the progress by running the following command:
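One way to monitor the launch, assuming the illustrative cluster name `devpod` from the previous step:

```shell
# Check cluster state, then stream the logs of the most recent job
sky status
sky logs devpod
```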
4. See the status of your clusters by launching the SkyPilot dashboard.

From the dashboard, refer to the commands to `ssh` into your cluster, connect with VS Code or Cursor, or terminate the Pod. For example, to `ssh` into the Pod, run the following command:

To stop the development machine after a period of idle time (for instance, to automatically stop the cluster after five hours), you can run the following command:

For more information on starting and stopping a development server, see the Start a Development Cluster documentation.
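The dashboard, SSH, and autostop commands referenced above can be sketched as follows; the cluster name `devpod` is an assumption carried over from the launch step:

```shell
# Open the SkyPilot dashboard in your browser
sky dashboard

# SSH into the Pod (SkyPilot writes an SSH config entry named after the cluster)
ssh devpod

# Automatically stop the cluster after five idle hours (300 minutes)
sky autostop devpod -i 300
```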
Optional: Set up CoreWeave AI Object Storage and launch a devpod
To set up CoreWeave AI Object Storage, complete the following steps in the Cloud Console:
1. Get AI Object Storage access keys by following the instructions at Create Access Keys. Be sure to save the Key ID and Key Secret so you can access them later.
2. If you don't already have one, create an AI Object Storage organization access policy by following the instructions at Manage Organization Access Policies. You must have at least one organization access policy set before you can use AI Object Storage.
3. Create an AI Object Storage bucket by following the instructions at Create Buckets.
Configure your development environment for AI Object Storage
1. Create a separate CoreWeave profile in a specific location to avoid conflicts with other S3-compatible services. Configure your credentials by running the following commands:

   When prompted, enter your CoreWeave Object Storage credentials:
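A sketch of the credential setup, assuming the AWS CLI; the file paths and the profile name `coreweave` are illustrative:

```shell
# Keep CoreWeave credentials in dedicated files so other AWS profiles are untouched
export AWS_SHARED_CREDENTIALS_FILE=~/.aws/coreweave-credentials
export AWS_CONFIG_FILE=~/.aws/coreweave-config

# Enter the Key ID and Key Secret when prompted
aws configure --profile coreweave
```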
2. Configure the CoreWeave storage endpoint and set the default addressing style to virtual:

   Endpoint selection: We use https://cwobject.com so the bucket's endpoint is accessible from anywhere and uses secure HTTPS, which lets us upload local data. Always use http://cwlota.com if you do not need to upload local data to the bucket; the LOTA endpoint provides significantly faster access within CoreWeave's network. Refer to the LOTA documentation for more details. If you want to use LOTA, run the following commands:
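Assuming the same illustrative `coreweave` profile and an AWS CLI version that supports the `endpoint_url` setting, the endpoint and addressing style can be set like this (swap in http://cwlota.com as the endpoint if you choose LOTA):

```shell
# Point the profile at the CoreWeave endpoint and use virtual-hosted addressing
aws configure set endpoint_url https://cwobject.com --profile coreweave
aws configure set s3.addressing_style virtual --profile coreweave
```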
To test if your environment is configured correctly, run the following command:
You should see something like this, where your allowed context will be the name of your CoreWeave CKS cluster. For example, here, it is
cks-usw04:
In the mydevpod.yaml configuration, the storage bucket is mounted in the container at /my_data. The code that you place in the directory my-code will be copied to ~/sky_workdir on the container. We also install the AWS CLI and configure it to find configuration files and credentials in the relevant CoreWeave locations, enabling AWS S3 API access to CoreWeave storage. Note that this is only for convenience; to access files using LOTA cache acceleration, use the LOTA endpoint and S3 interface.

Launch a devpod with storage configured
To launch a development cluster with storage configured, use the mydevpod.yaml configuration YAML and complete the following steps:

1. Download the file and modify the `accelerators` and `source` fields:

   - Run `sky show-gpus` and note the GPU type in your context.
   - Update `accelerators` to your GPU type.
   - Update `source` to point at your AI Object Storage bucket.
   - Be sure to create a directory on your local machine called my-code.
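Based on the fields described above, the relevant sections of mydevpod.yaml plausibly look like the following sketch; the bucket name, GPU type, and paths are all illustrative assumptions:

```yaml
# Illustrative sketch; bucket name, GPU type, and paths are assumptions
workdir: ./my-code          # copied to ~/sky_workdir on the container
resources:
  accelerators: H100:1      # set to a GPU type from `sky show-gpus`
file_mounts:
  /my_data:
    source: s3://my-bucket  # your CoreWeave AI Object Storage bucket
    mode: MOUNT
```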
2. Deploy the file by running the following command:
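Assuming the edited file is saved locally, the deploy step is again a `sky launch`; the cluster name `mydevpod` is illustrative:

```shell
# Launch the storage-configured devpod
sky launch -c mydevpod mydevpod.yaml
```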
3. Monitor the progress by running the following command:
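One way to monitor the launch, assuming the illustrative cluster name `mydevpod`:

```shell
# Stream the logs of the most recent job on the cluster
sky logs mydevpod
```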
4. See the status of your clusters by launching the SkyPilot dashboard.
For more information on starting and stopping a development server, see the Start a Development Cluster documentation.
Test network performance
CoreWeave network fabric is optimized for high-performance workloads, and CoreWeave InfiniBand support for SkyPilot is configured automatically by specifying the `network_tier: best` configuration option.
To test network performance for collective operations, you can use the CoreWeave-specific SkyPilot example as a framework. This example deploys one of the CoreWeave example images that is configured with tested drivers and software to be a base for any HPC application. For more information on CoreWeave NCCL tests, see the nccl-test GitHub repository.
Deploy an inference service
You can deploy an inference service on Kubernetes using SkyPilot. The following example uses a CoreWeave configuration YAML for vLLM.

1. To launch the inference service, download the vllm.yaml file and modify the `accelerators` field to match your GPU type:
2. Launch the service by running the following command:
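Assuming the service is deployed with SkyServe, the launch looks like this; the service name `vllm` is illustrative:

```shell
# Deploy the inference service with SkyServe
sky serve up -n vllm vllm.yaml
```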
3. To obtain the service endpoint, run the following command:
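Under the SkyServe assumption above, the endpoint can be retrieved like this:

```shell
# Print the endpoint of the service (name "vllm" is illustrative)
sky serve status vllm --endpoint
```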
4. To access the service, run the following command:
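One way to query the service, assuming vLLM exposes its usual OpenAI-compatible API and the endpoint command above:

```shell
# Capture the endpoint and list the models the server exposes
ENDPOINT=$(sky serve status vllm --endpoint)
curl http://$ENDPOINT/v1/models
```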
5. To scale the inference service, use the `sky serve` command:
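One way to scale, assuming the replica count lives under the `service:` section of vllm.yaml and the illustrative service name `vllm`:

```shell
# After editing `replicas` under the `service:` section of vllm.yaml,
# apply the updated spec to the running service
sky serve update vllm vllm.yaml
```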
Deploy a multinode training job
To launch a multinode training job, you can use the CoreWeave distributed training configuration YAML that's based on the SkyPilot Distributed Training with PyTorch example. The CoreWeave example specifies `network_tier: best`, which automatically configures optimal InfiniBand support, and it is configured with tested drivers and software as a base for high-performance training jobs.
When using the distributed_training.yaml, be sure to change the `accelerators` field to match your GPU type:
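Assuming the edited file is saved locally, a launch sketch; the cluster name `train` is illustrative:

```shell
# Launch the multinode training job across the cluster
sky launch -c train distributed_training.yaml
```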