Run SkyPilot on CKS

Learn how to install and run SkyPilot on CKS

SkyPilot is an open-source framework designed to simplify, optimize, and unify the execution of AI, LLM, and batch workloads across a variety of compute infrastructure, including CoreWeave Kubernetes Service (CKS).

SkyPilot abstracts the complexities of provisioning, scheduling, and managing the underlying resources, allowing users to define their jobs once and then run them easily. SkyPilot also provides useful features, such as the ability to SSH into containers, FUSE-mount and configure access to object storage, and easily monitor execution.

Overview​

This guide shows how to install and configure SkyPilot for use with CKS. It covers the initial setup and walks through several common use cases, such as creating a development cluster, testing network performance, and deploying an inference service.

Prerequisites​

Before you start, you'll need the following:

Install SkyPilot​

  1. To install SkyPilot, run the following commands:

    Example
    $ conda create -y -n sky python=3.10
    $ conda activate sky
    $ pip install "skypilot[coreweave]"

    It's recommended to use a new conda environment to avoid package conflicts.

    Note

    SkyPilot requires Python 3.7 to 3.13.

  2. To test if your environment is configured correctly, run the following command:

    Example
    $ sky check

    You should see output like the following, where the allowed context is the name of your CoreWeave CKS cluster (in this example, cks-usw04):

    Example
    πŸŽ‰ Enabled infra πŸŽ‰
    CoreWeave [storage]
    Kubernetes [compute]
    Allowed contexts:
    └── cks-usw04

For full installation instructions and directions for installing from source, see SkyPilot's installation documentation.

Use SkyPilot to create a development cluster​

To create a cluster with a specific container, we will use the simple-devpod.yaml configuration YAML. Full YAML specifications are described in the SkyPilot documentation.

Launch a devpod​

To launch a development cluster using the simple-devpod.yaml configuration YAML, complete the following steps:

  1. Download the file and complete the following tasks:

    • Run sky show-gpus and note the GPU type in your context.
    • Update accelerators to your GPU type.
    Example
    resources:
      # Modify the values below to request different resources
      accelerators: H100_NVLINK_80GB:1 # Use 1 H100; change to your GPU type
      # You can use your own container, but CoreWeave images are recommended because they are optimized for networking
      image_id: docker:ghcr.io/coreweave/ml-containers/nightly-torch-extras:8b6c417-base-25110205-cuda12.9.1-ubuntu22.04-torch2.10.0a0-vision0.25.0a0-audio2.10.0a0
      memory: 32+ # Request at least 32GB of RAM

    Note that the accelerators value must match a GPU type shown in the output of sky show-gpus; SkyPilot won't schedule the job if the values don't match.
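    For context, a complete devpod task file typically also includes setup and run sections alongside resources. The following is an illustrative sketch only, not the actual contents of simple-devpod.yaml; the setup and run commands are placeholders:

    ```yaml
    # Illustrative sketch of a SkyPilot devpod task file.
    # The actual simple-devpod.yaml from CoreWeave may differ.
    resources:
      accelerators: H100_NVLINK_80GB:1 # Change to a GPU type from `sky show-gpus`
      image_id: docker:ghcr.io/coreweave/ml-containers/nightly-torch-extras:8b6c417-base-25110205-cuda12.9.1-ubuntu22.04-torch2.10.0a0-vision0.25.0a0-audio2.10.0a0
      memory: 32+ # Request at least 32GB of RAM

    setup: |
      # Runs once when the cluster is provisioned
      pip install --upgrade pip

    run: |
      # Placeholder; a devpod typically just stays up for interactive use
      echo "Devpod ready"
    ```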

  2. Deploy the file by running the following command:

    Example
    $ sky launch -c mysimpledevpod simple-devpod.yaml
    Tip

    If you modify any configuration settings, or if you run into problems (for example, SkyPilot cannot find the bucket), try stopping and starting the API server and then re-running the launch command:

    Example
    $ sky api stop
    $ sky api start
    $ sky check coreweave kubernetes
    $ sky launch -c mysimpledevpod simple-devpod.yaml
  3. Monitor the progress by running the following command:

    Example
    $ sky logs --provision mysimpledevpod
  4. See the status of your clusters by launching the SkyPilot dashboard.

    Example
    $ sky dashboard

    From the dashboard, you can find the commands to SSH into your cluster, connect with VS Code or Cursor, or terminate the Pod.

    For example, to SSH into the Pod, run the following command:

    Example
    $ ssh mysimpledevpod

    To stop the development machine after a period of idle time (for instance, to automatically stop the cluster after 300 minutes, or five hours, of inactivity), run the following command:

    Example
    $ sky autostop -i 300 mysimpledevpod
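    Beyond autostop, a few other SkyPilot lifecycle commands are useful for managing the devpod. This is a sketch using the mysimpledevpod name from above; it requires a running SkyPilot setup:

    ```shell
    # Check the status of all of your clusters
    sky status

    # Stop the cluster manually (it can be restarted later)
    sky stop mysimpledevpod

    # Restart a stopped cluster
    sky start mysimpledevpod

    # Tear the cluster down entirely when you are done with it
    sky down mysimpledevpod
    ```
    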

    For more information on starting and stopping a development server, see the Start a Development Cluster documentation.

Optional: Set up CoreWeave AI Object Storage and launch a devpod​


To set up CoreWeave AI Object Storage, complete the following steps in the Cloud Console:

  1. Get AI Object Storage access keys by following the instructions at Create Access Keys. Be sure to save the Key Id and Key Secret so you can access them later.

  2. If you don't already have one, create an AI Object Storage organization access policy by following the instructions at Manage Organization Access Policies. You must have at least one organization access policy set before you can use AI Object Storage.

  3. Create an AI Object Storage bucket by following the instructions at Create Buckets.

Configure your development environment for AI Object Storage

  1. Create a separate CoreWeave profile in a specific location to avoid conflicts with other S3-compatible services. Configure your credentials by running the following commands:

    Example
    $ AWS_SHARED_CREDENTIALS_FILE=~/.coreweave/cw.credentials aws configure --profile cw

    When prompted, enter your CoreWeave Object Storage credentials:

    Example
    AWS Access Key ID [None]: <YOUR_ACCESS_KEY_ID>
    AWS Secret Access Key [None]: <YOUR_SECRET_ACCESS_KEY>
    Default region name [None]:
    Default output format [None]: json
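    If you prefer not to use the interactive prompt, you can write the credentials file directly in the standard AWS shared-credentials format. The key values below are placeholders; substitute your own:

    ```shell
    # Create the dedicated CoreWeave configuration directory
    mkdir -p ~/.coreweave

    # Write a [cw] profile to the CoreWeave-specific credentials file
    cat > ~/.coreweave/cw.credentials <<'EOF'
    [cw]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
    EOF
    ```
    
    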
  2. Configure the CoreWeave storage endpoint and set the default addressing style to virtual:

    Example
    $ AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set endpoint_url https://cwobject.com --profile cw
    $ AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set s3.addressing_style virtual --profile cw
    Endpoint selection

    We use https://cwobject.com so the bucket's endpoint is accessible from anywhere over secure HTTPS, which lets us upload local data.

    Use http://cwlota.com instead if you do not need to upload local data to the bucket; the LOTA endpoint provides significantly faster access within CoreWeave's network. Refer to the LOTA documentation for more details.

    If you want to use LOTA, run the following commands:

    Example
    $ AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set endpoint_url http://cwlota.com --profile cw
    $ AWS_CONFIG_FILE=~/.coreweave/cw.config aws configure set s3.addressing_style virtual --profile cw
  3. To test if your environment is configured correctly, run the following command:

    Example
    $ sky check

    You should see output like the following, where the allowed context is the name of your CoreWeave CKS cluster (in this example, cks-usw04):

    Example
    πŸŽ‰ Enabled infra πŸŽ‰
    CoreWeave [storage]
    Kubernetes [compute]
    Allowed contexts:
    └── cks-usw04

In this example, the CoreWeave AI Object Storage bucket that you specify is FUSE-mounted at /my_data. The code that you place in the my-code directory is copied to ~/sky_workdir in the container.

We also install the AWS CLI and configure it to find configuration files and credentials in the relevant CoreWeave locations, enabling AWS S3 API access to CoreWeave storage. Note that this is only for convenience; to access files using LOTA cache acceleration, use the LOTA endpoint and S3 interface.
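Before launching, you can verify that the profile and endpoint work by listing your bucket with the AWS CLI. This sketch assumes a recent AWS CLI that honors endpoint_url in the config file, and uses the example bucket name skypilot; substitute your own:

```shell
# List the bucket contents through the configured CoreWeave endpoint,
# using the cw profile from the dedicated credentials and config files
AWS_SHARED_CREDENTIALS_FILE=~/.coreweave/cw.credentials \
AWS_CONFIG_FILE=~/.coreweave/cw.config \
aws s3 ls s3://skypilot --profile cw
```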

Launch a devpod with storage configured​

To launch a development cluster with storage configured, use the mydevpod.yaml configuration YAML and complete the following steps:

  1. Download the file and modify the accelerators and source fields:

    • Run sky show-gpus and note the GPU type in your context.
    • Update accelerators to your GPU type.
    Example
    resources:
      # Modify the values below to request different resources
      accelerators: H100_NVLINK_80GB:1 # Use 1 H100; change to your GPU type
      image_id: docker:ghcr.io/coreweave/ml-containers/nightly-torch-extras:8b6c417-base-25110205-cuda12.9.1-ubuntu22.04-torch2.10.0a0-vision0.25.0a0-audio2.10.0a0
      memory: 32+ # Request at least 32GB of RAM
    file_mounts:
      /my_data: # Mount the storage bucket to /my_data in the container
        source: cw://skypilot # Change this to your bucket name
        mode: MOUNT # MOUNT, COPY, or MOUNT_CACHED. Defaults to MOUNT. Optional.
    # Sync data in my-code/ on the local machine to ~/sky_workdir in the container
    workdir: ./my-code # Be sure this directory exists on your local machine

    Be sure to create a directory on your local machine called my-code.

  2. Deploy the file by running the following command:

    Example
    $ sky launch -c mydevpod mydevpod.yaml
    Tip

    If you run into problems (for example, SkyPilot cannot find the bucket), try stopping and starting the API server and then re-running the launch command:

    Example
    $ sky api stop
    $ sky api start
    $ sky check coreweave kubernetes
    $ sky launch -c mydevpod mydevpod.yaml
  3. Monitor the progress by running the following command:

    Example
    $ sky logs --provision mydevpod
  4. See the status of your clusters by launching the SkyPilot dashboard.

    Example
    $ sky dashboard

    For more information on starting and stopping a development server, see the Start a Development Cluster documentation.

Test network performance​

The CoreWeave network fabric is optimized for high-performance workloads, and CoreWeave InfiniBand support for SkyPilot is configured automatically when you specify the network_tier: best configuration option.
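For example, a resources section requesting the best network tier might look like the following sketch; the GPU type and count are illustrative:

```yaml
resources:
  accelerators: H100_NVLINK_80GB:8 # Change to your GPU type and count
  network_tier: best               # Automatically configures InfiniBand support
```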

To test network performance for collective operations, you can use the CoreWeave-specific SkyPilot example as a framework. This example deploys one of the CoreWeave example images that is configured with tested drivers and software to be a base for any HPC application. For more information on CoreWeave NCCL tests, see the nccl-test GitHub repository.

Deploy an inference service​

You can deploy an inference service on Kubernetes using SkyPilot. The following example uses a CoreWeave configuration YAML for vLLM.

  1. To launch the inference service, download the vllm.yaml file and modify the accelerators field to match your GPU type:

    Example
    accelerators: H100_NVLINK_80GB:1 # Use 1 H100; change to your GPU type
  2. Launch the service by running the following command:

    Example
    $ sky launch -c vllm-test vllm.yaml
  3. To obtain the service endpoint, run the following command:

    Example
    $ ENDPOINT=$(sky status --endpoint 8000 vllm-test)
  4. To access the service, run the following command:

    Example
    $ curl http://${ENDPOINT}/v1/completions -H "Content-Type: application/json" \
        -d '{
          "model": "facebook/opt-125m",
          "prompt": "Once upon a time, there lived a princess who",
          "max_tokens": 20
        }'
  5. To scale the inference service, use the sky serve command:

    Example
    $ sky serve up -n vllm-serve vllm.yaml
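Once the service is up, sky serve provides its own status and endpoint commands. This is a sketch assuming the vllm-serve name used above and a running service:

```shell
# Check replica status for the managed service
sky serve status vllm-serve

# Retrieve the service endpoint and list the models it serves
ENDPOINT=$(sky serve status --endpoint vllm-serve)
curl http://${ENDPOINT}/v1/models
```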

Many other serving examples are available in the SkyPilot documentation.

Deploy a multinode training job

To launch a multinode training job, you can use the CoreWeave distributed training configuration YAML that's based on the SkyPilot Distributed Training with Pytorch example.

The CoreWeave example specifies network_tier: best, which automatically configures optimal InfiniBand support, and uses an image configured with tested drivers and software as a base for high-performance training jobs.

When using the distributed_training.yaml, be sure to change the accelerators field to match your GPU type:

Example
accelerators: H100_NVLINK_80GB:1 # Use 1 H100; change to your GPU type

Set up a production API server​

All of the examples above assume clusters are isolated to individual users. If your team decides to use SkyPilot to share resources, follow the instructions to create a SkyPilot API Server.

Next steps​

To learn more about SkyPilot, see the SkyPilot documentation.