ssh into containers, fuse mount and configure access to object storage, and monitor execution.
Overview
This guide shows how to install and configure SkyPilot for use with CKS. It covers the initial setup and walks through several common use cases. Examples include creating a development cluster, testing network performance, and deploying an inference service.
Prerequisites
Before you start, you’ll need the following:
- A working CKS cluster with GPU Nodes.
- Conda installed on your local machine.
- Optional: The ability to create CoreWeave AI Object Storage buckets and configure access permissions.
- Optional: The AWS CLI installed and configured on your local machine.
Install SkyPilot
Install the SkyPilot CLI in an isolated Conda environment, then verify that SkyPilot can detect your CKS cluster.
- Create an isolated Conda environment and install the SkyPilot CLI with CoreWeave support. The isolated environment avoids package conflicts.
  - CKS requires SkyPilot version 0.10.1 or later.
  - SkyPilot requires Python 3.7 to 3.13.
- Verify that SkyPilot detects your CKS cluster. The output lists your allowed contexts, which are the names of your CoreWeave CKS clusters. In this example, the context is cks-usw04.
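The two installation steps above can be sketched as a pair of shell helpers. This is a minimal sketch, not the guide's original commands: the environment name sky, Python 3.11, and the coreweave pip extra are assumptions (the guide only says to install the CLI "with CoreWeave support"). The helpers use conda run rather than conda activate so they also work in non-interactive shells.

```shell
# Assumed values; adjust them to your setup.
ENV_NAME="sky"                               # hypothetical Conda environment name
PYTHON_VERSION="3.11"                        # any version in the supported 3.7 to 3.13 range
SKYPILOT_SPEC="skypilot[coreweave]>=0.10.1"  # extra name is an assumption; CKS needs 0.10.1 or later

# Create the isolated environment and install the SkyPilot CLI into it.
install_skypilot() {
  conda create -y -n "$ENV_NAME" "python=$PYTHON_VERSION" &&
  conda run -n "$ENV_NAME" pip install -U "$SKYPILOT_SPEC"
}

# Ask SkyPilot to check Kubernetes access; the output should list the allowed contexts.
verify_cks() {
  conda run -n "$ENV_NAME" sky check kubernetes
}
```

Run install_skypilot once, then verify_cks. If the check doesn't list your CKS context, confirm that your kubeconfig points at the cluster before continuing.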
Use SkyPilot to create a development cluster
This section walks through launching an interactive development Pod (“devpod”) on CKS so you can iterate on code in a GPU-backed container. To create a cluster with a specific container, use the simple-devpod.yaml configuration YAML. Full YAML specifications are described in the SkyPilot YAML specification reference. This example uses a development container from the CoreWeave ML team.
Launch a devpod
To launch a development cluster using the simple-devpod.yaml configuration YAML, complete the following steps:
- Download the file and complete the following tasks:
  - Run sky show-gpus and note the GPU type in your context.
  - Update accelerators to your GPU type.
  The accelerators value must match the GPU shown in the output of sky show-gpus. SkyPilot won’t schedule the job if the two values don’t match.
- Deploy the configuration to launch the devpod.
- Stream the provisioning logs to monitor progress.
- Launch the SkyPilot dashboard to see the status of your clusters. From the dashboard, use the commands to ssh into your cluster, connect with VSCode or Cursor, or terminate the Pod. You can also set the development machine to stop automatically after a period of idle time, for example, after five hours.
For more information about starting and stopping a development server, see SkyPilot’s interactive development guide.
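The launch steps above could look roughly like the following. This is a hedged sketch rather than the contents of simple-devpod.yaml: the cluster name, GPU type, container image, and the infra: k8s setting are all assumptions, and the --provision flag of sky logs is assumed to be available in your SkyPilot version.

```shell
# Hypothetical values; replace them with your own.
CLUSTER_NAME="mydevpod"
ACCELERATORS="H100:8"            # must match a GPU type reported by sky show-gpus
IMAGE_ID="docker:ubuntu:22.04"   # placeholder; the guide uses a CoreWeave ML team container
IDLE_MINUTES=300                 # five hours

# Write a minimal task YAML. This is a sketch, not the simple-devpod.yaml file itself.
cat > devpod-sketch.yaml <<EOF
name: ${CLUSTER_NAME}
resources:
  infra: k8s
  accelerators: ${ACCELERATORS}
  image_id: ${IMAGE_ID}
EOF

# Deploy the configuration to launch the devpod.
launch_devpod() {
  sky launch -y -c "$CLUSTER_NAME" devpod-sketch.yaml
}

# Stream the provisioning logs to monitor progress.
stream_provision_logs() {
  sky logs --provision "$CLUSTER_NAME"
}

# Stop the devpod automatically after IDLE_MINUTES of idle time.
set_autostop() {
  sky autostop -y "$CLUSTER_NAME" -i "$IDLE_MINUTES"
}
```

After launch_devpod succeeds, sky dashboard opens the dashboard and ssh mydevpod connects to the Pod, using the cluster name as the SSH host. Where the infrastructure can't stop instances, sky autostop also accepts --down to tear the cluster down after the idle period instead.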
Optional: Set up CoreWeave AI Object Storage and launch a devpod
Optional: Launch a devpod with storage configured
To set up CoreWeave AI Object Storage, complete the following steps in the Cloud Console:
- Get AI Object Storage access keys by following the instructions to create access keys. Be sure to save the Key Id and Key Secret so you can access them later.
- If you don’t already have one, create an AI Object Storage organization access policy by following the instructions to manage organization access policies. You must have at least one organization access policy set before you can use AI Object Storage.
- Create an AI Object Storage bucket by following the instructions to create a bucket.
Configure your development environment for AI Object Storage
- Create a separate CoreWeave profile in a specific location to avoid conflicts with other S3-compatible services. When prompted, enter your CoreWeave Object Storage credentials. Replace [ACCESS-KEY-ID] with your access key ID and [SECRET-ACCESS-KEY] with your secret access key.
- Configure the CoreWeave storage endpoint and set the default addressing style to virtual.
  Endpoint selection
  Use https://cwobject.com so the bucket’s endpoint is accessible from anywhere and uses secure HTTPS. This endpoint supports uploading local data. Use http://cwlota.com if you don’t need to upload local data to the bucket. The LOTA endpoint provides faster access within CoreWeave’s network. For more information, see About LOTA. To configure the LOTA endpoint, set the endpoint to http://cwlota.com instead.
- Verify that SkyPilot detects your CKS cluster. The output lists your allowed contexts, which are the names of your CoreWeave CKS clusters. In this example, the context is cks-usw04.
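The profile and endpoint steps above can be sketched with the AWS CLI as follows. The directory ~/.coreweave, the file names cw.credentials and cw.config, and the profile name cw are assumptions, since the guide's own values aren't shown here. The function body runs in a subshell so the AWS_* variables don't leak into your other AWS CLI sessions.

```shell
# Assumed locations and profile name; adjust them to match your setup.
CW_DIR="$HOME/.coreweave"
CW_PROFILE="cw"
CW_ENDPOINT="https://cwobject.com"   # or http://cwlota.com if you don't upload local data

# Parentheses make the body a subshell, keeping the exports local to this function.
configure_cw_profile() (
  mkdir -p "$CW_DIR" || exit 1
  export AWS_SHARED_CREDENTIALS_FILE="$CW_DIR/cw.credentials"
  export AWS_CONFIG_FILE="$CW_DIR/cw.config"
  # The first command prompts for the access key ID and secret access key.
  aws configure --profile "$CW_PROFILE" &&
  aws configure set endpoint_url "$CW_ENDPOINT" --profile "$CW_PROFILE" &&
  aws configure set s3.addressing_style virtual --profile "$CW_PROFILE"
)
```

To switch to LOTA later, set CW_ENDPOINT to http://cwlota.com and rerun only the endpoint_url command with the same two environment variables set.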
/my_data. The code in the my-code directory is copied to ~/sky_workdir on the container.
This example also installs the AWS CLI and configures it to find configuration files and credentials in the relevant CoreWeave locations, enabling AWS S3 API access to CoreWeave storage. This is only for convenience. To access files using LOTA cache acceleration, use the LOTA endpoint and S3 interface.
Launch a devpod with storage configured
To launch a development cluster with storage configured, use the mydevpod.yaml configuration YAML and complete the following steps:
- Download the file and modify the accelerators and source fields:
  - Run sky show-gpus and note the GPU type in your context.
  - Update accelerators to your GPU type.
  Be sure to create a directory on your local machine called my-code.
- Deploy the configuration to launch the devpod.
- Stream the provisioning logs to monitor progress.
- Launch the SkyPilot dashboard to see the status of your clusters.
For more information about starting and stopping a development server, see SkyPilot’s interactive development guide.
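As a rough illustration of the two fields the steps above ask you to edit, the following sketch writes a task YAML with a bucket mounted at /my_data and my-code as the working directory. It is not the mydevpod.yaml file: the bucket URI (including the cw:// scheme), the GPU type, and the infra: k8s setting are assumptions to replace with your own values.

```shell
# Hypothetical values; replace them with your own.
BUCKET_URI="cw://my-bucket"   # assumed URI scheme and bucket name
ACCELERATORS="H100:8"         # must match a GPU type reported by sky show-gpus

# The working directory must exist locally before launch.
mkdir -p my-code

# Write a minimal task YAML. This is a sketch, not the mydevpod.yaml file itself.
cat > mydevpod-sketch.yaml <<EOF
name: mydevpod
workdir: ./my-code            # copied to ~/sky_workdir on the container
resources:
  infra: k8s
  accelerators: ${ACCELERATORS}
file_mounts:
  /my_data:
    source: ${BUCKET_URI}
    mode: MOUNT
EOF
```

The file can then be launched in the same way as any other task YAML, for example with sky launch -c mydevpod mydevpod-sketch.yaml.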
Test network performance
After your devpod is running, you can validate that the high-performance network fabric is working as expected before scaling out to multi-Node workloads. The CoreWeave network fabric is optimized for high-performance workloads, and CoreWeave InfiniBand support for SkyPilot is configured automatically when you specify the network_tier: best configuration option.
To test network performance for collective operations, you can use the CoreWeave-specific SkyPilot example as a framework. This example deploys one of the CoreWeave example images that is configured with tested drivers and software to be a base for any HPC application. For more information about CoreWeave NCCL tests, see the nccl-test GitHub repository.
Deploy an inference service
This section shows how to move from interactive development to serving a model behind an HTTP endpoint using SkyPilot’s launch and serve workflows on Kubernetes. The following example uses a CoreWeave configuration YAML for vllm.
- To launch the inference service, download the vllm.yaml file and modify the accelerators field to match your GPU type.
- Launch the service.
- Capture the service endpoint in an environment variable.
- Send a test request to the service.
- To scale the inference service, use the sky serve command.
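The steps above might be scripted along these lines. This is a sketch under several assumptions: the cluster and service name vllm, port 8000 (vLLM's default), and a test request against the OpenAI-compatible /v1/models route, which avoids guessing a model name. It also assumes vllm.yaml exposes that port and, for sky serve, includes a service section.

```shell
# Hypothetical values; replace them with your own.
SERVICE_NAME="vllm"
PORT=8000   # assumed; must match the port exposed in vllm.yaml

# Launch the service as a SkyPilot cluster.
launch_service() {
  sky launch -y -c "$SERVICE_NAME" vllm.yaml
}

# Print the endpoint for the exposed port, adding a scheme if none is returned.
get_endpoint() {
  endpoint="$(sky status --endpoint "$PORT" "$SERVICE_NAME")" || return 1
  case "$endpoint" in
    http://*|https://*) echo "$endpoint" ;;
    *) echo "http://$endpoint" ;;
  esac
}

# Send a test request that lists the models the server is hosting.
test_request() {
  ENDPOINT="$(get_endpoint)" && curl -sS "$ENDPOINT/v1/models"
}

# Start a scalable service from the same YAML with SkyServe.
serve_up() {
  sky serve up -y -n "$SERVICE_NAME" vllm.yaml
}
```

Call launch_service, then test_request once the server reports ready. For a service started with serve_up, sky serve status vllm shows the replicas and the service endpoint.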
Deploy a multinode training job
For training workloads that span multiple Nodes, SkyPilot can launch a distributed job that takes advantage of CoreWeave’s InfiniBand fabric. To launch a multinode training job, you can use the CoreWeave distributed training configuration YAML that’s based on the SkyPilot distributed training with PyTorch example. The CoreWeave example specifies network_tier: best, which automatically configures optimal InfiniBand support. The example is also configured with tested drivers and software as a base for high-performance training jobs.
When using distributed_training.yaml, be sure to change the accelerators field to match your GPU type.
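As an alternative to editing the file, sky launch can override the accelerators and Node count from the command line. The following is a minimal sketch in which the cluster name, GPU type, and Node count are hypothetical values.

```shell
# Hypothetical values; replace them with your own.
CLUSTER_NAME="train"
ACCELERATORS="H100:8"   # must match a GPU type reported by sky show-gpus
NUM_NODES=2

# Launch the distributed job, overriding the YAML's accelerators and num_nodes.
launch_training() {
  sky launch -y -c "$CLUSTER_NAME" \
    --gpus "$ACCELERATORS" \
    --num-nodes "$NUM_NODES" \
    distributed_training.yaml
}
```

Command-line overrides leave the downloaded YAML unchanged, which makes it easier to reuse the same file across clusters with different GPU types.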