SkyPilot is an open-source framework that provides a unified interface for running AI workloads across any infrastructure, including Slurm and SUNK. By using SkyPilot on SUNK, you define your workload in a YAML file and let SkyPilot handle the sbatch submission, container lifecycle management through Pyxis and enroot, and multi-node coordination.
This guide covers installing SkyPilot, running containerized GPU workloads, using SkyPilot for interactive development as an alternative to salloc, and managing jobs on your SUNK cluster.
Consider using SkyPilot on SUNK if any of the following apply:
- You want a simpler interface for submitting GPU workloads without writing sbatch scripts directly.
- You need to run containerized workloads across multiple nodes and want SkyPilot to handle distributed setup.
- You want an salloc-like interactive development workflow with persistent sessions you can SSH into.
Prerequisites
Before completing the steps in this guide, be sure you have the following:
- A SUNK cluster running SUNK v7.x or later with enroot 4.0.1+ and Pyxis v0.21.0+.
- SSH access to your Slurm login node.
- Familiarity with Slurm commands, such as srun, sbatch, and squeue.
- Python 3.8 or later installed locally.
Install SkyPilot
Install SkyPilot with the Slurm plugin.
Configure SkyPilot for your SUNK cluster
- Create the SkyPilot Slurm SSH configuration file, ~/.slurm/config.
- Add your SUNK login node to ~/.slurm/config. The entry uses the following values:
  - [CLUSTER-NAME]: A name for your cluster, for example my-sunk-cluster.
  - [LOGIN-NODE-IP]: Your login node’s external IP address. To find it, run kubectl get svc slurm-login -n tenant-slurm.
  - [USERNAME]: Your Slurm username.
- Start the SkyPilot API server and verify your configuration. You should see output confirming that the configuration is valid.
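As a sketch of the SSH configuration steps above, the following shell snippet creates ~/.slurm/config if needed and appends one entry for the login node. It assumes the file uses OpenSSH client configuration syntax (Host, HostName, User); check the SkyPilot documentation if your version expects something different. The alias my-sunk-cluster, the address 203.0.113.10 (a documentation-only IP), and the username your-username are placeholders standing in for [CLUSTER-NAME], [LOGIN-NODE-IP], and [USERNAME].

```shell
# Placeholder values; replace them with your own before running.
SLURM_SSH_CONFIG="$HOME/.slurm/config"
CLUSTER_NAME="my-sunk-cluster"   # [CLUSTER-NAME]
LOGIN_NODE_IP="203.0.113.10"     # [LOGIN-NODE-IP]
SLURM_USER="your-username"       # [USERNAME]

# Create the configuration file if it does not exist yet.
mkdir -p "$(dirname "$SLURM_SSH_CONFIG")"
touch "$SLURM_SSH_CONFIG"

# Append the entry only if this Host alias is not already present.
if ! grep -q "^Host ${CLUSTER_NAME}\$" "$SLURM_SSH_CONFIG"; then
  cat >> "$SLURM_SSH_CONFIG" <<EOF
Host ${CLUSTER_NAME}
    HostName ${LOGIN_NODE_IP}
    User ${SLURM_USER}
EOF
fi
```

Depending on how you authenticate, you may also need standard SSH options such as IdentityFile in the same entry.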
Containerized workloads
SkyPilot uses the image_id field in your YAML to specify a container image. Under the hood, this translates to the Pyxis --container-image and --container-name flags on the sbatch and srun commands. You do not interact with Pyxis or enroot directly.
Run a single-node GPU task
Create a YAML file that specifies your resources and container image. The file uses the following placeholders:
- [GPU-TYPE]: The GPU type available on your cluster, for example H100 or B200.
- [GPU-COUNT]: The number of GPUs to request.

| Field | Description |
|---|---|
| cloud: slurm | Targets your SUNK cluster. |
| accelerators | GPU type and count. Must match a GPU type available on your cluster. Use sinfo on the login node to check available partitions. |
| image_id | The container image. Prefix with docker: for registry images. Supports Docker Hub, NGC, and GHCR. |
| run | Commands to execute inside the container. |
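A minimal sketch of such a task file, written from the shell with a here-document. The file name task.yaml, the H100:1 request, the NGC PyTorch image tag, and the cluster name my-task are illustrative assumptions; substitute your own [GPU-TYPE]:[GPU-COUNT] and image.

```shell
# Write an example single-node GPU task definition to task.yaml.
cat > task.yaml <<'EOF'
resources:
  cloud: slurm                  # target the SUNK cluster
  accelerators: H100:1          # [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.10-py3   # example image; use your own

run: |
  nvidia-smi
EOF

# Then launch it from a machine where SkyPilot is configured, for example:
#   sky launch -c my-task task.yaml
```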
When you launch the task, SkyPilot submits an sbatch job to your SUNK cluster, pulls and initializes the container image on the allocated node, executes the run commands inside the container, and streams logs back to your terminal.
The first launch with a given container image takes several minutes as the image is pulled and cached. Subsequent launches reuse the cached image and start faster.
Use a setup block for dependencies
If your workload needs additional packages, use the setup field. Setup runs once when the cluster is first provisioned, before run:
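The following sketch shows one way a task file with a setup block might look. The file name task-setup.yaml, the transformers package, the image tag, and the GPU request are all illustrative assumptions, not values required by SkyPilot or SUNK.

```shell
# Write an example task definition that installs a dependency before running.
cat > task-setup.yaml <<'EOF'
resources:
  cloud: slurm
  accelerators: H100:1          # [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.10-py3   # example image; use your own

# Runs once, when the cluster is first provisioned.
setup: |
  pip install transformers

# Runs as the job itself, after setup has finished.
run: |
  python -c "import transformers; print(transformers.__version__)"
EOF
```

Keeping installs in setup rather than run means later jobs submitted to the same cluster do not repeat them.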
Interactive development
SkyPilot can function as an alternative to salloc for interactive GPU development. Instead of holding a terminal session open, SkyPilot provisions a persistent Slurm allocation that you can SSH into and submit commands to on demand.
Under the hood, SkyPilot submits an sbatch job with sleep infinity as the main process, keeping the allocation alive indefinitely. SSH and sky exec then attach to that running job through srun --overlap. The allocation stays up until you explicitly tear it down with sky down.
Create a resource YAML
Create a YAML file that specifies resources but no run command:
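A minimal sketch of a resources-only file, assuming the name interactive.yaml and the same illustrative GPU request and image used for batch tasks. The values are placeholders to adapt to your cluster.

```shell
# Write an example resources-only definition for an interactive session.
cat > interactive.yaml <<'EOF'
resources:
  cloud: slurm
  accelerators: H100:1          # [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.10-py3   # example image

# No run section: the allocation just stays up for interactive use.
EOF
```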
The image_id field is optional. Without it, you get a bare-metal session on the compute node. With it, you get an interactive container with your chosen ML framework pre-installed.
Launch the cluster and connect
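- Launch the cluster with sky launch -c [CLUSTER-NAME] interactive.yaml, where [CLUSTER-NAME] is a name for your interactive session, for example dev.
- SSH into the running allocation with ssh [CLUSTER-NAME]. This connects through the Slurm compute node and drops you into the container (if image_id was specified). From here, you can run commands such as nvidia-smi interactively, just as you would inside an salloc session.
- Optionally, submit batch tasks to the running allocation with sky exec, for example sky exec [CLUSTER-NAME] task.yaml, similar to running srun within an existing salloc.
- When you are done, release the Slurm allocation with sky down [CLUSTER-NAME].

The following table maps native Slurm commands to their SkyPilot equivalents: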
| Slurm native | SkyPilot equivalent |
|---|---|
| salloc --gres=gpu:1 | sky launch -c dev interactive.yaml |
| srun nvidia-smi | ssh dev and run commands, or sky exec dev task.yaml |
| exit or scancel | sky down dev |
| squeue | sky queue dev |
Manage jobs
In SkyPilot, a “cluster” refers to the named Slurm allocation that SkyPilot manages on your behalf. Use the cluster name you defined with sky launch -c [CLUSTER-NAME] in the following commands:
| Command | Description |
|---|---|
| sky status | List all active clusters and their current status. |
| sky queue [CLUSTER-NAME] | Show all jobs on the specified cluster. |
| sky logs [CLUSTER-NAME] | Stream logs from the latest job on the specified cluster. |
| sky cancel [CLUSTER-NAME] [JOB-ID] | Cancel a specific job without releasing the allocation. |
| sky down [CLUSTER-NAME] | Tear down the cluster and release its Slurm allocation. |
Known limitations
- Managed jobs require consolidation mode. sky jobs launch does not work with the default local API server because it requires a separate long-running controller process to monitor managed jobs. To enable managed jobs, turn on consolidation mode, which runs the jobs controller within the API server. Consolidation mode is enabled by default on remote API servers deployed with --deploy, but for a local API server you must enable it manually in ~/.sky/config.yaml. After updating the configuration, restart the API server for the change to take effect.
- Autostop is not supported. Clusters remain active until you explicitly tear them down with sky down. Remember to release allocations when they are no longer needed to avoid wasting resources.
- sky exec requires explicit GPU requests for GPU visibility. When running tasks with sky exec on container-based clusters, GPUs may not be visible unless you specify the --gpus flag. For example, sky exec -c [CLUSTER-NAME] --gpus [GPU-TYPE]:[GPU-COUNT] -- nvidia-smi -L. For interactive GPU work, you can also use ssh [CLUSTER-NAME] to connect directly to the running allocation.
- SSH requires SUNK v7.x or later. On SUNK v6.x, SSH into running jobs does not work due to Dropbear and enroot compatibility issues.
- Container support requires SUNK v7.x or later. Earlier versions using enroot 3.5.0 have a bug that prevents Pyxis from finding running containers. This is fixed in enroot 4.0.1 (shipped with SUNK v7.x).
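For the managed-jobs limitation above, the following sketch shows how consolidation mode might be enabled for a local API server. It assumes the setting lives under jobs.controller.consolidation_mode in ~/.sky/config.yaml; confirm the key against the configuration reference for your SkyPilot version. To avoid overwriting existing settings, the snippet only creates the file when it does not exist yet.

```shell
SKY_CONFIG="$HOME/.sky/config.yaml"
mkdir -p "$(dirname "$SKY_CONFIG")"

if [ -e "$SKY_CONFIG" ]; then
  # Do not clobber an existing configuration; merge the block by hand instead.
  echo "Found $SKY_CONFIG; add the jobs.controller.consolidation_mode setting manually." >&2
else
  cat > "$SKY_CONFIG" <<'EOF'
jobs:
  controller:
    consolidation_mode: true
EOF
fi

# Then restart the local API server so the change takes effect, for example:
#   sky api stop
#   sky api start
```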