Run SkyPilot on SUNK - CoreWeave Docs

SkyPilot is an open-source framework that provides a unified interface for running AI workloads across any infrastructure, including Slurm and SUNK. When you use SkyPilot on SUNK, you define your workload in a YAML file, and SkyPilot handles the sbatch submission, container lifecycle management through Pyxis and enroot, and multi-node coordination.

This guide is for SUNK users who want a simpler way to submit and manage GPU workloads than writing sbatch scripts by hand. It covers how to install SkyPilot, run containerized GPU workloads, use SkyPilot for interactive development as an alternative to salloc, and manage jobs on your SUNK cluster.

Consider using SkyPilot on SUNK if any of the following apply:

You want a simpler interface to submit GPU workloads without writing sbatch scripts directly.
You need to run containerized workloads across multiple nodes and want SkyPilot to handle the distributed setup.
You want an salloc-like interactive development workflow with persistent sessions you can SSH into.

Prerequisites

Before you complete the steps in this guide, be sure you have the following:

A SUNK cluster running SUNK v7.x or later with enroot 4.0.1 or later and Pyxis v0.21.0 or later.
SSH access to your Slurm login node.
Familiarity with Slurm commands, such as srun, sbatch, and squeue.
Python 3.8 or later installed locally.

Container support through SkyPilot requires SUNK v7.x or later. Earlier versions with enroot 3.5.0 have a known bug that prevents SkyPilot’s container lifecycle management from working correctly.
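
The version requirements above can be checked with a simple tuple comparison. The following is a minimal sketch, assuming you have already collected the installed version strings yourself; the names parse_version, MINIMUMS, and meets_minimum are illustrative and aren't part of SkyPilot or SUNK.

```python
import re

# Minimum versions taken from the prerequisites in this guide.
MINIMUMS = {"enroot": "4.0.1", "pyxis": "0.21.0", "python": "3.8"}


def parse_version(text):
    """Turn a string such as 'v0.21.0' or '4.0.1' into a tuple of ints."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(component, installed):
    """Return True if the installed version is at least the documented minimum."""
    return parse_version(installed) >= parse_version(MINIMUMS[component])


if __name__ == "__main__":
    # Example inputs only; substitute the versions reported on your cluster.
    for component, installed in [("enroot", "3.5.0"), ("enroot", "4.0.1"), ("pyxis", "v0.21.0")]:
        status = "ok" if meets_minimum(component, installed) else "too old"
        print(f"{component} {installed}: {status}")
```

Comparing tuples of integers avoids the string-comparison trap where "0.9.0" sorts after "0.21.0".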

Install SkyPilot

Install SkyPilot locally so you can submit jobs to your SUNK cluster from your workstation. The Slurm plugin lets SkyPilot translate your YAML workload definitions into sbatch submissions.

pip install "skypilot-nightly[slurm]"

Configure SkyPilot for your SUNK cluster

To point SkyPilot at your SUNK cluster, complete the following steps:

Create the SkyPilot Slurm SSH configuration file:
mkdir -p ~/.slurm
touch ~/.slurm/config
Add your SUNK login node to ~/.slurm/config:
Host [CLUSTER-NAME]
    HostName [LOGIN-NODE-IP]
    User [USERNAME]
    IdentityFile ~/.ssh/id_ed25519
Replace the placeholders with the following values:
- [CLUSTER-NAME]: A name for your cluster, for example my-sunk-cluster.
- [LOGIN-NODE-IP]: Your login node’s external IP address. To find it, run kubectl get svc slurm-login -n tenant-slurm.
- [USERNAME]: Your Slurm username.
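
If you manage several clusters, you might generate the entry instead of typing it. The following is a minimal sketch that renders one entry for ~/.slurm/config from the three placeholder values. The function name render_slurm_host_entry and the example values are hypothetical; the function only builds text and doesn't write to the file.

```python
def render_slurm_host_entry(cluster_name, login_node_ip, username,
                            identity_file="~/.ssh/id_ed25519"):
    """Return an SSH-config style entry for one SUNK login node."""
    fields = [
        ("cluster_name", cluster_name),
        ("login_node_ip", login_node_ip),
        ("username", username),
    ]
    for label, value in fields:
        # Reject empty values and values with whitespace, which would break the entry.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty and contain no whitespace")
    return (
        f"Host {cluster_name}\n"
        f"    HostName {login_node_ip}\n"
        f"    User {username}\n"
        f"    IdentityFile {identity_file}\n"
    )


if __name__ == "__main__":
    # Example values only; 203.0.113.10 is an address reserved for documentation.
    print(render_slurm_host_entry("my-sunk-cluster", "203.0.113.10", "alice"), end="")
```

You can review the printed entry and then append it to ~/.slurm/config yourself.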

Start the SkyPilot API server and verify your configuration:

sky api start
sky check

You should see output similar to the following:

Enabled infra
  Slurm [compute]
    Allowed clusters:
    └── my-sunk-cluster

With SkyPilot installed and your cluster configured, you’re ready to submit workloads.

Containerized workloads

SkyPilot uses the image_id field in your YAML to specify a container image. Internally, this translates to Pyxis --container-image and --container-name flags on the sbatch and srun commands. You don’t interact with Pyxis or enroot directly.
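
To make the mapping more concrete: Pyxis documents the value of --container-image as [USER@][REGISTRY#]IMAGE[:TAG]. The following sketch shows one way a docker: image_id could be rewritten into that form. It illustrates the idea only and isn't SkyPilot's actual implementation; the function name to_pyxis_image is hypothetical, and the registry detection follows the usual Docker convention rather than anything stated in this guide.

```python
def to_pyxis_image(image_id):
    """Rewrite 'docker:REGISTRY/IMAGE:TAG' as 'REGISTRY#IMAGE:TAG'."""
    prefix = "docker:"
    if not image_id.startswith(prefix):
        raise ValueError("expected an image_id that starts with 'docker:'")
    reference = image_id[len(prefix):]

    first, slash, rest = reference.partition("/")
    # Docker convention (an assumption here): the first path component is a
    # registry host only if it looks like a hostname or is 'localhost'.
    looks_like_host = "." in first or ":" in first or first == "localhost"
    if slash and looks_like_host:
        return f"{first}#{rest}"
    # No registry component, so the reference is already IMAGE[:TAG].
    return reference


if __name__ == "__main__":
    print(to_pyxis_image("docker:nvcr.io/nvidia/pytorch:24.12-py3"))
```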

Run a single-node GPU task

Create a YAML file that specifies your resources and container image:

name: pytorch-gpu-test

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

run: |
  python -c "
  import torch
  print(f'PyTorch {torch.__version__}')
  print(f'CUDA available: {torch.cuda.is_available()}')
  print(f'GPU count: {torch.cuda.device_count()}')
  if torch.cuda.is_available():
      print(f'GPU: {torch.cuda.get_device_name(0)}')
  "

Replace the placeholders with the following values:

[GPU-TYPE]: The GPU type available on your cluster, for example H100 or B200.
[GPU-COUNT]: The number of GPUs to request.

The following table describes the key fields:

Field	Description
`cloud: slurm`	Targets your SUNK cluster.
`accelerators`	GPU type and count. Must match a GPU type available on your cluster. Use `sinfo` on the login node to check available partitions.
`image_id`	The container image. Prefix with `docker:` for registry images. Supports Docker Hub, NGC, and GHCR.
`run`	Commands to execute inside the container.
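
If you launch the same kind of task with different GPU types or counts, you might template the YAML rather than edit it by hand. The following is a minimal sketch, using only the fields described in the table above, that fills in the placeholders and returns the YAML text. The function name render_task_yaml is hypothetical, and the sketch does no YAML escaping, so it assumes simple values without special characters.

```python
import textwrap


def render_task_yaml(name, gpu_type, gpu_count, image, run_commands):
    """Return SkyPilot task YAML text for a single containerized GPU task."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    lines = [
        f"name: {name}",
        "",
        "resources:",
        "  cloud: slurm",
        f"  accelerators: {gpu_type}:{gpu_count}",
        # Registry images take the docker: prefix, as described in the table.
        f"  image_id: docker:{image}",
        "",
        "run: |",
        # Indent the commands so they sit inside the YAML block scalar.
        textwrap.indent(run_commands.strip("\n"), "  "),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_task_yaml(
        "pytorch-gpu-test", "H100", 8,
        "nvcr.io/nvidia/pytorch:24.12-py3",
        "nvidia-smi -L",
    ), end="")
```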

Launch the task. Replace [TASK-YAML] with the path to the YAML file you created:

sky launch [TASK-YAML] -y

SkyPilot submits an sbatch job to your SUNK cluster, pulls and initializes the container image on the allocated node, runs the run commands inside the container, and streams logs back to your terminal.

The first launch with a given container image takes several minutes while SkyPilot pulls and caches the image. Subsequent launches reuse the cached image and start faster.

Use a setup block for dependencies

If your workload needs additional packages, use the setup field. Setup runs once when the cluster is first provisioned, before run:

name: training-job

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

setup: |
  pip install wandb transformers datasets

run: |
  torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE train.py
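
This guide doesn't show train.py. As a minimal sketch of what such an entry point might do first, the following reads the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for each worker process. The function name read_torchrun_env is hypothetical, and the fallback values exist only so the script also runs outside torchrun.

```python
import os


def read_torchrun_env(environ=None):
    """Collect the per-worker values that torchrun exports."""
    env = os.environ if environ is None else environ
    return {
        "rank": int(env.get("RANK", "0")),              # global worker index
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # worker index on this node
        "world_size": int(env.get("WORLD_SIZE", "1")),  # total number of workers
    }


def main():
    dist = read_torchrun_env()
    print(
        f"worker {dist['rank']} of {dist['world_size']} "
        f"(local GPU index {dist['local_rank']})"
    )
    # A real training script would select the GPU for local_rank and
    # start its training loop here.


if __name__ == "__main__":
    main()
```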

Interactive development

SkyPilot can work as an alternative to salloc for interactive GPU development. Instead of holding a terminal session open, SkyPilot provisions a persistent Slurm allocation that you can connect to over SSH and submit commands to on demand. Internally, SkyPilot submits an sbatch job with sleep infinity as the main process, which keeps the allocation alive indefinitely. SSH and sky exec then attach to that running job through srun --overlap. The allocation stays up until you explicitly tear it down with sky down.

Create a resource YAML

Create a YAML file, such as interactive.yaml, that specifies resources but no run command:

name: dev-session

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

The image_id field is optional. Without it, you get a bare-metal session on the compute node. With it, you get an interactive container that has your chosen ML framework pre-installed.

Launch the cluster and connect

To launch your interactive session and attach to it, complete the following steps:

Launch the cluster:
sky launch -c [CLUSTER-NAME] interactive.yaml -y
Replace [CLUSTER-NAME] with a name for your interactive session, for example dev.
Use SSH to log in to the running allocation:
ssh [CLUSTER-NAME]
This connects through the Slurm compute node and opens a shell in the container if you specified image_id. You can then run commands interactively, as you would inside an salloc session:
nvidia-smi -L
python -c "import torch; print(torch.__version__)"
Optional: Submit batch tasks to the running allocation with sky exec, similar to running srun within an existing salloc:
sky exec [CLUSTER-NAME] task.yaml
When you’re done, release the Slurm allocation:
sky down [CLUSTER-NAME]

The following table maps common Slurm interactive commands to their SkyPilot equivalents:

Slurm built-in	SkyPilot equivalent
`salloc --gres=gpu:1`	`sky launch -c dev interactive.yaml`
`srun nvidia-smi`	`ssh dev` and run commands, or `sky exec dev task.yaml`
`exit` or `scancel`	`sky down dev`
`squeue`	`sky queue dev`

Manage jobs

After you have one or more SkyPilot clusters running on SUNK, use the following commands to inspect their state, stream logs, and release resources when you’re finished. In SkyPilot, a “cluster” refers to the named Slurm allocation that SkyPilot manages on your behalf. Use the cluster name you defined with sky launch -c [CLUSTER-NAME] in the following commands:

Command	Description
`sky status`	List all active clusters and their current status.
`sky queue [CLUSTER-NAME]`	Show all jobs on the specified cluster.
`sky logs [CLUSTER-NAME]`	Stream logs from the latest job on the specified cluster.
`sky cancel [CLUSTER-NAME] [JOB-ID]`	Cancel a specific job without releasing the allocation.
`sky down [CLUSTER-NAME]`	Tear down the cluster and release its Slurm allocation.
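
If you script these commands, for example to tear down allocations at the end of a work session, it can help to build the argument list in one place. The following is a minimal sketch limited to the five commands in the table above. The function name build_sky_command is hypothetical, and the sketch only constructs the command; it doesn't run it.

```python
import shlex


def build_sky_command(action, cluster=None, job_id=None):
    """Return the argv list for one of the job management commands."""
    if action == "status":
        return ["sky", "status"]
    if action not in {"queue", "logs", "cancel", "down"}:
        raise ValueError(f"unsupported action: {action!r}")
    if not cluster:
        raise ValueError(f"'{action}' needs a cluster name")
    argv = ["sky", action, cluster]
    if action == "cancel":
        # Cancelling targets one job and leaves the allocation in place.
        if job_id is None:
            raise ValueError("'cancel' needs a job ID")
        argv.append(str(job_id))
    return argv


if __name__ == "__main__":
    # "dev" and job ID 3 are example values.
    for argv in [
        build_sky_command("status"),
        build_sky_command("queue", "dev"),
        build_sky_command("cancel", "dev", 3),
        build_sky_command("down", "dev"),
    ]:
        print(shlex.join(argv))
```

To execute a command, you could pass the returned list to subprocess.run(argv, check=True). Passing a list rather than a shell string avoids quoting problems with unusual cluster names.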

Known limitations

SkyPilot on SUNK has the following known limitations:

Managed jobs require consolidation mode. sky jobs launch doesn’t work with the default local API server because it requires a separate long-running controller process to monitor managed jobs. To enable managed jobs, turn on consolidation mode, which runs the jobs controller within the API server. Consolidation mode is enabled by default on remote API servers deployed with --deploy. For a local API server, you must enable it manually in ~/.sky/config.yaml:
jobs:
  controller:
    consolidation_mode: true
After you update the configuration, restart the API server for the change to take effect.
Autostop isn’t supported. Clusters remain active until you explicitly tear them down with sky down. Remember to release allocations when they’re no longer needed to avoid wasting resources.
sky exec requires explicit GPU requests for GPU visibility. When you run tasks with sky exec on container-based clusters, GPUs may not be visible unless you specify the --gpus flag. For example, sky exec -c [CLUSTER-NAME] --gpus [GPU-TYPE]:[GPU-COUNT] -- nvidia-smi -L. For interactive GPU work, you can also use ssh [CLUSTER-NAME] to connect directly to the running allocation.
SSH requires SUNK v7.x or later. On SUNK v6.x, SSH to running jobs doesn’t work due to Dropbear and enroot compatibility issues.
Container support requires SUNK v7.x or later. Earlier versions that use enroot 3.5.0 have a bug that prevents Pyxis from finding running containers. This is fixed in enroot 4.0.1 (shipped with SUNK v7.x).
