SkyPilot is an open-source framework that provides a unified interface for running AI workloads across any infrastructure, including Slurm and SUNK. By using SkyPilot on SUNK, you define your workload in a YAML file and let SkyPilot handle the sbatch submission, container lifecycle management through Pyxis and enroot, and multi-node coordination.

This guide covers installing SkyPilot, running containerized GPU workloads, using SkyPilot for interactive development as an alternative to salloc, and managing jobs on your SUNK cluster.

Consider using SkyPilot on SUNK if any of the following apply:
  • You want a simpler interface for submitting GPU workloads without writing sbatch scripts directly.
  • You need to run containerized workloads across multiple nodes and want SkyPilot to handle distributed setup.
  • You want an salloc-like interactive development workflow with persistent sessions you can SSH into.

Prerequisites

Before completing the steps in this guide, be sure you have the following:
  • A SUNK cluster running SUNK v7.x or later with enroot 4.0.1+ and Pyxis v0.21.0+.
  • SSH access to your Slurm login node.
  • Familiarity with Slurm commands, such as srun, sbatch, and squeue.
  • Python 3.8 or later installed locally.
Container support through SkyPilot requires SUNK v7.x or later. Earlier versions with enroot 3.5.0 have a known bug that prevents SkyPilot’s container lifecycle management from working correctly.

Install SkyPilot

Install SkyPilot with the Slurm plugin:
pip install "skypilot-nightly[slurm]"

Configure SkyPilot for your SUNK cluster

  1. Create the SkyPilot Slurm SSH configuration file:
    mkdir -p ~/.slurm
    touch ~/.slurm/config
    
  2. Add your SUNK login node to ~/.slurm/config:
    Host [CLUSTER-NAME]
        HostName [LOGIN-NODE-IP]
        User [USERNAME]
        IdentityFile ~/.ssh/id_ed25519
    
    • [CLUSTER-NAME]: A name for your cluster, for example my-sunk-cluster.
    • [LOGIN-NODE-IP]: Your login node’s external IP address. To find it, run kubectl get svc slurm-login -n tenant-slurm.
    • [USERNAME]: Your Slurm username.
  3. Start the SkyPilot API server and verify your configuration:
    sky api start
    sky check
    
    You should see output similar to the following:
    Enabled infra
      Slurm [compute]
        Allowed clusters:
        └── my-sunk-cluster
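
If you register several clusters, the Host entry from step 2 can be written by a small helper instead of by hand. This is a minimal sketch: add_slurm_host is a hypothetical function name, the optional fourth argument is an assumption added so the helper can target a scratch file, and the key path simply mirrors the example above.

```shell
# Hypothetical helper: append a Host entry to the SkyPilot Slurm SSH config.
# Usage: add_slurm_host NAME LOGIN_NODE_IP USERNAME [CONFIG_PATH]
add_slurm_host() {
  config="${4:-$HOME/.slurm/config}"
  mkdir -p "$(dirname "$config")"
  # Skip the write if an entry with this exact name already exists.
  if [ -f "$config" ] && grep -qx "Host $1" "$config"; then
    echo "Host $1 already present in $config" >&2
    return 0
  fi
  # Tilde is not expanded inside a heredoc, so ~/.ssh/... is written literally.
  cat >> "$config" <<EOF
Host $1
    HostName $2
    User $3
    IdentityFile ~/.ssh/id_ed25519
EOF
}
```

For example, add_slurm_host my-sunk-cluster 203.0.113.10 alice appends an entry for a placeholder address and username. Running ssh -F ~/.slurm/config my-sunk-cluster sinfo afterwards is one way to confirm the entry reaches the login node before you run sky check.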
    

Containerized workloads

SkyPilot uses the image_id field in your YAML to specify a container image. Under the hood, this translates to Pyxis --container-image and --container-name flags on the sbatch and srun commands. You do not interact with Pyxis or enroot directly.
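
To make the translation concrete: Pyxis takes images in enroot's [USER@][REGISTRY#]IMAGE[:TAG] form, so a registry-qualified image name has its registry host separated by #. The sketch below shows what that mapping could look like. It is an illustration only: to_enroot_uri is a hypothetical helper, the registry-detection rule is a simple heuristic, and nothing here is taken from SkyPilot's actual implementation.

```shell
# Hypothetical helper: convert a SkyPilot-style image_id into an enroot image URI.
# Heuristic (an assumption): the first path component is treated as a registry
# host when it contains a dot or a port separator.
to_enroot_uri() {
  image="${1#docker:}"     # drop the docker: prefix if present
  first="${image%%/*}"     # text before the first slash
  case "$image" in
    */*)
      case "$first" in
        *.*|*:*) printf '%s#%s\n' "$first" "${image#*/}" ;;
        *)       printf '%s\n' "$image" ;;
      esac
      ;;
    *) printf '%s\n' "$image" ;;
  esac
}

# Roughly the kind of command that ends up being run (approximation, not verbatim):
#   srun --container-image="$(to_enroot_uri "$IMAGE_ID")" --container-name=<name> <command>
```

This can help when you compare a SkyPilot task with a hand-written srun command while debugging, even though day-to-day use never requires it.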

Run a single-node GPU task

Create a YAML file that specifies your resources and container image:
name: pytorch-gpu-test

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

run: |
  python -c "
  import torch
  print(f'PyTorch {torch.__version__}')
  print(f'CUDA available: {torch.cuda.is_available()}')
  print(f'GPU count: {torch.cuda.device_count()}')
  if torch.cuda.is_available():
      print(f'GPU: {torch.cuda.get_device_name(0)}')
  "
  • [GPU-TYPE]: The GPU type available on your cluster, for example H100 or B200.
  • [GPU-COUNT]: The number of GPUs to request.
Key fields:
Field | Description
--- | ---
cloud: slurm | Targets your SUNK cluster.
accelerators | GPU type and count. Must match a GPU type available on your cluster. Use sinfo on the login node to check available partitions.
image_id | The container image. Prefix with docker: for registry images. Supports Docker Hub, NGC, and GHCR.
run | Commands to execute inside the container.
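
To see which GPU types Slurm advertises, you can ask sinfo for each node group's generic resources (the %G format field) and extract the type names. The following is a sketch under the assumption that your cluster exposes typed GRES strings such as gpu:h100:8; gpu_types_from_gres is a hypothetical helper name, and the names it prints are whatever your Slurm configuration uses.

```shell
# Hypothetical helper: list distinct GPU type names from sinfo GRES output.
# Reads lines such as "gpu:h100:8" or "gpu:h100:8(S:0-1)" on stdin.
# Untyped entries such as "gpu:8" and non-GPU resources are ignored.
gpu_types_from_gres() {
  tr ',' '\n' | awk -F: '$1 == "gpu" && NF >= 3 { print $2 }' | sort -u
}

# On the login node (requires Slurm):
#   sinfo -h -o "%G" | gpu_types_from_gres
```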
Launch the task, replacing [TASK-YAML] with the path to the YAML file you created:
sky launch [TASK-YAML] -y
SkyPilot submits an sbatch job to your SUNK cluster, pulls and initializes the container image on the allocated node, executes the run commands inside the container, and streams logs back to your terminal.
The first launch with a given container image takes several minutes as the image is pulled and cached. Subsequent launches reuse the cached image and start faster.

Use a setup block for dependencies

If your workload needs additional packages, use the setup field. The setup commands run once, when the cluster is first provisioned, before the run commands:
name: training-job

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

setup: |
  pip install wandb transformers datasets

run: |
  torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE train.py
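
The same task can be extended across nodes. The sketch below writes a two-node variant to multinode.yaml, a hypothetical file name. It assumes SkyPilot's standard num_nodes field and its SKYPILOT_NODE_RANK, SKYPILOT_NUM_NODES, and SKYPILOT_NODE_IPS environment variables behave on SUNK as they do on other infrastructure. The rendezvous port 29500 is an arbitrary choice.

```shell
# Write a hypothetical two-node variant of the task above to multinode.yaml.
# The quoted 'EOF' keeps the $SKYPILOT_* variables literal in the file, so
# they are expanded on the compute nodes at run time, not by this shell.
cat > multinode.yaml <<'EOF'
name: training-job-multinode

num_nodes: 2

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:8
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3

setup: |
  pip install wandb transformers datasets

run: |
  # Use the first entry of SKYPILOT_NODE_IPS as the rendezvous address.
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
    train.py
EOF
```

Replace [GPU-TYPE] before launching the file with sky launch multinode.yaml -y.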

Interactive development

SkyPilot can function as an alternative to salloc for interactive GPU development. Instead of holding a terminal session open, SkyPilot provisions a persistent Slurm allocation that you can SSH into and submit commands to on demand. Under the hood, SkyPilot submits an sbatch job with sleep infinity as the main process, keeping the allocation alive indefinitely. SSH and sky exec then attach to that running job through srun --overlap. The allocation stays up until you explicitly tear it down with sky down.

Create a resource YAML

Create a YAML file, named interactive.yaml in the examples that follow, that specifies resources but no run command:
name: dev-session

resources:
  cloud: slurm
  accelerators: [GPU-TYPE]:[GPU-COUNT]
  image_id: docker:nvcr.io/nvidia/pytorch:24.12-py3
The image_id field is optional. Without it, you get a bare-metal session on the compute node. With it, you get an interactive container with your chosen ML framework pre-installed.

Launch the cluster and connect

  1. Launch the cluster:
    sky launch -c [CLUSTER-NAME] interactive.yaml -y
    
    • [CLUSTER-NAME]: A name for your interactive session, for example dev.
  2. SSH into the running allocation:
    ssh [CLUSTER-NAME]
    
    This connects through the Slurm compute node and drops you into the container (if image_id was specified). From here, you can run commands interactively, just as you would inside an salloc session:
    nvidia-smi -L
    python -c "import torch; print(torch.__version__)"
    
  3. Optionally, submit batch tasks to the running allocation with sky exec, similar to running srun within an existing salloc:
    sky exec [CLUSTER-NAME] task.yaml
    
  4. When you are done, release the Slurm allocation:
    sky down [CLUSTER-NAME]
    
The following table maps common Slurm interactive commands to their SkyPilot equivalents:
Slurm native | SkyPilot equivalent
--- | ---
salloc --gres=gpu:1 | sky launch -c dev interactive.yaml
srun nvidia-smi | ssh dev and run commands, or sky exec dev task.yaml
exit or scancel | sky down dev
squeue | sky queue dev

Manage jobs

In SkyPilot, a “cluster” refers to the named Slurm allocation that SkyPilot manages on your behalf. Use the cluster name you defined with sky launch -c [CLUSTER-NAME] in the following commands:
Command | Description
--- | ---
sky status | List all active clusters and their current status.
sky queue [CLUSTER-NAME] | Show all jobs on the specified cluster.
sky logs [CLUSTER-NAME] | Stream logs from the latest job on the specified cluster.
sky cancel [CLUSTER-NAME] [JOB-ID] | Cancel a specific job without releasing the allocation.
sky down [CLUSTER-NAME] | Tear down the cluster and release its Slurm allocation.

Known limitations

  • Managed jobs require consolidation mode. sky jobs launch does not work with the default local API server because it requires a separate long-running controller process to monitor managed jobs. To enable managed jobs, turn on consolidation mode, which runs the jobs controller within the API server. Consolidation mode is enabled by default on remote API servers deployed with --deploy, but for a local API server you must enable it manually in ~/.sky/config.yaml:
    jobs:
      controller:
        consolidation_mode: true
    
    After updating the configuration, restart the API server for the change to take effect.
  • Autostop is not supported. Clusters remain active until you explicitly tear them down with sky down. Remember to release allocations when they are no longer needed to avoid wasting resources.
  • sky exec requires explicit GPU requests for GPU visibility. When running tasks with sky exec on container-based clusters, GPUs may not be visible unless you specify the --gpus flag. For example, run sky exec -c [CLUSTER-NAME] --gpus [GPU-TYPE]:[GPU-COUNT] -- nvidia-smi -L. For interactive GPU work, you can also use ssh [CLUSTER-NAME] to connect directly to the running allocation.
  • SSH requires SUNK v7.x or later. On SUNK v6.x, SSH into running jobs does not work due to Dropbear and enroot compatibility issues.
  • Container support requires SUNK v7.x or later. Earlier versions using enroot 3.5.0 have a bug that prevents Pyxis from finding running containers. This is fixed in enroot 4.0.1 (shipped with SUNK v7.x).
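
For the consolidation-mode limitation above, the configuration change can be scripted. The sketch below appends the setting only when the file has no top-level jobs: key, because appending to an existing jobs: section would create duplicate YAML keys. enable_consolidation_mode is a hypothetical helper name, and the optional path argument is an assumption added for testing against a scratch file.

```shell
# Hypothetical helper: add the consolidation-mode setting to a SkyPilot config.
# Usage: enable_consolidation_mode [CONFIG_PATH]
enable_consolidation_mode() {
  config="${1:-$HOME/.sky/config.yaml}"
  mkdir -p "$(dirname "$config")"
  # Refuse to append when a top-level jobs: key exists; merge by hand instead.
  if [ -f "$config" ] && grep -q '^jobs:' "$config"; then
    echo "$config already has a jobs: section; edit it manually" >&2
    return 1
  fi
  # Make sure the existing content ends with a newline before appending.
  if [ -s "$config" ] && [ -n "$(tail -c 1 "$config")" ]; then
    echo >> "$config"
  fi
  cat >> "$config" <<'EOF'
jobs:
  controller:
    consolidation_mode: true
EOF
}

# After a successful update, restart the local API server:
#   sky api stop && sky api start
```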
Last modified on May 1, 2026