# Integrate sandboxes with the SUNK Pod Scheduler

> Route sandbox pods through the SUNK Pod Scheduler so Slurm places them alongside your Slurm jobs.

When sandbox pods go through the SUNK Pod Scheduler, Slurm manages their placement alongside the Slurm jobs running in your cluster. Sandboxes become regular Slurm jobs and can run on any node in your cluster with available resources, including sharing CPU resources with other Slurm jobs or sandboxes already running on the same node.

<Note>
  CoreWeave sandboxes are in public preview. For access, contact your CoreWeave account team, [CoreWeave Support](https://cloud.coreweave.com/contact), or email [support@coreweave.com](mailto:support@coreweave.com).
</Note>

<Info>
  **Resource sharing rules**

  Sandboxes that request only CPU resources can land on idle CPU nodes, share CPU nodes with other sandboxes or Slurm jobs, and share CPU resources on GPU nodes where other workloads are running. GPUs cannot be shared between sandboxes and Slurm jobs on the same node because Kubernetes and Slurm use independent GPU allocators. See [Known limitations](/products/sunk/run_workloads/schedule-kubernetes-pods#known-limitations) for details.
</Info>

## Prerequisites

* A CKS cluster with [SUNK](/products/sunk) deployed and the [SUNK Pod Scheduler](/products/sunk/run_workloads/schedule-kubernetes-pods) enabled.
* A [CoreWeave sandbox runner deployed on the same cluster](/products/sandboxes/operations/managed-runners).
* The [CoreWeave Intelligent CLI](https://github.com/coreweave/cwic) (`cwic`), installed and authenticated. See [Deploy and manage a runner](/products/sandboxes/operations/managed-runners) for setup details.
* `cwsandbox-client >= 0.10.0` for client-side annotation support.

## Step 1: Verify the SUNK Pod Scheduler

Validate that your SUNK deployment is configured to work with CoreWeave sandboxes.

1. Verify the scheduler is running:

   ```bash theme={"system"}
   kubectl get pods -n tenant-slurm -l app.kubernetes.io/name=sunk-scheduler
   ```

   If no pods are returned, enable the scheduler in your Slurm Helm values by setting `scheduler.enabled: true`. See [Enable the scheduler](/products/sunk/run_workloads/schedule-kubernetes-pods#enable-the-scheduler) for details.

2. Look up the scheduler name and `KillWait` value. You reuse both when you author the profile in Step 2:

   ```bash theme={"system"}
   kubectl get pods -n tenant-slurm -l app.kubernetes.io/name=sunk-scheduler -o yaml \
     | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'
   ```

   Look for `--scheduler-name` and `--slurm-kill-wait` in the output:

   ```text theme={"system"}
   - --scheduler-name=tenant-slurm-slurm-scheduler
   - --slurm-kill-wait=30s
   ```

   See [Look up the scheduler configuration](/products/sunk/run_workloads/schedule-kubernetes-pods#look-up-the-scheduler-configuration) for the underlying behavior.

3. Confirm the scheduler scope covers the sandbox namespaces. The SUNK Pod Scheduler must be able to watch the namespaces where sandbox pods are created. Check your `scheduler.scope.type` setting in the Slurm Helm values.

   If the scope is `cluster`, the scheduler watches all namespaces and no additional configuration is needed. If the scope is `namespace`, the scheduler only watches the namespaces you list. You have two options:

   * **Pin sandboxes to a known namespace.** Set the profile's namespace strategy to `static` with a fixed namespace (for example, `sandbox-slurm`) and add that namespace to `scheduler.scope.namespaces`. This is the direct path and makes sandbox pods and placeholder jobs straightforward to find. See [Choose a namespace strategy](/products/sandboxes/profiles/configure#choose-a-namespace-strategy).
   * **Let the profile create namespaces dynamically.** Pick a `per-user` or `per-profile` strategy with a recognizable `namespacePrefix` (for example, `sb-`), then list the namespaces that match the prefix after the first sandbox runs:

     ```bash theme={"system"}
     kubectl get namespaces -l sandbox.coreweave.com/profile-id
     ```

     Add those namespaces (or the prefix pattern) to `scheduler.scope.namespaces`, then apply the updated Slurm Helm values so the scheduler picks up the change.

4. Lower `slurmd` resource requests on NodeSets you want to share with sandboxes. This is a change on the Slurm side, in your Slurm Helm values, not in any sandbox configuration. The default NodeSet resource requests consume most of the node's allocatable capacity in Kubernetes, leaving no room for sandbox pods. Without this change, sandbox pods are rejected with `OutOfmemory` or `OutOfcpu` errors. See [Manage resources with the SUNK Pod Scheduler](/products/sunk/run_workloads/manage-scheduler-resources) for configuration details.
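
The `--scheduler-name` and `--slurm-kill-wait` values from item 2 can also be extracted programmatically. The following is a minimal Python sketch, assuming each flag appears in `--flag=value` form and that the kill wait is expressed in whole seconds with an `s` suffix, as in the sample output. `parse_scheduler_args` is a hypothetical helper, not part of SUNK or `cwic`.

```python
import re


def parse_scheduler_args(args):
    """Return (scheduler_name, kill_wait_seconds) from a list of container args."""
    values = {}
    for arg in args:
        # Assumption: every flag of interest is written as --flag=value.
        if arg.startswith("--") and "=" in arg:
            flag, value = arg[2:].split("=", 1)
            values[flag] = value

    name = values.get("scheduler-name")
    raw_wait = values.get("slurm-kill-wait")
    if name is None or raw_wait is None:
        raise ValueError("missing --scheduler-name or --slurm-kill-wait")

    # Assumption: the duration is whole seconds with an "s" suffix, such as "30s".
    match = re.fullmatch(r"(\d+)s", raw_wait)
    if match is None:
        raise ValueError(f"unsupported kill-wait format: {raw_wait!r}")
    return name, int(match.group(1))


args = [
    "--scheduler-name=tenant-slurm-slurm-scheduler",
    "--slurm-kill-wait=30s",
]
scheduler_name, kill_wait_seconds = parse_scheduler_args(args)
print(scheduler_name, kill_wait_seconds)
```

If your deployment reports the kill wait in another duration format, the helper raises `ValueError` rather than guessing, so you can read the value manually instead.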

## Step 2: Create a SUNK-aware profile

A profile defines the execution environment for sandboxes. To route sandboxes through the SUNK Pod Scheduler, create a profile that sets `schedulerName` and a short `terminationGracePeriodSeconds` on the underlying pod spec. Bind the profile to the runner so that sandboxes launched against it flow through Slurm for placement.

Save the following as `slurm-profile.yaml`. Replace `tenant-slurm-slurm-scheduler` with the scheduler name from Step 1, and pick a `terminationGracePeriodSeconds` value that is strictly less than your cluster's `--slurm-kill-wait` minus 5 seconds (for the default 30-second kill wait, any value below 25 works, and `20` is a safe default):

```yaml theme={"system"}
display_name: slurm
description: SUNK-scheduled sandboxes that run as Slurm jobs
spec:
  namespace:
    strategy: per-user
  network:
    egress:
      default: internet
      modes:
        internet:
          type: internet
        none:
          type: none
  pod:
    spec:
      schedulerName: "tenant-slurm-slurm-scheduler"
      terminationGracePeriodSeconds: 20
```

The key settings:

* **`spec.pod.spec.schedulerName`** directs Kubernetes to hand the pod to the SUNK Pod Scheduler instead of the default Kubernetes scheduler.
* **`spec.pod.spec.terminationGracePeriodSeconds`** must be strictly less than `--slurm-kill-wait` minus 5 seconds. The Kubernetes default of 30 seconds exceeds the default 25-second threshold, so you must set this explicitly. See [Set the termination grace period](/products/sunk/run_workloads/schedule-kubernetes-pods#3-set-the-termination-grace-period) for the underlying rule.
* **`spec.namespace`** chooses how sandbox pods are grouped into namespaces. `per-user` gives each user their own namespace. `static` pins every sandbox to a fixed namespace. See [Choose a namespace strategy](/products/sandboxes/profiles/configure#choose-a-namespace-strategy).
* **`spec.network`** declares the outbound modes sandboxes can pick. The preceding example exposes `internet` (default) and `none`. See [Configure egress](/products/sandboxes/profiles/configure#configure-egress).
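
The grace-period rule is simple arithmetic, and a small check can catch a bad value before you create the profile. The following sketch encodes only the rule stated above (strictly less than the kill wait minus 5 seconds). The function names are hypothetical and nothing here calls SUNK.

```python
SAFETY_MARGIN_SECONDS = 5  # the margin stated in the rule above


def max_grace_period(kill_wait_seconds):
    """Largest whole-second grace period strictly below (kill wait - 5)."""
    return kill_wait_seconds - SAFETY_MARGIN_SECONDS - 1


def is_valid_grace_period(grace_seconds, kill_wait_seconds):
    """True when the grace period satisfies the strict inequality."""
    return 0 <= grace_seconds < kill_wait_seconds - SAFETY_MARGIN_SECONDS


print(max_grace_period(30))           # 24
print(is_valid_grace_period(20, 30))  # True
print(is_valid_grace_period(30, 30))  # False: the Kubernetes default is too long
```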

<Warning>
  Do not set `pod.spec.nodeSelector` in the profile. Slurm controls node placement, and a Kubernetes `nodeSelector` causes `NodeAffinity` failures when Slurm assigns the pod to a different node type. To target a specific GPU model or instance class, use Slurm partitions or constraints through the annotations in Step 4 instead.
</Warning>

Create the profile:

<Tabs>
  <Tab title="CLI">
    ```bash theme={"system"}
    cwic sandbox profile create -f slurm-profile.yaml
    ```
  </Tab>

  <Tab title="curl">
    ```bash title="Create a SUNK-aware profile" theme={"system"}
    curl -X POST https://api.coreweave.com/v1beta2/sandbox/profile-templates \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "profileTemplate": {
          "displayName": "slurm",
          "description": "SUNK-scheduled sandboxes that run as Slurm jobs",
          "spec": {
            "namespace": { "strategy": "per-user" },
            "network": {
              "egress": {
                "default": "internet",
                "modes": {
                  "internet": { "type": "internet" },
                  "none":     { "type": "none" }
                }
              }
            },
            "pod": {
              "spec": {
                "schedulerName": "tenant-slurm-slurm-scheduler",
                "terminationGracePeriodSeconds": 20
              }
            }
          }
        }
      }'
    ```
  </Tab>
</Tabs>

The response prints the new profile's ID. Bind it to your runner by editing the runner's `profile_bindings` list. The patch replaces the entire list in one transaction, so include every binding you want the runner to keep:

<Tabs>
  <Tab title="CLI">
    ```yaml title="runner-bindings.yaml" theme={"system"}
    profile_bindings:
      - profile_template_id: "[EXISTING-DEFAULT-PROFILE-ID]"
        profile_name: default
        is_default: true
      - profile_template_id: "[SLURM-PROFILE-ID]"
        profile_name: slurm
    ```

    ```bash theme={"system"}
    cwic sandbox runner edit [RUNNER-ID] -f runner-bindings.yaml
    ```
  </Tab>

  <Tab title="curl">
    ```bash title="Attach the SUNK profile to a runner" theme={"system"}
    curl -X PATCH https://api.coreweave.com/v1beta2/sandbox/managed-runners/[RUNNER-ID] \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "runner": {
          "id": "[RUNNER-ID]",
          "profileBindings": [
            { "profileTemplateId": "[EXISTING-DEFAULT-PROFILE-ID]", "profileName": "default", "isDefault": true },
            { "profileTemplateId": "[SLURM-PROFILE-ID]",            "profileName": "slurm" }
          ]
        },
        "updateMask": "profileBindings"
      }'
    ```
  </Tab>
</Tabs>

For the full profile schema and binding semantics, see [Configure a sandbox profile](/products/sandboxes/profiles/configure) and [Manage profile bindings](/products/sandboxes/profiles/configure#manage-profile-bindings).
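
Because the patch replaces the whole `profile_bindings` list, it helps to build the new list from the runner's current bindings rather than writing it by hand. The following is a minimal sketch that works on plain dictionaries using the field names from `runner-bindings.yaml`. `add_binding` is a hypothetical helper, and how you fetch the current bindings is left to you.

```python
def add_binding(existing, profile_template_id, profile_name):
    """Return a new list that keeps every existing binding and appends one more."""
    if any(b["profile_name"] == profile_name for b in existing):
        raise ValueError(f"profile name already bound: {profile_name}")
    # Copy each entry so the caller's list is not modified.
    merged = [dict(b) for b in existing]
    merged.append(
        {"profile_template_id": profile_template_id, "profile_name": profile_name}
    )
    return merged


current = [
    {
        "profile_template_id": "[EXISTING-DEFAULT-PROFILE-ID]",
        "profile_name": "default",
        "is_default": True,
    },
]
updated = add_binding(current, "[SLURM-PROFILE-ID]", "slurm")
for binding in updated:
    print(binding)
```

The existing default binding is carried over unchanged, so applying the result does not silently drop it.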

## Step 3: Create sandboxes

To route sandboxes through the SUNK Pod Scheduler, ask for the `slurm` profile when you create a sandbox. The SDK's `runway_ids` parameter selects which profile to use. Pass the `profile_name` from the preceding runner binding.

```python theme={"system"}
from cwsandbox import Sandbox

with Sandbox.run(runway_ids=["slurm"]) as sb:
    print(f"Sandbox ID: {sb.sandbox_id}")
    result = sb.exec(["hostname"]).result()
    print(result.stdout)
```

To verify the sandbox is running as a Slurm job, search for its placeholder job in Slurm using the sandbox ID printed in the preceding example:

```bash theme={"system"}
sacct --format=JobID,JobName%60,State,NodeList -X | grep [SANDBOX-ID]
```

Slurm picks the node. You don't control which node the sandbox lands on unless you add [Slurm annotations](#step-4-control-placement-with-slurm-annotations) to guide Slurm's scheduler.

To set session defaults so that all sandboxes in a session use the SUNK profile, pass `runway_ids` on `SandboxDefaults`:

```python theme={"system"}
from cwsandbox import Sandbox, SandboxDefaults

defaults = SandboxDefaults(runway_ids=("slurm",))

with Sandbox.session(defaults) as session:
    sb1 = session.sandbox()
    sb2 = session.sandbox()

    r1 = sb1.exec(["hostname"]).result()
    r2 = sb2.exec(["hostname"]).result()
    print(f"sb1: {r1.stdout.strip()}, sb2: {r2.stdout.strip()}")
```

### Resource requests and Slurm accounting

SUNK reads the pod's resource **requests** (not limits) and converts them to Slurm job parameters:

| Pod field         | Slurm parameter |
| ----------------- | --------------- |
| `requests.cpu`    | `CPUsPerTask`   |
| `requests.memory` | `MinMemoryNode` |

Slurm uses these values for scheduling decisions and `sacct` accounting. SUNK does not require any particular Quality of Service class. Guaranteed (requests equal limits) and Burstable (requests lower than limits) both work.

If you set requests lower than limits with `ResourceOptions`, the pod can burst up to the limits when capacity is free, but Slurm only sees the requests. For example, a sandbox configured with:

```python theme={"system"}
requests={"cpu": "500m", "memory": "512Mi"},
limits={"cpu": "2",     "memory": "2Gi"},
```

shows up in `sacct` as a 500m CPU, 512Mi memory job, even though the sandbox can use up to 2 CPUs and 2Gi when the node has room. See [Resources](/products/sandboxes/client/guides/sandbox-configuration#resources) for the full `ResourceOptions` reference.

Size the requests to match what your sandbox workloads need, leaving enough room on the target nodes for the `slurmd` requests you lowered in Step 1. For the underlying rules, see [Set resource requests](/products/sunk/run_workloads/schedule-kubernetes-pods#2-set-resource-requests) and [Manage resources with the SUNK Pod Scheduler](/products/sunk/run_workloads/manage-scheduler-resources).
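
To see which numbers Slurm accounts for, you can convert the Kubernetes quantities in `requests` to plain values and ignore `limits`. The following sketch handles only the millicore suffix for CPU and the binary suffixes (`Ki`, `Mi`, `Gi`, `Ti`) for memory. This page does not state how SUNK rounds or formats the values it passes to `CPUsPerTask` and `MinMemoryNode`, so the helper (a hypothetical name) reports the raw quantities only.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_cores(quantity):
    """Convert a CPU quantity such as "500m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_mib(quantity):
    """Convert a memory quantity such as "512Mi" to MiB."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor / 2**20
    return int(quantity) / 2**20  # a bare number is a byte count


def scheduled_resources(requests, limits=None):
    """Only requests feed the Slurm job parameters; limits are ignored."""
    return {
        "cpu_cores": cpu_cores(requests["cpu"]),
        "memory_mib": memory_mib(requests["memory"]),
    }


seen_by_slurm = scheduled_resources(
    requests={"cpu": "500m", "memory": "512Mi"},
    limits={"cpu": "2", "memory": "2Gi"},
)
print(seen_by_slurm)
```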

## Step 4: Control placement with Slurm annotations

To control sandbox placement, set SUNK annotations on the sandbox at launch time. The following example pins the sandbox to the `hpc-prod` partition:

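```python theme={"system"}
from cwsandbox import Sandbox

# Request the slurm profile and pin the placeholder job to one partition.
sb = Sandbox.run(
    runway_ids=["slurm"],
    annotations={
        "sunk.coreweave.com/partition": "hpc-prod",
    },
)
```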

### Common annotations

All annotations share the `sunk.coreweave.com/` prefix. The annotations commonly used to control sandbox placement are:

| Annotation key | Description                                                                      |
| -------------- | -------------------------------------------------------------------------------- |
| `partition`    | Slurm partition name                                                             |
| `constraint`   | Slurm feature constraint                                                         |
| `account`      | Slurm accounting name                                                            |
| `qos`          | Slurm QoS level                                                                  |
| `user-id`      | Slurm user ID for accounting. Must be a numeric Linux UID (for example `"1000"`) |
| `exclusive`    | Node exclusivity (`none`, `user`, or `ok`)                                       |

Passing a username instead of a numeric UID to the `user-id` annotation causes a blocking error that prevents Slurm from scheduling the sandbox. To find the numeric UID from a Slurm login node:

```bash theme={"system"}
whoami            # confirm your username first if needed
id -u             # your own UID
id -u [USERNAME]  # someone else's UID
```

For the full list, see [Annotations reference](/products/sunk/run_workloads/schedule-kubernetes-pods#annotations-reference).
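
Since a non-numeric `user-id` blocks scheduling, it can be worth validating annotations on the client before you launch a sandbox. The following is a minimal sketch, assuming you want to write the short keys from the table and have the prefix added for you. `sunk_annotations` is a hypothetical helper, not part of `cwsandbox`.

```python
SUNK_PREFIX = "sunk.coreweave.com/"


def sunk_annotations(**settings):
    """Build prefixed annotations; underscores in keyword names become hyphens."""
    annotations = {}
    for key, value in settings.items():
        name = key.replace("_", "-")
        value = str(value)
        # user-id must be a numeric Linux UID, never a username.
        if name == "user-id" and not (value.isascii() and value.isdigit()):
            raise ValueError(f"user-id must be a numeric UID, got {value!r}")
        annotations[SUNK_PREFIX + name] = value
    return annotations


annotations = sunk_annotations(partition="hpc-prod", user_id=1000)
print(annotations)
```

You can then pass the resulting dictionary as the `annotations` argument shown in the earlier example.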

### Enforce annotations in the profile

Operators can pin SUNK annotations on the profile to enforce Slurm job parameters for every sandbox that uses the profile. For example, to restrict all sandboxes on the `slurm` profile to a `sandboxes` partition, add the annotation to `spec.pod.metadata.annotations`:

```yaml theme={"system"}
spec:
  pod:
    metadata:
      annotations:
        sunk.coreweave.com/partition: "sandboxes"
```

<Note>
  Pod annotations are an exception to the [normal layered override order](/products/sandboxes/profiles/profiles#how-fields-combine-at-runtime). For most profile fields, a per-sandbox value wins over a profile value because per-sandbox overrides have the highest precedence. For pod annotations, the gateway instead checks for conflicts: if a client passes the same annotation key that the profile already pins, the request is rejected with `annotation_conflict`. This is intentional. It lets operators enforce account, partition, and QoS without users being able to override them.
</Note>

To leave an annotation client-configurable, omit it from the profile.
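
The conflict rule can be modeled locally to predict whether a request would be rejected. The following sketch only illustrates the behavior described in the note: any key present in both the profile and the client request is a conflict, regardless of value. It is not the gateway's implementation, and the exception class and function name are hypothetical.

```python
class AnnotationConflict(Exception):
    """Raised when a client annotation key is already pinned by the profile."""


def merge_annotations(profile, client):
    """Combine profile and client annotations, rejecting overlapping keys."""
    conflicts = sorted(set(profile) & set(client))
    if conflicts:
        raise AnnotationConflict("annotation_conflict: " + ", ".join(conflicts))
    return {**profile, **client}


profile = {"sunk.coreweave.com/partition": "sandboxes"}

# A key the profile does not pin is accepted.
merged = merge_annotations(profile, {"sunk.coreweave.com/qos": "normal"})
print(merged)

# A key the profile pins is rejected, even though the client sent a value.
try:
    merge_annotations(profile, {"sunk.coreweave.com/partition": "hpc-prod"})
except AnnotationConflict as err:
    print(err)
```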

### Match the Slurm user from a training job

Training jobs often run with `--exclusive=user` to claim entire nodes for a single user. This prevents other users' jobs from competing for resources on those nodes while still allowing the same user to run additional jobs there, such as sandboxes that use spare CPU alongside GPU training.

By default, SUNK placeholder jobs run as root (UID 0). Because root is a different user than the one who submitted the training job, Slurm does not place the sandbox placeholder on the exclusive node.

When training code uses the `cwsandbox` Python client to create sandboxes from within a running Slurm job, it can read the job's Slurm user ID from the environment and pass it as an annotation. This ensures the sandbox placeholder jobs are submitted under the same Slurm user as the training job, allowing Slurm to place them on the same exclusive nodes:

```python theme={"system"}
import os
from cwsandbox import Sandbox, SandboxDefaults

slurm_uid = os.environ["SLURM_JOB_UID"]

defaults = SandboxDefaults(
    runway_ids=("slurm",),
    annotations={
        "sunk.coreweave.com/user-id": slurm_uid,
    },
)

with Sandbox.session(defaults) as session:
    sb = session.sandbox()
    result = sb.exec(["hostname"]).result()
    print(result.stdout)
```

The `user-id` annotation must be a numeric Linux UID, not a username. When set, SUNK also defaults the `group-id` to the same value. Set `sunk.coreweave.com/group-id` separately if the group ID differs.
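
When the group ID differs from the user ID, both annotations need to be set. The following is a minimal sketch that adds `group-id` only when it differs, relying on the defaulting behavior described above. It assumes the calling process runs as the Slurm job's user, so `os.getuid()` and `os.getgid()` return that user's IDs, and that `group-id` takes a numeric ID like `user-id` does. `identity_annotations` is a hypothetical helper.

```python
import os


def identity_annotations(uid, gid):
    """Return SUNK identity annotations, adding group-id only when it differs."""
    annotations = {"sunk.coreweave.com/user-id": str(uid)}
    if gid != uid:
        # SUNK defaults group-id to the user-id, so set it only on a mismatch.
        annotations["sunk.coreweave.com/group-id"] = str(gid)
    return annotations


# Assumption: this process runs as the user who owns the Slurm job.
print(identity_annotations(os.getuid(), os.getgid()))
```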

## Troubleshooting

Use the following sections to diagnose common issues when running sandboxes through the SUNK Pod Scheduler.

### Placeholder jobs temporarily in completing state

When a sandbox stops, its Slurm placeholder job spends 30 to 60 seconds in the `CG` (completing) state while cleanup scripts run and the node is released back to the pool. This is normal Slurm behavior and does not affect other jobs running on the same node. See [Slurm job states](/products/sunk/manage_sunk/slurm-job-states) for the full state reference.

### Sandboxes not landing where expected

Slurm determines sandbox placement based on the annotations the profile sets or the client passes. If sandboxes are not landing on the expected nodes, verify the Slurm job parameters.

SUNK creates a placeholder Slurm job for each sandbox pod with the name `<namespace>/<pod-name>`. The pod name includes the sandbox ID, which is available from the client as `sb.sandbox_id`. Find the placeholder job and inspect its parameters by searching for the sandbox ID:

```bash theme={"system"}
sacct --format=JobID,JobName%60,Partition,Account,AllocCPUS,ReqMem,State,NodeList,Start,End -X \
  | grep [SANDBOX-ID]
```

Replace `[SANDBOX-ID]` with the value of `sb.sandbox_id` from the Python client.
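
If you prefer to inspect the result in Python, `sacct` can emit pipe-delimited output with a header row through its `--parsable2` flag. The following sketch parses that text and keeps the rows whose job name contains the sandbox ID. It assumes you captured the output of a command such as `sacct --parsable2 -X --format=JobID,JobName,Partition,State,NodeList` yourself. `find_placeholder_jobs` is a hypothetical helper.

```python
def find_placeholder_jobs(sacct_text, sandbox_id):
    """Return sacct rows (as dicts) whose JobName contains the sandbox ID."""
    lines = [line for line in sacct_text.splitlines() if line.strip()]
    if not lines:
        return []
    # The first line of --parsable2 output holds the column names.
    header = lines[0].split("|")
    jobs = [dict(zip(header, line.split("|"))) for line in lines[1:]]
    return [job for job in jobs if sandbox_id in job.get("JobName", "")]
```

Each returned dictionary is keyed by the column names you requested with `--format`, so you can compare fields such as `Partition` against the annotations you expected to apply.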

## See also

* [About CoreWeave sandboxes](/products/sandboxes)
* [Deploy and manage a runner](/products/sandboxes/operations/managed-runners)
* [Configure a sandbox profile](/products/sandboxes/profiles/configure)
* [Schedule Kubernetes pods with the SUNK Pod Scheduler](/products/sunk/run_workloads/schedule-kubernetes-pods)
* [RL training with sandboxes](/products/sandboxes/client/guides/rl-training)
* [Sandbox configuration](/products/sandboxes/client/guides/sandbox-configuration)
