Integrate sandboxes with the SUNK Pod Scheduler

Route sandbox pods through the SUNK Pod Scheduler so that Slurm manages their placement alongside Slurm jobs running in your cluster. Sandboxes become regular Slurm jobs and can run on any node in your cluster with available resources, including sharing CPU resources with other Slurm jobs or sandboxes already running on the same node.

CoreWeave sandboxes are in public preview. For access, contact your CoreWeave account team, reach out to CoreWeave Support, or email support@coreweave.com.

Resource sharing rulesSandboxes that request only CPU resources can land on idle CPU nodes, share CPU nodes with other sandboxes or Slurm jobs, and share CPU resources on GPU nodes where other workloads are running. GPUs cannot be shared between sandboxes and Slurm jobs on the same node because Kubernetes and Slurm use independent GPU allocators. See Known limitations for details.

Prerequisites

A CKS cluster with SUNK deployed and the SUNK Pod Scheduler enabled.
A CoreWeave sandbox runner deployed on the same cluster.
The CoreWeave Intelligent CLI (cwic), installed and authenticated. See Deploy and manage a runner for setup details.
cwsandbox-client >= 0.10.0 for client-side annotation support.

Step 1: Verify the SUNK Pod Scheduler

Validate that your SUNK deployment is configured to work with CoreWeave sandboxes.

Verify the scheduler is running:
```
kubectl get pods -n tenant-slurm -l app.kubernetes.io/name=sunk-scheduler
```
If no pods are returned, enable the scheduler in your Slurm Helm values by setting scheduler.enabled: true. See Enable the scheduler for details.

Look up the scheduler name and KillWait value. You reuse both when you author the profile in Step 2:

kubectl get pods -n tenant-slurm -l app.kubernetes.io/name=sunk-scheduler -o yaml \
  | yq '.items[0].spec.containers[] | select(.name == "scheduler").args'

Look for --scheduler-name and --slurm-kill-wait in the output:

- --scheduler-name=tenant-slurm-slurm-scheduler
- --slurm-kill-wait=30s

See Look up the scheduler configuration for the underlying behavior.

Confirm the scheduler scope covers the sandbox namespaces. The SUNK Pod Scheduler must be able to watch the namespaces where sandbox pods are created. Check your scheduler.scope.type setting in the Slurm Helm values. If the scope is cluster, the scheduler watches all namespaces and no additional configuration is needed. If the scope is namespace, the scheduler only watches the namespaces you list. You have two options:
- Pin sandboxes to a known namespace. Set the profile’s namespace strategy to static with a fixed namespace (for example, sandbox-slurm) and add that namespace to scheduler.scope.namespaces. This is the direct path and makes sandbox pods and placeholder jobs straightforward to find. See Choose a namespace strategy.
- Let the profile create per-user namespaces. Pick a per-user or per-profile strategy with a recognizable namespacePrefix (for example, sb-), then list the namespaces that match the prefix after the first sandbox runs:
  kubectl get namespaces -l sandbox.coreweave.com/profile-id
  Add those namespaces (or the prefix pattern) to scheduler.scope.namespaces and re-roll Slurm.
Lower slurmd resource requests on NodeSets you want to share with sandboxes. This is a change on the Slurm side, in your Slurm Helm values, not in any sandbox configuration. The default NodeSet resource requests consume most of the node’s allocatable capacity in Kubernetes, leaving no room for sandbox pods. Without this change, sandbox pods are rejected with OutOfMemory or OutOfcpu errors. See Manage resources with the SUNK Pod Scheduler for configuration details.

Step 2: Create a SUNK-aware profile

A profile defines the execution environment for sandboxes. To route sandboxes through the SUNK Pod Scheduler, create a profile that sets schedulerName and a short terminationGracePeriodSeconds on the underlying pod spec. Bind the profile to the runner so that sandboxes launched against it flow through Slurm for placement. Save the following as slurm-profile.yaml. Replace tenant-slurm-slurm-scheduler with the scheduler name from Step 1, and pick a terminationGracePeriodSeconds value that is strictly less than your cluster’s --slurm-kill-wait minus 5 seconds (for the default 30-second kill wait, any value below 25 works, and 20 is a safe default):

display_name: slurm
description: SUNK-scheduled sandboxes that run as Slurm jobs
spec:
  namespace:
    strategy: per-user
  network:
    egress:
      default: internet
      modes:
        internet:
          type: internet
        none:
          type: none
  pod:
    spec:
      schedulerName: "tenant-slurm-slurm-scheduler"
      terminationGracePeriodSeconds: 20

The key settings:

spec.pod.spec.schedulerName directs Kubernetes to hand the pod to the SUNK Pod Scheduler instead of the default Kubernetes scheduler.
spec.pod.spec.terminationGracePeriodSeconds must be strictly less than --slurm-kill-wait minus 5 seconds. The Kubernetes default of 30 seconds exceeds the default 25-second threshold, so you must set this explicitly. See Set the termination grace period for the underlying rule.
spec.namespace chooses how sandbox pods are grouped into namespaces. per-user gives each user their own namespace. static pins every sandbox to a fixed namespace. See Choose a namespace strategy.
spec.network declares the outbound modes sandboxes can pick. The preceding example exposes internet (default) and none. See Configure egress.

Do not set pod.spec.nodeSelector in the profile. Slurm controls node placement, and a Kubernetes nodeSelector causes NodeAffinity failures when Slurm assigns the pod to a different node type. To target a specific GPU model or instance class, use Slurm partitions or constraints through the annotations in Step 4 instead.

Create the profile:

CLI
curl

cwic sandbox profile create -f slurm-profile.yaml

Create a SUNK-aware profile

curl -X POST https://api.coreweave.com/v1beta2/sandbox/profile-templates \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "profileTemplate": {
      "displayName": "slurm",
      "description": "SUNK-scheduled sandboxes that run as Slurm jobs",
      "spec": {
        "namespace": { "strategy": "per-user" },
        "network": {
          "egress": {
            "default": "internet",
            "modes": {
              "internet": { "type": "internet" },
              "none":     { "type": "none" }
            }
          }
        },
        "pod": {
          "spec": {
            "schedulerName": "tenant-slurm-slurm-scheduler",
            "terminationGracePeriodSeconds": 20
          }
        }
      }
    }
  }'

The response prints the new profile’s ID. Bind it to your runner by editing the runner’s profile_bindings list. The patch replaces the entire list in one transaction, so include every binding you want the runner to keep:

CLI
curl

runner-bindings.yaml

profile_bindings:
  - profile_template_id: "[EXISTING-DEFAULT-PROFILE-ID]"
    profile_name: default
    is_default: true
  - profile_template_id: "[SLURM-PROFILE-ID]"
    profile_name: slurm

cwic sandbox runner edit [RUNNER-ID] -f runner-bindings.yaml

Attach the SUNK profile to a runner

curl -X PATCH https://api.coreweave.com/v1beta2/sandbox/managed-runners/[RUNNER-ID] \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "runner": {
      "id": "[RUNNER-ID]",
      "profileBindings": [
        { "profileTemplateId": "[EXISTING-DEFAULT-PROFILE-ID]", "profileName": "default", "isDefault": true },
        { "profileTemplateId": "[SLURM-PROFILE-ID]",            "profileName": "slurm" }
      ]
    },
    "updateMask": "profileBindings"
  }'

For the full profile schema and binding semantics, see Configure a sandbox profile and Manage profile bindings.

Step 3: Create sandboxes

To route sandboxes through the SUNK Pod Scheduler, ask for the slurm profile when you create a sandbox. The SDK’s runway_ids parameter selects which profile to use. Pass the profile_name from the preceding runner binding.

from cwsandbox import Sandbox

with Sandbox.run(runway_ids=["slurm"]) as sb:
    print(f"Sandbox ID: {sb.sandbox_id}")
    result = sb.exec(["hostname"]).result()
    print(result.stdout)

To verify the sandbox is running as a Slurm job, search for its placeholder job in Slurm using the sandbox ID printed in the preceding example:

sacct --format=JobID,JobName%60,State,NodeList -X | grep [SANDBOX-ID]

Slurm picks the node. You don’t control which node the sandbox lands on unless you add Slurm annotations to guide Slurm’s scheduler. To set session defaults so that all sandboxes in a session use the SUNK profile, pass runway_ids on SandboxDefaults:

from cwsandbox import Sandbox, SandboxDefaults

defaults = SandboxDefaults(runway_ids=("slurm",))

with Sandbox.session(defaults) as session:
    sb1 = session.sandbox()
    sb2 = session.sandbox()

    r1 = sb1.exec(["hostname"]).result()
    r2 = sb2.exec(["hostname"]).result()
    print(f"sb1: {r1.stdout.strip()}, sb2: {r2.stdout.strip()}")

Resource requests and Slurm accounting

SUNK reads the pod’s resource requests (not limits) and converts them to Slurm job parameters:

Pod field	Slurm parameter
`requests.cpu`	`CPUsPerTask`
`requests.memory`	`MinMemoryNode`

Slurm uses these values for scheduling decisions and sacct accounting. SUNK does not require any particular Quality of Service class. Guaranteed (requests equal limits) and Burstable (requests lower than limits) both work. If you set requests lower than limits with ResourceOptions, the pod can burst up to the limits when capacity is free, but Slurm only sees the requests. For example, a sandbox configured with:

requests={"cpu": "500m", "memory": "512Mi"},
limits={"cpu": "2",     "memory": "2Gi"},

shows up in sacct as a 500m CPU, 512Mi memory job, even though the sandbox can use up to 2 CPUs and 2Gi when the node has room. See Resources for the full ResourceOptions reference. Size the requests to match what your sandbox workloads need, leaving enough room on the target nodes for the slurmd requests you lowered in Step 1. For the underlying rules, see Set resource requests and Manage resources with the SUNK Pod Scheduler.

Step 4: Control placement with Slurm annotations

To control sandbox placement, set SUNK annotations on the sandbox at launch time. The following example pins the sandbox to the hpc-prod partition:

sb = Sandbox.run(
    runway_ids=["slurm"],
    annotations={
        "sunk.coreweave.com/partition": "hpc-prod",
    },
)

Common annotations

All annotations share the sunk.coreweave.com/ prefix. The annotations commonly used to control sandbox placement are:

Annotation key	Description
`partition`	Slurm partition name
`constraint`	Slurm feature constraint
`account`	Slurm accounting name
`qos`	Slurm QoS level
`user-id`	Slurm user ID for accounting. Must be a numeric Linux UID (for example `"1000"`)
`exclusive`	Node exclusivity (`none`, `user`, or `ok`)

Passing a username instead of a numeric UID to the user-id annotation causes a blocking error that prevents Slurm from scheduling the sandbox. To find the numeric UID from a Slurm login node:

id -u           # your own UID
id -u [USERNAME]  # someone else's UID
whoami          # confirm your username first if needed

For the full list, see Annotations reference.

Enforce annotations in the profile

Operators can pin SUNK annotations on the profile to enforce Slurm job parameters for every sandbox that uses the profile. For example, to restrict all sandboxes on the slurm profile to a sandboxes partition, add the annotation to spec.pod.metadata.annotations:

spec:
  pod:
    metadata:
      annotations:
        sunk.coreweave.com/partition: "sandboxes"

Pod annotations are an exception to the normal layered override order. For most profile fields, a per-sandbox value wins over a profile value because per-sandbox overrides have the highest precedence. For pod annotations, the gateway instead checks for conflicts: if a client passes the same annotation key that the profile already pins, the request is rejected with annotation_conflict. This is intentional. It lets operators enforce account, partition, and QoS without users being able to override them.

To leave an annotation client-configurable, omit it from the profile.

Match the Slurm user from a training job

Training jobs often run with --exclusive=user to claim entire nodes for a single user. This prevents other users’ jobs from competing for resources on those nodes while still allowing the same user to run additional jobs there, such as sandboxes that use spare CPU alongside GPU training. By default, SUNK placeholder jobs run as root (UID 0). Because root is a different user than the one who submitted the training job, Slurm does not place the sandbox placeholder on the exclusive node. When training code uses the cwsandbox Python client to create sandboxes from within a running Slurm job, it can read the job’s Slurm user ID from the environment and pass it as an annotation. This ensures the sandbox placeholder jobs are submitted under the same Slurm user as the training job, allowing Slurm to place them on the same exclusive nodes:

import os
from cwsandbox import Sandbox, SandboxDefaults

slurm_uid = os.environ["SLURM_JOB_UID"]

defaults = SandboxDefaults(
    runway_ids=("slurm",),
    annotations={
        "sunk.coreweave.com/user-id": slurm_uid,
    },
)

with Sandbox.session(defaults) as session:
    sb = session.sandbox()
    result = sb.exec(["hostname"]).result()
    print(result.stdout)

The user-id annotation must be a numeric Linux UID, not a username. When set, SUNK also defaults the group-id to the same value. Set sunk.coreweave.com/group-id separately if the group ID differs.

Troubleshooting

Use the following sections to diagnose common issues when running sandboxes through the SUNK Pod Scheduler.

Placeholder jobs temporarily in completing state

When a sandbox stops, its Slurm placeholder job spends 30 to 60 seconds in the CG (completing) state while cleanup scripts run and the node is released back to the pool. This is normal Slurm behavior and does not affect other jobs running on the same node. See Slurm job states for the full state reference.

Sandboxes not landing where expected

Slurm determines sandbox placement based on the annotations the profile sets or the client passes. If sandboxes are not landing on the expected nodes, verify the Slurm job parameters. SUNK creates a placeholder Slurm job for each sandbox pod with the name <namespace>/<pod-name>. The pod name includes the sandbox ID, which is available from the client as sb.sandbox_id. Find the placeholder job and inspect its parameters by searching for the sandbox ID:

sacct --format=JobID,JobName%60,Partition,Account,AllocCPUS,ReqMem,State,NodeList,Start,End -X \
  | grep [SANDBOX-ID]

Replace [SANDBOX-ID] with the value of sb.sandbox_id from the Python client.

​Prerequisites

​Step 1: Verify the SUNK Pod Scheduler

​Step 2: Create a SUNK-aware profile

​Step 3: Create sandboxes

​Resource requests and Slurm accounting

​Step 4: Control placement with Slurm annotations

​Common annotations

​Enforce annotations in the profile

​Match the Slurm user from a training job

​Troubleshooting

​Placeholder jobs temporarily in completing state

​Sandboxes not landing where expected

​See also

Prerequisites

Step 1: Verify the SUNK Pod Scheduler

Step 2: Create a SUNK-aware profile

Step 3: Create sandboxes

Resource requests and Slurm accounting

Step 4: Control placement with Slurm annotations

Common annotations

Enforce annotations in the profile

Match the Slurm user from a training job

Troubleshooting

Placeholder jobs temporarily in completing state

Sandboxes not landing where expected

See also