Route sandbox pods through the SUNK Pod Scheduler so that Slurm manages their placement alongside Slurm jobs running in your cluster. Sandboxes become regular Slurm jobs and can run on any node in your cluster with available resources, including sharing CPU resources with other Slurm jobs or sandboxes already running on the same node.
Resource sharing rules

Sandboxes that request only CPU resources can land on idle CPU nodes, share CPU nodes with other sandboxes or Slurm jobs, and share CPU resources on GPU nodes where other workloads are running. However, GPUs cannot be shared between sandboxes and Slurm jobs on the same node because Kubernetes and Slurm use independent GPU allocators. See GPU resources for details.
Prerequisites
- A CKS cluster with SUNK deployed and the SUNK Pod Scheduler enabled.
- A CoreWeave Sandbox Runner deployed in your CKS cluster.
- Version requirements:
  - Runner image: >=v0.31.0
  - cwsandbox-client: >=0.10.0
Step 1: Verify the SUNK Pod Scheduler
Validate that your SUNK deployment is configured to work with CoreWeave sandboxes.

- Verify the scheduler is running:
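  A minimal check, assuming the scheduler runs in the slurm namespace with an app.kubernetes.io/name=slurm-scheduler label (both are assumptions; adjust the namespace and selector to your deployment):

  ```bash
  # List SUNK Pod Scheduler pods; adjust -n and -l to match your deployment.
  kubectl get pods -n slurm -l app.kubernetes.io/name=slurm-scheduler
  ```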
  If no pods are returned, enable the scheduler in your Slurm Helm values by setting `scheduler.enabled: true`. See Schedule Kubernetes pods with the SUNK Pod Scheduler for details.
- Verify the scheduler name. The name is `tenant-slurm-slurm-scheduler` in most deployments. Confirm by inspecting the scheduler's arguments and noting the value after the `=` sign.
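  One option, assuming the scheduler Deployment is named tenant-slurm-slurm-scheduler and receives its name through a --scheduler-name argument (both are assumptions; adjust to your deployment):

  ```bash
  # Print the scheduler container's args and filter for the name flag.
  kubectl get deployment -n slurm tenant-slurm-slurm-scheduler \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep scheduler-name
  ```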
- Verify the scheduler scope. The SUNK Pod Scheduler must be able to watch the namespaces where sandbox pods are created. Check the `scheduler.scope.type` setting in your Slurm Helm values; see Check the scheduler scope for configuration details. If the scope is set to `cluster`, the scheduler watches all namespaces and no additional configuration is needed. If the scope is set to `namespace`, the scheduler watches only specific namespaces. In that case, either add the sandbox namespaces to `scheduler.scope.namespaces`, or set the profile's namespace strategy to `static` with a fixed namespace that the scheduler already watches (for example, `slurm-sandboxes`). A static namespace also makes sandbox pods and placeholder jobs easier to find.
- Lower `slurmd` resource requests. The default NodeSet resource requests consume most of the node's allocatable capacity in Kubernetes, leaving no room for sandbox pods. Add a `low-requests` compute definition to the NodeSets you want to use with sandboxes so that Kubernetes has capacity to schedule sandbox pods alongside Slurm jobs. Without this change, sandbox pods are rejected with `OutOfmemory` or `OutOfcpu` errors. See Manage resources for configuration details.
Step 2: Add a SUNK profile to the Runner
A profile defines an execution environment for sandboxes. To route sandboxes through the SUNK Pod Scheduler, add a new profile named `slurm` to your Runner's `profileConfig`. This profile exists alongside any profiles you already have configured.
The following sketch shows roughly what the `slurm` profile looks like inside sandbox-tower-values.yaml (the exact field layout depends on your Runner chart's values schema):

sandbox-tower-values.yaml
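```yaml
# Illustrative sketch only: the field names are inferred from the key
# settings described below and may not match your Runner chart's schema.
profileConfig:
  profiles:
    - id: slurm
      pod:
        spec:
          # Hand sandbox pods to the SUNK Pod Scheduler.
          schedulerName: tenant-slurm-slurm-scheduler
          # Must stay below Slurm's KillWait minus 5s (default KillWait: 30s).
          terminationGracePeriodSeconds: 24
      tags:
        - sunk                 # label applied to sandbox pods on this profile
      namespace:
        strategy: per-user     # separate namespace per user
      network:
        egress:
          - internet           # default egress mode
```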
Replace `tenant-slurm-slurm-scheduler` with your scheduler name if it differs from the value in Step 1.
The key settings:

- `schedulerName` tells Kubernetes to hand the pod to the SUNK Pod Scheduler instead of the default scheduler.
- `terminationGracePeriodSeconds` must be less than Slurm's `KillWait` setting minus 5 seconds. The default `KillWait` is 30 seconds, so `24` is a safe value. To check your cluster's `KillWait` value, use the commands shown after this list.
- `tags` adds the `sunk` label to sandbox pods on this profile.
- `namespace` creates a separate namespace per user for sandbox pods.
- `network` defines the available egress modes. Sandboxes default to `internet` egress.
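To check `KillWait`, grep the rendered Slurm configuration in Kubernetes (the ConfigMap name and namespace below are assumptions; adjust to your deployment) or, alternatively, query the live configuration from a Slurm login node:

```bash
# From Kubernetes: grep the rendered slurm.conf
# (ConfigMap name and namespace are assumptions).
kubectl get configmap -n slurm tenant-slurm-slurm-config -o yaml | grep -i killwait

# From a Slurm login node: query the live configuration.
scontrol show config | grep -i killwait
```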
Add the profile to your values
If you already have a sandbox-tower-values.yaml with other profiles, save the profile configuration to a separate file and merge it in with yq:
slurm-profile.yaml
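A sketch of the merge, assuming slurm-profile.yaml holds the profile entry shown in the sketch above under the same top-level keys:

```bash
# Deep-merge the slurm profile into the existing values with yq v4.
# The *+ operator merges maps and appends array entries such as profiles.
yq eval-all '. as $item ireduce ({}; . *+ $item)' \
  sandbox-tower-values.yaml slurm-profile.yaml > merged-values.yaml
```

Pass the merged file to the Helm upgrade in the next step.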
Deploy the updated Runner
Upgrade the Runner deployment with a command like the sketch below, then look for `Registration accepted by server` and `Runner is running` in the output.
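A minimal upgrade sketch; the release name, chart reference, and namespace are placeholders, so reuse the values from your original Runner installation:

```bash
# Re-deploy the Runner with the updated values file (names are placeholders).
helm upgrade --install sandbox-tower <runner-chart> \
  --namespace <runner-namespace> \
  -f sandbox-tower-values.yaml
```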
Step 3: Create sandboxes
To route sandboxes through the SUNK Pod Scheduler, specify the `slurm` profile ID when creating sandboxes. This targets the profile configured in Step 2. For example, set the profile ID on the Python client's SandboxDefaults object:
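A minimal sketch with the cwsandbox Python client; `SandboxDefaults` and `sb.sandbox_id` appear elsewhere on this page, but the import path, `create_sandbox` helper, and `profile` parameter name are assumptions:

```python
# Hypothetical sketch: target the slurm profile from Step 2 so the sandbox
# pod is handed to the SUNK Pod Scheduler.
from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

defaults = SandboxDefaults(profile="slurm")
sb = create_sandbox(defaults=defaults)
print(sb.sandbox_id)  # handy later for locating the placeholder Slurm job
```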
Resource requests and Slurm accounting
SUNK works with whatever resources are on the pod. It does not require or assume any particular Quality of Service class. Whether the sandbox uses Guaranteed QoS (requests equal limits) or Burstable QoS (requests lower than limits), SUNK reads the pod's resource requests and converts them to Slurm job parameters: CPU requests become `CPUsPerTask` and memory requests become `MinMemoryNode`. Slurm uses these values for scheduling decisions and accounting.
When you use `ResourceOptions` to set requests lower than limits, Slurm accounting reflects the requests values while the pod can burst up to limits. For example, a sandbox with `requests={"cpu": "500m", "memory": "512Mi"}` and `limits={"cpu": "2", "memory": "2Gi"}` registers 500m CPU and 512Mi memory in `sacct`, but the sandbox can use up to 2 CPUs and 2Gi memory when capacity is available. See Resources for details on configuring requests and limits with `ResourceOptions`.
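As a sketch, the example above could be expressed with `ResourceOptions` like this (only `ResourceOptions` itself is named on this page; the import path and parameter names are assumptions):

```python
# Hypothetical sketch: requests below limits yields Burstable QoS; Slurm
# accounts for the requests while the pod can burst up to the limits.
from cwsandbox import ResourceOptions, SandboxDefaults  # hypothetical imports

resources = ResourceOptions(
    requests={"cpu": "500m", "memory": "512Mi"},  # what sacct registers
    limits={"cpu": "2", "memory": "2Gi"},         # burst ceiling
)
defaults = SandboxDefaults(profile="slurm", resources=resources)
```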
Size resource requests based on what your sandbox workloads need and what capacity is available on the target Nodes after accounting for slurmd requests. See Manage resources for details on how CPU and memory are shared between Kubernetes and Slurm on the same Node.
Control placement with Slurm annotations
Controlling placement with Slurm annotations requires `cwsandbox-client` >=0.10.0 and tower image >=v0.31.0.

To control sandbox placement, set a `sunk.coreweave.com/partition` annotation in the sandbox configuration:
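A hedged sketch building on the client names assumed above; the `annotations` parameter name is itself an assumption:

```python
# Hypothetical sketch: pin sandboxes created with these defaults to a partition.
from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

defaults = SandboxDefaults(
    profile="slurm",
    annotations={"sunk.coreweave.com/partition": "sandboxes"},
)
sb = create_sandbox(defaults=defaults)
```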
Common annotations
The following annotations are commonly used to control sandbox placement:

| Annotation | Description |
|---|---|
| `sunk.coreweave.com/partition` | Slurm partition name |
| `sunk.coreweave.com/constraint` | Slurm feature constraint |
| `sunk.coreweave.com/account` | Slurm account name |
| `sunk.coreweave.com/qos` | Slurm QoS level |
| `sunk.coreweave.com/user-id` | Slurm user ID for accounting; must be a numeric Linux UID (for example, `"1000"`) |
| `sunk.coreweave.com/exclusive` | Node exclusivity (`none` or `user`) |
Setting a username instead of a numeric UID in the `user-id` annotation causes a blocking error that prevents Slurm from scheduling the sandbox. Use `id <username>` on a Slurm login node to find the numeric UID.
For the full list of annotations, see Schedule Kubernetes pods with the SUNK Pod Scheduler.
Enforce annotations in the profile configuration
Operators can add SUNK annotations to the profile's `pod.metadata.annotations` to enforce Slurm job parameters for all sandboxes on this profile. For example, to restrict all sandboxes to a `sandboxes` partition, add the following to the profile configuration:
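A sketch of the fragment, following the `pod.metadata.annotations` path named above (the surrounding nesting must match your profile layout):

```yaml
# Enforce the sandboxes partition for every sandbox on this profile.
pod:
  metadata:
    annotations:
      sunk.coreweave.com/partition: sandboxes
```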
If a user then sets the same annotation on a sandbox, creation fails with an `annotation_conflict` error. This prevents users from overriding operator-defined settings such as account, partition, or QoS.
To allow users to set an annotation per-sandbox, leave it out of the profile configuration.
Match the Slurm user from a training job
Training jobs often run with `--exclusive=user` to claim entire nodes for a single user. This prevents other users' jobs from competing for resources on those nodes while still allowing the same user to run additional jobs there, such as sandboxes that use spare CPU alongside GPU training.
By default, SUNK placeholder jobs run as root (UID 0). Since root is a different user than the one who submitted the training job, Slurm will not place the sandbox placeholder on the exclusive node.
When training code uses the cwsandbox Python client to create sandboxes from within a running Slurm job, it can read the job’s Slurm user ID from the environment and pass it as an annotation. This ensures the sandbox placeholder jobs are submitted under the same Slurm user as the training job, allowing Slurm to place them on the same exclusive nodes:
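A minimal sketch, reusing the hypothetical client names from the sketches above. Because the batch step runs as the submitting user, the process UID is that user's numeric UID; `SLURM_JOB_UID` is consulted first in case your environment exports it (an assumption):

```python
import os

from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

# The Slurm job runs as the submitting user, so the process UID matches the
# Slurm user; prefer SLURM_JOB_UID when the environment provides it.
uid = os.environ.get("SLURM_JOB_UID", str(os.getuid()))

defaults = SandboxDefaults(
    profile="slurm",
    annotations={"sunk.coreweave.com/user-id": uid},  # numeric UID, not a username
)
sb = create_sandbox(defaults=defaults)
```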
The `user-id` annotation must be a numeric Linux UID, not a username. When set, SUNK also defaults the `group-id` to the same value. Set `sunk.coreweave.com/group-id` separately if the group ID differs.
Troubleshooting
Nodes temporarily in completing state
SUNK may mark nodes as completing (CG state) for 30-60 seconds during sandbox cleanup. This does not affect running training jobs on the same node.

Sandboxes not landing where expected
Sandbox placement is determined by Slurm based on the annotations set in the profile configuration or passed from the client. If sandboxes are not landing on the expected nodes, verify the Slurm job parameters. SUNK creates a placeholder Slurm job for each sandbox pod with the name `<namespace>/<pod-name>`. The pod name includes the sandbox ID, which is available from the client via `sb.sandbox_id`. Find the placeholder job and inspect its parameters by searching for the sandbox ID:
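A sketch of the search, run from a Slurm login node:

```bash
# List placeholder jobs whose name contains the sandbox ID.
squeue --format="%i %j %P %T %R" | grep [SANDBOX-ID]

# Inspect the matching job's parameters (partition, account, constraints, ...).
scontrol show job <job-id>
```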
Replace [SANDBOX-ID] with the value of `sb.sandbox_id` from the Python client.