Route sandbox pods through the SUNK Pod Scheduler so that Slurm manages their placement alongside Slurm jobs running in your cluster. Sandboxes become regular Slurm jobs and can run on any node in your cluster with available resources, including sharing CPU resources with other Slurm jobs or sandboxes already running on the same node.
CoreWeave sandboxes are in public preview. For access, contact your CoreWeave account team, reach out to CoreWeave Support, or email support@coreweave.com.
Resource sharing rules

Sandboxes that request only CPU resources can land on idle CPU nodes, share CPU nodes with other sandboxes or Slurm jobs, and share CPU resources on GPU nodes where other workloads are running. GPUs cannot be shared between sandboxes and Slurm jobs on the same node because Kubernetes and Slurm use independent GPU allocators. See Known limitations for details.
Prerequisites
- A CKS cluster with SUNK deployed and the SUNK Pod Scheduler enabled.
- A CoreWeave sandbox runner deployed on the same cluster.
- The CoreWeave Intelligent CLI (`cwic`), installed and authenticated. See Deploy and manage a runner for setup details.
- `cwsandbox-client` >= 0.10.0 for client-side annotation support.
Step 1: Verify the SUNK Pod Scheduler
Validate that your SUNK deployment is configured to work with CoreWeave sandboxes.

- Verify the scheduler is running:
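  A minimal check, assuming SUNK is installed in a namespace named `slurm` and the scheduler pod's name contains `scheduler` (adjust both for your cluster):

  ```bash
  kubectl get pods -n slurm | grep -i scheduler
  ```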
  If no pods are returned, enable the scheduler in your Slurm Helm values by setting `scheduler.enabled: true`. See Enable the scheduler for details.

- Look up the scheduler name and `KillWait` value. You will reuse both when you author the profile in Step 2. Look for `--scheduler-name` and `--slurm-kill-wait` in the scheduler's arguments, as sketched below. See Look up the scheduler configuration for the underlying behavior.
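  One way to print those arguments, assuming the scheduler runs as a Deployment named `slurm-scheduler` in the `slurm` namespace (substitute your own names):

  ```bash
  kubectl get deployment slurm-scheduler -n slurm \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
  ```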
- Confirm the scheduler scope covers the sandbox namespaces. The SUNK Pod Scheduler must be able to watch the namespaces where sandbox pods are created. Check your `scheduler.scope.type` setting in the Slurm Helm values. If the scope is `cluster`, the scheduler watches all namespaces and no additional configuration is needed. If the scope is `namespace`, the scheduler only watches the namespaces you list. You have two ways forward:

  - Pin sandboxes to a known namespace. Set the profile's namespace strategy to `static` with a fixed namespace (for example, `sandbox-slurm`) and add that namespace to `scheduler.scope.namespaces`. This is the simplest path and makes sandbox pods and placeholder jobs easy to find. See Choose a namespace strategy.

  - Let the profile create per-user namespaces. Pick a `per-user` or `per-profile` strategy with a recognizable `namespacePrefix` (for example, `sb-`), then list the namespaces that match the prefix after the first sandbox runs, as sketched below. Add those namespaces (or the prefix pattern) to `scheduler.scope.namespaces` and re-roll Slurm.
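    A minimal sketch for the `sb-` prefix example:

    ```bash
    kubectl get namespaces -o name | grep '^namespace/sb-'
    ```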
-
Lower
slurmdresource requests on NodeSets you want to share with sandboxes. This is a change on the Slurm side, in your Slurm Helm values, not in any sandbox configuration. The default NodeSet resource requests consume most of the node’s allocatable capacity in Kubernetes, leaving no room for sandbox pods. Without this change, sandbox pods are rejected withOutOfMemoryorOutOfcpuerrors. See Manage resources with the SUNK Pod Scheduler for configuration details.
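An illustrative sketch only; the exact keys depend on your SUNK chart version and NodeSet definitions, so treat the paths and values below as hypothetical placeholders and follow the linked guide for the real schema:

```yaml
# Hypothetical Slurm Helm values excerpt: shrink slurmd requests so sandbox
# pods have room on shared NodeSets. Key names here are placeholders.
nodesets:
  cpu-shared:
    resources:
      requests:
        cpu: "28"       # a few cores below the node's allocatable CPU
        memory: 200Gi   # leave memory headroom for sandbox pods
```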
Step 2: Create a SUNK-aware profile
A profile defines the execution environment for sandboxes. To route sandboxes through the SUNK Pod Scheduler, create a profile that sets `schedulerName` and a short `terminationGracePeriodSeconds` on the underlying pod spec. Bind the profile to the runner so that sandboxes launched against it flow through Slurm for placement.
Save the following as `slurm-profile.yaml`. Replace `tenant-slurm-slurm-scheduler` with the scheduler name from Step 1, and pick a `terminationGracePeriodSeconds` value that is strictly less than your cluster's `--slurm-kill-wait` minus 5 seconds (for the default 30-second kill wait, any value below 25 works; 20 is a safe default):
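A sketch of the profile, assembled from the field descriptions that follow; the exact layout (the top-level `name` field and the shapes of `spec.namespace` and `spec.network`) is an assumption, so check the profile reference for the authoritative schema:

```yaml
# Hypothetical slurm-profile.yaml; field layout inferred, not authoritative.
name: slurm
spec:
  pod:
    spec:
      schedulerName: tenant-slurm-slurm-scheduler   # scheduler name from Step 1
      terminationGracePeriodSeconds: 20             # stays below --slurm-kill-wait minus 5s
  namespace:
    strategy: per-user      # or: static, with a fixed namespace such as sandbox-slurm
    namespacePrefix: sb-
  network:
    modes:
      - internet            # default outbound mode
      - none
```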
- `spec.pod.spec.schedulerName` tells Kubernetes to hand the pod to the SUNK Pod Scheduler instead of the default Kubernetes scheduler.
- `spec.pod.spec.terminationGracePeriodSeconds` must be strictly less than `--slurm-kill-wait` minus 5 seconds. The Kubernetes default of 30 seconds exceeds the default 25-second threshold, so you must set this explicitly. See Set the termination grace period for the underlying rule.
- `spec.namespace` chooses how sandbox pods are grouped into namespaces. `per-user` gives each user their own namespace; `static` pins every sandbox to a fixed namespace. See Choose a namespace strategy.
- `spec.network` declares the outbound modes sandboxes can pick. The example above exposes `internet` (default) and `none`. See Configure egress.
Create the profile with either the CLI or curl. Then bind the profile to the runner: add the new binding to the runner's `profile_bindings` list. The patch replaces the entire list in one transaction, so include every binding you want the runner to keep.
Save the updated bindings as `runner-bindings.yaml` and apply the patch with either the CLI or curl.
Step 3: Create sandboxes
To route sandboxes through the SUNK Pod Scheduler, ask for the `slurm` profile when you create a sandbox. The SDK's `runway_ids` parameter selects which profile to use; pass the `profile_name` from the runner binding above.
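A minimal sketch, assuming the `cwsandbox` Python client exposes a `Sandbox.create` entry point that accepts `runway_ids`; the import path and `create` call are assumptions, so adapt them to your client version:

```python
from cwsandbox import Sandbox  # assumed import path

# "slurm" is the profile_name from the runner binding above.
sb = Sandbox.create(runway_ids=["slurm"])  # assumed entry point
print(sb.sandbox_id)  # keep the ID handy for troubleshooting (see below)
```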
You can also set `runway_ids` on `SandboxDefaults`:
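A sketch under the same assumptions about the client's surface; the `SandboxDefaults` usage shown here is illustrative only:

```python
from cwsandbox import SandboxDefaults  # assumed import path

# Hypothetical: default every sandbox this process creates to the slurm profile
defaults = SandboxDefaults(runway_ids=["slurm"])
```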
Resource requests and Slurm accounting
SUNK reads the pod's resource requests (not limits) and converts them to Slurm job parameters:

| Pod field | Slurm parameter |
|---|---|
| `requests.cpu` | `CPUsPerTask` |
| `requests.memory` | `MinMemoryNode` |
The converted values appear in `sacct` accounting. SUNK does not require any particular Quality of Service class; Guaranteed (requests equal limits) and Burstable (requests lower than limits) both work.
If you set requests lower than limits with `ResourceOptions`, the pod can burst up to the limits when capacity is free, but Slurm only sees the requests. For example, a sandbox configured with:
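A sketch of that configuration, assuming `ResourceOptions` separates request and limit fields; the field names are assumptions, so check the Resources reference for the real ones:

```python
from cwsandbox import ResourceOptions  # assumed import path

# Hypothetical field names: request 500m CPU / 512Mi memory,
# allow bursting up to 2 CPUs / 2Gi when the node has room.
resources = ResourceOptions(
    cpu_request="500m",
    cpu_limit="2",
    memory_request="512Mi",
    memory_limit="2Gi",
)
```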
That sandbox is recorded in `sacct` as a 500m CPU, 512Mi memory job, even though it can use up to 2 CPUs and 2Gi when the node has room. See Resources for the full `ResourceOptions` reference.
Size the requests to match what your sandbox workloads actually need, leaving enough room on the target nodes for the slurmd requests you lowered in Step 1. For the underlying rules, see Set resource requests and Manage resources with the SUNK Pod Scheduler.
Step 4: Control placement with Slurm annotations
To control sandbox placement, set SUNK annotations on the sandbox at launch time. The example below pins the sandbox to the `hpc-prod` partition:
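A sketch under the same client assumptions as Step 3; the `annotations` parameter name is an assumption, while the annotation key comes from the table below:

```python
from cwsandbox import Sandbox  # assumed import path

# Hypothetical parameter name; pins the placeholder job to the hpc-prod partition.
sb = Sandbox.create(
    runway_ids=["slurm"],
    annotations={"sunk.coreweave.com/partition": "hpc-prod"},
)
```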
Common annotations
All annotations share the `sunk.coreweave.com/` prefix. The annotations most commonly used to control sandbox placement are:
| Annotation key | Description |
|---|---|
| `partition` | Slurm partition name |
| `constraint` | Slurm feature constraint |
| `account` | Slurm accounting name |
| `qos` | Slurm QoS level |
| `user-id` | Slurm user ID for accounting. Must be a numeric Linux UID (for example, `"1000"`) |
| `exclusive` | Node exclusivity (`none`, `user`, or `ok`) |
Passing a username instead of a numeric UID in the `user-id` annotation causes a blocking error that prevents Slurm from scheduling the sandbox. To find the numeric UID from a Slurm login node:
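For example, with the standard `id` utility:

```bash
id -u             # your own UID
id -u someuser    # another user's UID
```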
Enforce annotations in the profile
Operators can pin SUNK annotations on the profile to enforce Slurm job parameters for every sandbox that uses the profile. For example, to restrict all sandboxes on the `slurm` profile to a `sandboxes` partition, add the annotation to `spec.pod.metadata.annotations`:
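A sketch of the relevant profile fragment, reusing the layout assumptions from Step 2:

```yaml
spec:
  pod:
    metadata:
      annotations:
        # Every sandbox on this profile lands in the sandboxes partition
        sunk.coreweave.com/partition: sandboxes
```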
Pod annotations are an exception to the normal layered override order. For most profile fields, a per-sandbox value wins over a profile value because per-sandbox overrides have the highest precedence. For pod annotations, the gateway instead checks for conflicts: if a client passes the same annotation key that the profile already pins, the request is rejected with `annotation_conflict`. This is intentional. It lets operators enforce account, partition, and QoS without users being able to override them.

Match the Slurm user from a training job
Training jobs often run with `--exclusive=user` to claim entire nodes for a single user. This prevents other users' jobs from competing for resources on those nodes while still allowing the same user to run additional jobs there, such as sandboxes that use spare CPU alongside GPU training.
By default, SUNK placeholder jobs run as root (UID 0). Since root is a different user than the one who submitted the training job, Slurm will not place the sandbox placeholder on the exclusive node.
When training code uses the cwsandbox Python client to create sandboxes from within a running Slurm job, it can read the job’s Slurm user ID from the environment and pass it as an annotation. This ensures the sandbox placeholder jobs are submitted under the same Slurm user as the training job, allowing Slurm to place them on the same exclusive nodes:
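A sketch under the same client assumptions as Step 3. The original example may read a Slurm environment variable; here the process's own UID is used, since the training job already runs as the submitting user:

```python
import os

from cwsandbox import Sandbox  # assumed import path

# The training job runs as the submitting user, so its UID matches the Slurm user.
uid = str(os.getuid())

sb = Sandbox.create(
    runway_ids=["slurm"],
    annotations={"sunk.coreweave.com/user-id": uid},  # submit the placeholder job as the same user
)
```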
The `user-id` annotation must be a numeric Linux UID, not a username. When set, SUNK also defaults the `group-id` to the same value. Set `sunk.coreweave.com/group-id` separately if the group ID differs.
Troubleshooting
Placeholder jobs temporarily in completing state
When a sandbox stops, its Slurm placeholder job spends 30-60 seconds in the `CG` (completing) state while cleanup scripts run and the node is released back to the pool. This is normal Slurm behavior and does not affect other jobs running on the same node. See Slurm job states for the full state reference.
Sandboxes not landing where expected
Sandbox placement is determined by Slurm based on the annotations set in the profile or passed from the client. If sandboxes are not landing on the expected nodes, verify the Slurm job parameters. SUNK creates a placeholder Slurm job for each sandbox pod with the name `<namespace>/<pod-name>`. The pod name includes the sandbox ID, which is available from the client as `sb.sandbox_id`. Find the placeholder job and inspect its parameters by searching for the sandbox ID:
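One way to do this with standard Slurm tools from a login node:

```bash
# Find the placeholder job whose name contains the sandbox ID
squeue --Format=jobid,name:80,partition,username | grep "[SANDBOX-ID]"

# Then inspect the job's parameters
scontrol show job <JOB-ID>
```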
Replace `[SANDBOX-ID]` with the value of `sb.sandbox_id` from the Python client.