Route sandbox pods through the SUNK Pod Scheduler so that Slurm manages their placement alongside Slurm jobs running in your cluster. Sandboxes become regular Slurm jobs and can run on any node in your cluster with available resources, including sharing CPU resources with other Slurm jobs or sandboxes already running on the same node.
Resource sharing rules

Sandboxes that request only CPU resources can land on idle CPU nodes, share CPU nodes with other sandboxes or Slurm jobs, and share CPU resources on GPU nodes where other workloads are running. However, GPUs cannot be shared between sandboxes and Slurm jobs on the same node because Kubernetes and Slurm use independent GPU allocators. See GPU resources for details.
Prerequisites
- A CKS cluster with SUNK deployed and the SUNK Pod Scheduler enabled.
- A CoreWeave Sandbox Runner deployed in your CKS cluster.
- Version requirements:
  - Runner image: >=v0.31.0
  - cwsandbox-client: >=0.10.0
Step 1: Verify the SUNK Pod Scheduler
Validate that your SUNK deployment is configured to work with CoreWeave sandboxes.

- Verify the scheduler is running:
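  A minimal check, assuming the scheduler runs in the slurm namespace with an app.kubernetes.io/name=slurm-scheduler label (both are assumptions; adjust the namespace and selector to your deployment):

  ```bash
  # List SUNK Pod Scheduler pods; adjust -n and -l to match your deployment.
  kubectl get pods -n slurm -l app.kubernetes.io/name=slurm-scheduler
  ```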
  If no pods are returned, enable the scheduler in your Slurm Helm values by setting `scheduler.enabled: true`. See Schedule Kubernetes pods with the SUNK Pod Scheduler for details.
- Verify the scheduler name. The name is `tenant-slurm-slurm-scheduler` in most deployments. Confirm by inspecting the scheduler's arguments and noting the value after the `=` sign.
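  One option, assuming the scheduler Deployment is named tenant-slurm-slurm-scheduler and receives its name through a --scheduler-name argument (both are assumptions; adjust to your deployment):

  ```bash
  # Print the scheduler container's args and filter for the name flag.
  kubectl get deployment -n slurm tenant-slurm-slurm-scheduler \
    -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep scheduler-name
  ```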
- Verify the scheduler scope. The SUNK Pod Scheduler must be able to watch the namespaces where sandbox pods are created. Check the `scheduler.scope.type` setting in your Slurm Helm values; see Check the scheduler scope for configuration details. If the scope is set to `cluster`, the scheduler watches all namespaces and no additional configuration is needed. If the scope is set to `namespace`, the scheduler watches only specific namespaces. In that case, either add the sandbox namespaces to `scheduler.scope.namespaces`, or set the profile's namespace strategy to `static` with a fixed namespace that the scheduler already watches (for example, `slurm-sandboxes`). A static namespace also makes sandbox pods and placeholder jobs easier to find.
- Lower `slurmd` resource requests. The default NodeSet resource requests consume most of the node's allocatable capacity in Kubernetes, leaving no room for sandbox pods. Add a `low-requests` compute definition to the NodeSets you want to use with sandboxes so that Kubernetes has capacity to schedule sandbox pods alongside Slurm jobs. Without this change, sandbox pods are rejected with `OutOfmemory` or `OutOfcpu` errors. See Manage resources for configuration details.
Step 2: Add a SUNK profile to the Runner
A profile defines an execution environment for sandboxes. To route sandboxes through the SUNK Pod Scheduler, add a new profile named `slurm` to your Runner's `profileConfig`. This profile exists alongside any profiles you already have configured.
The following sketch shows roughly what the `slurm` profile looks like inside sandbox-tower-values.yaml (the exact field layout depends on your Runner chart's values schema):

sandbox-tower-values.yaml
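```yaml
# Illustrative sketch only: the field names are inferred from the key
# settings described below and may not match your Runner chart's schema.
profileConfig:
  profiles:
    - id: slurm
      pod:
        spec:
          # Hand sandbox pods to the SUNK Pod Scheduler.
          schedulerName: tenant-slurm-slurm-scheduler
          # Must stay below Slurm's KillWait minus 5s (default KillWait: 30s).
          terminationGracePeriodSeconds: 24
      tags:
        - sunk                 # label applied to sandbox pods on this profile
      namespace:
        strategy: per-user     # separate namespace per user
      network:
        egress:
          - internet           # default egress mode
```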
Replace `tenant-slurm-slurm-scheduler` with your scheduler name if it differs from the value in Step 1.
The key settings:

- `schedulerName` tells Kubernetes to hand the pod to the SUNK Pod Scheduler instead of the default scheduler.
- `terminationGracePeriodSeconds` must be less than Slurm's `KillWait` setting minus 5 seconds. The default `KillWait` is 30 seconds, so `24` is a safe value. To check your cluster's `KillWait` value, use the commands shown after this list.
- `tags` adds the `sunk` label to sandbox pods on this profile.
- `namespace` creates a separate namespace per user for sandbox pods.
- `network` defines the available egress modes. Sandboxes default to `internet` egress.
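To check `KillWait`, grep the rendered Slurm configuration in Kubernetes (the ConfigMap name and namespace below are assumptions; adjust to your deployment) or, alternatively, query the live configuration from a Slurm login node:

```bash
# From Kubernetes: grep the rendered slurm.conf
# (ConfigMap name and namespace are assumptions).
kubectl get configmap -n slurm tenant-slurm-slurm-config -o yaml | grep -i killwait

# From a Slurm login node: query the live configuration.
scontrol show config | grep -i killwait
```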
Add the profile to your values
If you already have a sandbox-tower-values.yaml with other profiles, save the profile configuration to a separate file and merge it in with yq:
slurm-profile.yaml
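A sketch of the merge, assuming slurm-profile.yaml holds the profile entry shown in the sketch above under the same top-level keys:

```bash
# Deep-merge the slurm profile into the existing values with yq v4.
# The *+ operator merges maps and appends array entries such as profiles.
yq eval-all '. as $item ireduce ({}; . *+ $item)' \
  sandbox-tower-values.yaml slurm-profile.yaml > merged-values.yaml
```

Pass the merged file to the Helm upgrade in the next step.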
Deploy the updated Runner
Upgrade the Runner deployment with a command like the sketch below, then look for `Registration accepted by server` and `Runner is running` in the output.
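A minimal upgrade sketch; the release name, chart reference, and namespace are placeholders, so reuse the values from your original Runner installation:

```bash
# Re-deploy the Runner with the updated values file (names are placeholders).
helm upgrade --install sandbox-tower <runner-chart> \
  --namespace <runner-namespace> \
  -f sandbox-tower-values.yaml
```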
Step 3: Create sandboxes
To route sandboxes through the SUNK Pod Scheduler, specify the `slurm` profile ID when creating sandboxes. This targets the profile configured in Step 2. For example, set the profile ID on the Python client's SandboxDefaults object:
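A minimal sketch with the cwsandbox Python client; `SandboxDefaults` and `sb.sandbox_id` appear elsewhere on this page, but the import path, `create_sandbox` helper, and `profile` parameter name are assumptions:

```python
# Hypothetical sketch: target the slurm profile from Step 2 so the sandbox
# pod is handed to the SUNK Pod Scheduler.
from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

defaults = SandboxDefaults(profile="slurm")
sb = create_sandbox(defaults=defaults)
print(sb.sandbox_id)  # handy later for locating the placeholder Slurm job
```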
Resource requests and Slurm accounting
SUNK works with whatever resources are on the pod. It does not require or assume any particular Quality of Service class. Whether the sandbox uses Guaranteed QoS (requests equal limits) or Burstable QoS (requests lower than limits), SUNK reads the pod's resource requests and converts them to Slurm job parameters: CPU requests become `CPUsPerTask` and memory requests become `MinMemoryNode`. Slurm uses these values for scheduling decisions and accounting.
When you use `ResourceOptions` to set requests lower than limits, Slurm accounting reflects the requests values while the pod can burst up to limits. For example, a sandbox with `requests={"cpu": "500m", "memory": "512Mi"}` and `limits={"cpu": "2", "memory": "2Gi"}` registers 500m CPU and 512Mi memory in `sacct`, but the sandbox can use up to 2 CPUs and 2Gi memory when capacity is available. See Resources for details on configuring requests and limits with `ResourceOptions`.
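As a sketch, the example above could be expressed with `ResourceOptions` like this (only `ResourceOptions` itself is named on this page; the import path and parameter names are assumptions):

```python
# Hypothetical sketch: requests below limits yields Burstable QoS; Slurm
# accounts for the requests while the pod can burst up to the limits.
from cwsandbox import ResourceOptions, SandboxDefaults  # hypothetical imports

resources = ResourceOptions(
    requests={"cpu": "500m", "memory": "512Mi"},  # what sacct registers
    limits={"cpu": "2", "memory": "2Gi"},         # burst ceiling
)
defaults = SandboxDefaults(profile="slurm", resources=resources)
```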
Size resource requests based on what your sandbox workloads need and what capacity is available on the target Nodes after accounting for slurmd requests. See Manage resources for details on how CPU and memory are shared between Kubernetes and Slurm on the same Node.
Control placement with Slurm annotations
Controlling placement with Slurm annotations requires `cwsandbox-client` >=0.10.0 and tower image >=v0.31.0.

To control sandbox placement, set a `sunk.coreweave.com/partition` annotation in the sandbox configuration:
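A hedged sketch building on the client names assumed above; the `annotations` parameter name is itself an assumption:

```python
# Hypothetical sketch: pin sandboxes created with these defaults to a partition.
from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

defaults = SandboxDefaults(
    profile="slurm",
    annotations={"sunk.coreweave.com/partition": "sandboxes"},
)
sb = create_sandbox(defaults=defaults)
```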
Common annotations
The following annotations are commonly used to control sandbox placement:

| Annotation | Description |
|---|---|
| `sunk.coreweave.com/partition` | Slurm partition name |
| `sunk.coreweave.com/constraint` | Slurm feature constraint |
| `sunk.coreweave.com/account` | Slurm account name |
| `sunk.coreweave.com/qos` | Slurm QoS level |
| `sunk.coreweave.com/user-id` | Slurm user ID for accounting; must be a numeric Linux UID (for example, `"1000"`) |
| `sunk.coreweave.com/exclusive` | Node exclusivity (`none` or `user`) |
Setting a username instead of a numeric UID in the `user-id` annotation causes a blocking error that prevents Slurm from scheduling the sandbox. Use `id <username>` on a Slurm login node to find the numeric UID.
For the full list of annotations, see Schedule Kubernetes pods with the SUNK Pod Scheduler.
Enforce annotations in the profile configuration
Operators can add SUNK annotations to the profile's `pod.metadata.annotations` to enforce Slurm job parameters for all sandboxes on this profile. For example, to restrict all sandboxes to a `sandboxes` partition, add the following to the profile configuration:
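A sketch of the fragment, following the `pod.metadata.annotations` path named above (the surrounding nesting must match your profile layout):

```yaml
# Enforce the sandboxes partition for every sandbox on this profile.
pod:
  metadata:
    annotations:
      sunk.coreweave.com/partition: sandboxes
```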
If a user then sets the same annotation on a sandbox, creation fails with an `annotation_conflict` error. This prevents users from overriding operator-defined settings such as account, partition, or QoS.
To allow users to set an annotation per-sandbox, leave it out of the profile configuration.
Match the Slurm user from a training job
Training jobs often run with `--exclusive=user` to claim entire nodes for a single user. This prevents other users' jobs from competing for resources on those nodes while still allowing the same user to run additional jobs there, such as sandboxes that use spare CPU alongside GPU training.
By default, SUNK placeholder jobs run as root (UID 0). Since root is a different user than the one who submitted the training job, Slurm will not place the sandbox placeholder on the exclusive node.
When training code uses the cwsandbox Python client to create sandboxes from within a running Slurm job, it can read the job’s Slurm user ID from the environment and pass it as an annotation. This ensures the sandbox placeholder jobs are submitted under the same Slurm user as the training job, allowing Slurm to place them on the same exclusive nodes:
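A minimal sketch, reusing the hypothetical client names from the sketches above. Because the batch step runs as the submitting user, the process UID is that user's numeric UID; `SLURM_JOB_UID` is consulted first in case your environment exports it (an assumption):

```python
import os

from cwsandbox import SandboxDefaults, create_sandbox  # hypothetical imports

# The Slurm job runs as the submitting user, so the process UID matches the
# Slurm user; prefer SLURM_JOB_UID when the environment provides it.
uid = os.environ.get("SLURM_JOB_UID", str(os.getuid()))

defaults = SandboxDefaults(
    profile="slurm",
    annotations={"sunk.coreweave.com/user-id": uid},  # numeric UID, not a username
)
sb = create_sandbox(defaults=defaults)
```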
The `user-id` annotation must be a numeric Linux UID, not a username. When set, SUNK also defaults the `group-id` to the same value. Set `sunk.coreweave.com/group-id` separately if the group ID differs.
Troubleshooting
Nodes temporarily in completing state
SUNK may mark nodes as completing (CG state) for 30-60 seconds during sandbox cleanup. This does not affect running training jobs on the same node.

Sandboxes not landing where expected
Sandbox placement is determined by Slurm based on the annotations set in the profile configuration or passed from the client. If sandboxes are not landing on the expected nodes, verify the Slurm job parameters. SUNK creates a placeholder Slurm job for each sandbox pod with the name `<namespace>/<pod-name>`. The pod name includes the sandbox ID, which is available from the client via `sb.sandbox_id`. Find the placeholder job and inspect its parameters by searching for the sandbox ID:
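A sketch of the search, run from a Slurm login node:

```bash
# List placeholder jobs whose name contains the sandbox ID.
squeue --format="%i %j %P %T %R" | grep [SANDBOX-ID]

# Inspect the matching job's parameters (partition, account, constraints, ...).
scontrol show job <job-id>
```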
Replace [SANDBOX-ID] with the value of `sb.sandbox_id` from the Python client.