CoreWeave SUNK provides a managed, self-service path for running Slurm on Kubernetes. You declare a cluster once, and an operator provisions the compute, the Slurm control plane, login pods, shared storage, and user access for you. By the end of this guide, you have a running SUNK cluster that your users can SSH into and submit Slurm jobs against.

There are two ways to create a managed, self-service SUNK cluster:
- Cloud Console: a guided form with a live YAML preview. The Console performs additional setup for you, including creating a CKS cluster if you do not already have one, enabling the SCIM API for user provisioning, and creating an IAM group.
- `SunkCluster` custom resource (CR): apply a YAML manifest directly to your CKS cluster. This path requires extra, manual setup steps to provision user access.
Both paths produce a `SunkCluster` resource. The operator reconciles it into the underlying NodePools, NodeSets, and SlurmCluster resources that make up a running cluster.
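As a sketch of the CR path, applying a minimal manifest might look like the following. The `apiVersion` and the empty `spec` are illustrative assumptions, not the authoritative schema; consult the SunkCluster parameter reference for the real field names.

```shell
# Hypothetical sketch of applying a SunkCluster manifest.
# The apiVersion below is an assumed group/version, and the spec
# is left empty; fill it in per the SunkCluster parameter reference.
kubectl apply -n tenant-slurm -f - <<'EOF'
apiVersion: sunk.coreweave.com/v1alpha1   # assumed group/version
kind: SunkCluster
metadata:
  name: my-slurm-cluster
spec: {}                                  # see the parameter reference
EOF
```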
Choose Console or SunkCluster CR
When creating a new SUNK cluster in the Console, additional setup steps are automatically performed for you.
If you choose the SunkCluster CR path, you must perform the SCIM API enablement, SCIM token, and IAM group steps yourself. See SUNK user provisioning and nsscache reference for the user-provisioning building blocks.
Creating a new SUNK cluster requires the following steps:
| Setup step | Cloud Console | SunkCluster CR |
|---|---|---|
| Create a CKS cluster (when you don’t already have one) | Automatic | Manual |
| Create NodePools for compute and control plane | Automatic | Automatic (operator) |
| Enable the SCIM API, and create a SCIM token and Kubernetes secret for user provisioning | Automatic | Manual |
| Create the IAM group and assign membership | Automatic | Manual |
| Apply the SunkCluster resource | Automatic | Manual |
Prerequisites
Before you create a SUNK cluster, verify that you have:

- Sufficient compute and CPU quota for the nodes and the Slurm control plane.
- An SSH public key on each user’s CoreWeave profile. Without an SSH public key, a user added to a Slurm group cannot SSH into the login pod and receives a generic “permission denied” error when they attempt to connect. See Connect to a Slurm login node.
For the `SunkCluster` CR path only, you also need:
- An existing CKS cluster you can apply manifests against. See Create a CKS cluster.
- The `tenant-slurm` namespace created in that cluster. The operator expects to find the resource in this namespace.
- The SCIM API enabled and a SCIM token stored as a Kubernetes secret in the cluster. See SUNK user provisioning.
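The namespace and secret prerequisites can be satisfied with standard kubectl commands. The secret name and key below are placeholders; use the names that SUNK user provisioning specifies.

```shell
# Create the namespace the operator expects.
kubectl create namespace tenant-slurm

# Store the SCIM token as a Kubernetes secret.
# "scim-token" and the "token" key are placeholder names; use the
# names documented in SUNK user provisioning.
kubectl create secret generic scim-token \
  --namespace tenant-slurm \
  --from-literal=token="$SCIM_TOKEN"
```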
Create the cluster
Select the appropriate tab below to view the steps for your chosen method.

- Cloud Console
- SunkCluster CR
Open the SUNK page
Sign in to the CoreWeave Cloud Console and go to the SUNK page. Select Create cluster.

The form is on the left and a live YAML preview is on the right. Anything you change in the form updates the YAML, and any field you add directly to the YAML is included when you submit.
Choose a CKS cluster
Select Create new to provision a new CKS cluster, or pick an existing one from the dropdown. If you do not have any existing CKS clusters, the existing-cluster option is unavailable.
Set compute capacity
Each entry in the compute capacity list becomes one managed NodePool and one NodeSet in your cluster. The operator keeps the NodeSet count in sync with the NodePool count, so changing Nodes later scales the cluster automatically. Counts are measured in Nodes, not GPUs.

By default, the form populates with all of your available quota. Reduce the count for any node type you do not want to use, or remove the row entirely. Removed types remain available for future use, up to your quota.

Provision access
Confirm that your SSH public key is attached to your profile. The Console reads the key from your Profile page under Update Slurm attributes.

Choose the groups that should have access to the cluster:
- Slurm users receive standard access.
- Sudo users can run privileged commands.
The Console creates the groups `slurm-users` and `sudo-users` for you. To grant a person access after the cluster is up, add them to one of the selected groups in Administration > Groups.

A user must add their own SSH public key to their CoreWeave profile before they can SSH into the login pod. Adding them to a group provisions the user account inside Slurm, but does not place a key on the login node.
Size the control plane and login pods
Choose CPU resources for the control plane. The control plane runs the Slurm control plane workloads and the login pods.

You can size login pods per-user and per-group. See the SunkCluster parameter reference for more information about user and group settings.

If the form reports that your CPU quota is insufficient, you can:

- Disable `userPods` for select group(s).
- Disable `groupPod` for select group(s).
- Reduce the number of users with login pods.
- Reduce the per-pod login resources.
- Request a quota increase.
Choose a public or private endpoint
The endpoint type controls how users reach the Slurm login pod from outside the cluster. A public endpoint exposes the login service on a routable IP so users can SSH directly. A private endpoint has no public IP and requires a separate ingress mechanism such as Tailscale, which is configured by editing the YAML preview directly.
Configure shared storage
Set the size of the home directory mount and add additional mounts as needed. Shared storage is pooled across all users in the cluster, not per group; users typically create their own subpaths under each mount.
Submit
Review the YAML preview, optionally copy it for source control, and select Submit. The Console creates the CKS cluster (if requested), provisions NodePools, sets up SCIM and IAM, and submits the `SunkCluster` resource to the `tenant-slurm` namespace.

After submitting, the operator begins reconciling the cluster. Proceed to Verify the cluster is ready below to track progress.

Verify the cluster is ready
A new SUNK cluster typically takes around 40 minutes to come up; node provisioning accounts for most of that time. Inspect the `SunkCluster` resource and its status conditions:
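One way to inspect the resource, assuming the CR lives in the `tenant-slurm` namespace (replace `<cluster-name>` with your cluster's name):

```shell
# List SunkCluster resources and their high-level status.
kubectl get sunkclusters -n tenant-slurm

# Show the full set of status conditions for one cluster.
kubectl describe sunkcluster <cluster-name> -n tenant-slurm
```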
The cluster is ready when the aggregate Ready condition is True, which happens only when each of its dependent conditions is True.
While the cluster is coming up for the first time, the Ready condition reports reason: Bootstrapping and a message that lists the dependent conditions still pending (for example, Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable). Each dependent condition reports reason: InProgress while it is in progress, reason: Ready when satisfied, or reason: Error when it fails.
For full descriptions of each condition, the reason values, and the underlying SlurmCluster conditions that SlurmClusterAvailable aggregates, see the SunkCluster reference.
You can also inspect the underlying resources directly:
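For example, commands along these lines list the operator-managed resources (the plural resource names are assumptions; `kubectl api-resources` shows the names registered in your cluster):

```shell
# Operator-managed resources backing the SunkCluster.
kubectl get nodepools
kubectl get nodesets -n tenant-slurm
kubectl get slurmclusters -n tenant-slurm
```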
Troubleshoot
This section covers common errors you may encounter while creating or verifying a SUNK cluster.

kubectl returns Error from server (NotFound) for SunkCluster
The SUNK operator has not been deployed to the CKS cluster, so the SunkCluster CRD is missing. Verify with:
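A check along these lines confirms whether the CRD is installed (the exact CRD name is an assumption, so match on `sunk`):

```shell
# If this prints nothing, the SUNK operator and its CRDs
# have not been deployed to this CKS cluster.
kubectl get crd | grep -i sunk
```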
A user with group access gets Permission denied when they SSH
The user may be in a group that has not been given access to the cluster. See SUNK user provisioning for details about user and group permissions.
If the user has been assigned to a group with cluster access and still cannot SSH into the login pod, verify that the user has added an SSH public key to their CoreWeave profile. Slurm provisions the user account from group membership, but the login node only permits SSH from a key that is attached to the user’s profile. Ask the user to add a key on their Profile page under Update Slurm attributes, then retry the SSH connection.
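To see why the login node rejects the connection, a verbose SSH attempt can help. The hostname below is a placeholder for your cluster's login endpoint:

```shell
# -v prints which keys the client offers and how the server responds,
# which shows whether a public key was offered at all.
ssh -v <username>@<login-endpoint>
```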
The SunkCluster is stuck with Ready: False
Check the conditions on the resource for the first dependent condition that is not True. The reason and message on each condition identify the underlying problem. A common cause is insufficient quota holding NodePoolsAvailable in reason: InProgress.
If SlurmClusterAvailable is the blocker, inspect the SlurmCluster directly to see which subcomponent is pending:
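Assuming the SlurmCluster shares the SunkCluster's name and namespace, something like:

```shell
# Inspect the SlurmCluster's conditions to find the pending subcomponent.
kubectl describe slurmcluster <cluster-name> -n tenant-slurm
```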
Next steps
After your cluster is running, use the following resources to configure and manage it:

- SunkCluster reference for every supported field.
- SUNK and Slurm versions for supported version combinations.
- Connect to a Slurm login node to start running jobs.
- Configure individual login pods to tune login resources.
- Enable GPU straggler detection for long training runs.