

CoreWeave SUNK provides a managed, self-service path for running Slurm on Kubernetes. You declare a cluster once, and an operator provisions the compute, the Slurm control plane, login pods, shared storage, and user access for you. By the end of this guide, you'll have a running SUNK cluster that your users can SSH into and submit Slurm jobs against. There are two ways to create a managed, self-service SUNK cluster:
  • Cloud Console: a guided form with a live YAML preview. The Console performs additional setup for you, including creating a CKS cluster if you do not already have one, enabling the SCIM API for user provisioning, and creating an IAM group.
  • SunkCluster custom resource (CR): apply a YAML manifest directly to your CKS cluster. This path requires extra, manual setup steps to provision user access.
Both paths submit the same SunkCluster resource. The operator reconciles it into the underlying NodePools, NodeSets, and SlurmCluster resource that make up a running cluster.
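As a sketch of what both paths submit, a minimal SunkCluster manifest might look like the following. The API group matches the CRD named later in this guide (sunkclusters.sunk.coreweave.com), but the version and the field names under spec are illustrative placeholders, not the authoritative schema; see the SunkCluster parameter reference for the real fields.

```yaml
# Hypothetical sketch of a SunkCluster resource. The version string and
# the spec field names are placeholders, not the authoritative schema.
apiVersion: sunk.coreweave.com/v1alpha1   # assumed version
kind: SunkCluster
metadata:
  name: my-sunk-cluster
  namespace: tenant-slurm     # the operator expects this namespace
spec:
  # Each compute entry becomes one NodePool and one NodeSet.
  compute:
    - instanceType: gd-8xh100ib-i128   # example GPU node type
      nodes: 4
```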

Choose Console or SunkCluster CR

When creating a new SUNK cluster in the Console, additional setup steps are automatically performed for you. If you choose the SunkCluster CR path, you must perform the SCIM API enablement, SCIM token, and IAM group steps yourself. See SUNK user provisioning and nsscache reference for the user-provisioning building blocks. Creating a new SUNK cluster requires the following steps:
Setup step | Cloud Console | SunkCluster CR
Create a CKS cluster (when you don’t already have one) | Automatic | Manual
Create NodePools for compute and control plane | Automatic | Automatic (operator)
Enable the SCIM API, and create a SCIM token and Kubernetes secret for user provisioning | Automatic | Manual
Create the IAM group and assign membership | Automatic | Manual
Apply the SunkCluster resource | Automatic | Manual

Prerequisites

Before you create a SUNK cluster, verify that you have:
  • Sufficient compute and CPU quota for the nodes and the Slurm control plane.
  • An SSH public key on each user’s CoreWeave profile. Without an SSH public key, a user added to a Slurm group cannot SSH in to the login pod, and receives a generic “permission denied” error when they attempt to SSH. See Connect to a Slurm login node.
For the SunkCluster CR path only, you also need:
  • An existing CKS cluster you can apply manifests against. See Create a CKS cluster.
  • The tenant-slurm namespace created in that cluster. The operator expects to find the resource in this namespace.
  • The SCIM API enabled and a SCIM token stored as a Kubernetes secret in the cluster. See SUNK user provisioning.
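For the CR path, the SCIM token is stored as an ordinary Kubernetes Secret in the cluster. A minimal sketch follows; the Secret name and key shown here are assumptions for illustration, and the actual names the operator expects are documented in SUNK user provisioning.

```yaml
# Hypothetical Secret holding the SCIM token. The name and key the
# operator looks up are defined in the SUNK user provisioning docs.
apiVersion: v1
kind: Secret
metadata:
  name: scim-token            # assumed name
  namespace: tenant-slurm
type: Opaque
stringData:
  token: <your-scim-token>    # paste the SCIM token you created
```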

Create the cluster

Select the appropriate tab below to view the steps for your chosen method.
1

Open the SUNK page

Sign in to the CoreWeave Cloud Console and go to the SUNK page. Select Create cluster.

The form is on the left and a live YAML preview is on the right. Anything you change in the form updates the YAML, and any field you add directly to the YAML is included when you submit.
2

Choose a CKS cluster

Select Create new to provision a new CKS cluster, or pick an existing one from the dropdown. If you do not have any existing CKS clusters, the existing-cluster option is unavailable.
3

Set compute capacity

Each entry in the compute capacity list becomes one managed NodePool and one NodeSet in your cluster. The operator keeps the NodeSet count in sync with the NodePool count, so changing Nodes later scales the cluster automatically. Counts are measured in Nodes, not GPUs.

By default, the form populates with all of your available quota. Reduce the count for any node type you do not want to use, or remove the row entirely. Removed types remain available for future use, up to your quota.
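In the YAML preview, each row of the list maps to one entry in the cluster spec. A hedged sketch, with placeholder field names (not the authoritative schema):

```yaml
# Illustrative compute capacity entries; field names are placeholders.
# Each entry maps to one NodePool and one NodeSet, and changing "nodes"
# later rescales the cluster. Counts are nodes, not GPUs.
compute:
  - instanceType: gd-8xh100ib-i128   # example GPU node type
    nodes: 8
  - instanceType: gd-8xa100-i128     # example second node type
    nodes: 2
```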
4

Provision access

Confirm that your SSH public key is attached to your profile. The Console reads the key from your Profile page under Update Slurm attributes.

Choose the groups that should have access to the cluster:
  • Slurm users receive standard access.
  • Sudo users can run privileged commands.
Select Default to have the Console create groups named slurm-users and sudo-users for you. To grant a person access after the cluster is up, add them to one of the selected groups in Administration > Groups.
A user must add their own SSH public key to their CoreWeave profile before they can SSH in to the login pod. Adding them to a group provisions the user account inside Slurm, but does not place a key on the login node.
5

Size the control plane and login pods

Choose CPU resources for the control plane. The control plane runs the Slurm control plane workloads and the login pods.

You can size login pods per user and per group. See the SunkCluster parameter reference for more information about user and group settings.

If the form reports that your CPU quota is insufficient, you can:
  • Disable userPods for selected groups.
  • Disable groupPod for selected groups.
  • Reduce the number of users with login pods.
  • Reduce the per-pod login resources.
  • Request a quota increase.
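The quota-reducing options above correspond to per-group settings in the cluster spec. A minimal sketch, assuming the userPods and groupPod toggles named above sit under a per-group login section; every other field name here is a placeholder, and the SunkCluster parameter reference has the real schema:

```yaml
# Illustrative login-pod sizing fragment. Only userPods and groupPod
# are named in this guide; the surrounding field names are placeholders.
login:
  groups:
    - name: slurm-users
      userPods: false    # disable per-user login pods for this group
      groupPod: true     # keep one shared login pod for the group
      resources:         # per-pod login resources
        cpu: "2"
        memory: 8Gi
```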
6

Choose a public or private endpoint

The endpoint type controls how users reach the Slurm login pod from outside the cluster. A public endpoint exposes the login service on a routable IP so users can SSH directly. A private endpoint has no public IP and requires a separate ingress mechanism such as Tailscale, which is configured by editing the YAML preview directly.
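In the YAML preview this is a small fragment of the spec. A hedged sketch with placeholder field names (the private variant is the one you would extend by hand for an ingress mechanism such as Tailscale):

```yaml
# Illustrative endpoint fragment; field names are placeholders.
endpoint:
  type: Public   # routable IP, direct SSH; "Private" requires separate ingress
```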
7

Configure shared storage

Set the size of the home directory mount and add additional mounts as needed. Shared storage is pooled across all users in the cluster, not per group; users typically create their own subpaths under each mount.
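As a sketch of how this might appear in the YAML preview, with placeholder field names and example sizes (the mount names and paths below are illustrative):

```yaml
# Illustrative shared-storage fragment; field names are placeholders.
storage:
  home:
    size: 1Ti              # home directory mount, shared by all users
  mounts:
    - name: datasets       # example additional shared mount
      size: 10Ti
      path: /mnt/datasets
```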
8

Submit

Review the YAML preview, optionally copy it for source control, and select Submit. The Console creates the CKS cluster (if requested), provisions NodePools, sets up SCIM and IAM, and submits the SunkCluster resource to the tenant-slurm namespace.

After submitting, the operator begins reconciling the cluster. Proceed to Verify the cluster is ready below to track progress.

Verify the cluster is ready

A new SUNK cluster typically takes around 40 minutes to come up; node provisioning accounts for most of that time. Inspect the SunkCluster resource and its status conditions:
kubectl get sunkcluster -n tenant-slurm -o yaml
The cluster is ready when the Ready condition is True. The aggregate Ready condition is True only when all of its dependent conditions are True. While the cluster is coming up for the first time, the Ready condition reports reason: Bootstrapping and a message that lists the dependent conditions still pending (for example, Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable). Each dependent condition reports reason: InProgress while it is in progress, reason: Ready when satisfied, or reason: Error when it fails. For full descriptions of each condition, the reason values, and the underlying SlurmCluster conditions that SlurmClusterAvailable aggregates, see the SunkCluster reference.

You can also inspect the underlying resources directly:
kubectl get slurmcluster -n tenant-slurm
kubectl get nodepools
kubectl get nodesets -n tenant-slurm
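For reference, while the cluster is still bootstrapping, the status conditions on the SunkCluster might look something like the following. This is an illustrative example built from the condition names and reason values described above, not verbatim operator output:

```yaml
# Illustrative status.conditions during bootstrap; message text is an example.
status:
  conditions:
    - type: Ready
      status: "False"
      reason: Bootstrapping
      message: "Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable"
    - type: NodePoolsAvailable
      status: "False"
      reason: InProgress
    - type: SlurmClusterAvailable
      status: "False"
      reason: InProgress
```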
Once the cluster is ready, see Connect to a Slurm login node to log in and submit jobs.

Troubleshoot

This section covers common errors you may encounter while creating or verifying a SUNK cluster.

kubectl returns Error from server (NotFound) for SunkCluster

The SUNK operator has not been deployed to the CKS cluster, so the SunkCluster CRD is missing. Verify with:
kubectl get crd sunkclusters.sunk.coreweave.com
If the CRD does not exist, contact your CoreWeave Solutions Architect to enable the operator on your cluster.

A user with group access gets Permission denied when they SSH

The user may be in a group that has not been given access to the cluster. See SUNK user provisioning for details about user and group permissions. If the user has been assigned to a group with cluster access and still cannot SSH into the login pod, verify that the user has added an SSH public key to their CoreWeave profile. Slurm provisions the user account from group membership, but the login node only permits SSH from a key that is attached to the user’s profile. Ask the user to add a key on their Profile page under Update Slurm attributes, then retry the SSH connection.

The SunkCluster is stuck with Ready: False

Check the conditions on the resource for the first dependent condition that is not True. The reason and message on each condition identify the underlying problem. A common cause is insufficient quota holding NodePoolsAvailable in reason: InProgress. If SlurmClusterAvailable is the blocker, inspect the SlurmCluster directly to see which subcomponent is pending:
kubectl get slurmcluster -n tenant-slurm -o yaml

Next steps

After your cluster is running, use the following resources to configure and manage it:
Last modified on May 8, 2026