CoreWeave SUNK provides a managed, self-service path for running Slurm on Kubernetes. You declare a cluster once, and an operator provisions the compute, the Slurm control plane, login pods, shared storage, and user access for you. By the end of this guide, you have a running SUNK cluster that your users can SSH into and submit Slurm jobs against.

There are two ways to create a managed, self-service SUNK cluster:
- Cloud Console: a guided form with a live YAML preview. The Console performs additional setup for you, including creating a CKS cluster if you do not already have one, enabling the SCIM API for user provisioning, and creating an IAM group.
- `SunkCluster` custom resource (CR): apply a YAML manifest directly to your CKS cluster. This path requires extra, manual setup steps to provision user access.
Both paths produce a `SunkCluster` resource. The operator reconciles it into the underlying NodePools, NodeSets, and SlurmCluster resources that make up a running cluster.
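As a sketch of the CR path, applying a minimal manifest might look like the following. The `apiVersion` and the empty `spec` are illustrative assumptions, not the authoritative schema; consult the SunkCluster parameter reference for the real field names.

```shell
# Hypothetical sketch of applying a SunkCluster manifest.
# The apiVersion below is an assumed group/version, and the spec
# is left empty; fill it in per the SunkCluster parameter reference.
kubectl apply -n tenant-slurm -f - <<'EOF'
apiVersion: sunk.coreweave.com/v1alpha1   # assumed group/version
kind: SunkCluster
metadata:
  name: my-slurm-cluster
spec: {}                                  # see the parameter reference
EOF
```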
Choose Console or SunkCluster CR
When creating a new SUNK cluster in the Console, additional setup steps are automatically performed for you.
If you choose the SunkCluster CR path, you must perform the SCIM API enablement, SCIM token, and IAM group steps yourself. See SUNK user provisioning and nsscache reference for the user-provisioning building blocks.
Creating a new SUNK cluster requires the following steps:
| Setup step | Cloud Console | SunkCluster CR |
|---|---|---|
| Create a CKS cluster (when you don’t already have one) | Automatic | Manual |
| Create NodePools for compute and control plane | Automatic | Automatic (operator) |
| Enable the SCIM API, and create a SCIM token and Kubernetes secret for user provisioning | Automatic | Manual |
| Create the IAM group and assign membership | Automatic | Manual |
| Apply the SunkCluster resource | Automatic | Manual |
Prerequisites
Before you create a SUNK cluster, verify that you have:

- Sufficient compute and CPU quota for the nodes and the Slurm control plane.
- An SSH public key on each user’s CoreWeave profile. Without an SSH public key, a user added to a Slurm group cannot SSH into the login pod and receives a generic “permission denied” error when they attempt to connect. See Connect to a Slurm login node.
For the `SunkCluster` CR path only, you also need:
- An existing CKS cluster you can apply manifests against. See Create a CKS cluster.
- The `tenant-slurm` namespace created in that cluster. The operator expects to find the resource in this namespace.
- The SCIM API enabled and a SCIM token stored as a Kubernetes secret in the cluster. See SUNK user provisioning.
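The namespace and secret prerequisites can be satisfied with standard kubectl commands. The secret name and key below are placeholders; use the names that SUNK user provisioning specifies.

```shell
# Create the namespace the operator expects.
kubectl create namespace tenant-slurm

# Store the SCIM token as a Kubernetes secret.
# "scim-token" and the "token" key are placeholder names; use the
# names documented in SUNK user provisioning.
kubectl create secret generic scim-token \
  --namespace tenant-slurm \
  --from-literal=token="$SCIM_TOKEN"
```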
Create the cluster
Select the appropriate tab below to view the steps for your chosen method.

- Cloud Console
- SunkCluster CR
Open the SUNK page
Sign in to the CoreWeave Cloud Console and go to the SUNK page. Select Create cluster.

The form is on the left and a live YAML preview is on the right. Anything you change in the form updates the YAML, and any field you add directly to the YAML is included when you submit.
Choose a CKS cluster
Select Create new to provision a new CKS cluster, or pick an existing one from the dropdown. If you do not have any existing CKS clusters, the existing-cluster option is unavailable.
Set compute capacity
Each entry in the compute capacity list becomes one managed NodePool and one NodeSet in your cluster. The operator keeps the NodeSet count in sync with the NodePool count, so changing Nodes later scales the cluster automatically. Counts are measured in Nodes, not GPUs.

By default, the form populates with all of your available quota. Reduce the count for any node type you do not want to use, or remove the row entirely. Removed types remain available for future use, up to your quota.

Provision access
Confirm that your SSH public key is attached to your profile. The Console reads the key from your Profile page under Update Slurm attributes.

Choose the groups that should have access to the cluster:
- Slurm users receive standard access.
- Sudo users can run privileged commands.
The Console creates the groups `slurm-users` and `sudo-users` for you. To grant a person access after the cluster is up, add them to one of the selected groups in Administration > Groups.

A user must add their own SSH public key to their CoreWeave profile before they can SSH into the login pod. Adding them to a group provisions the user account inside Slurm, but does not place a key on the login node.
Size the control plane and login pods
Choose CPU resources for the control plane. The control plane runs the Slurm control plane workloads and the login pods.

You can size login pods per-user and per-group. See the SunkCluster parameter reference for more information about user and group settings.

If the form reports that your CPU quota is insufficient, you can:

- Disable `userPods` for select group(s).
- Disable `groupPod` for select group(s).
- Reduce the number of users with login pods.
- Reduce the per-pod login resources.
- Request a quota increase.
Choose a public or private endpoint
The endpoint type controls how users reach the Slurm login pod from outside the cluster. A public endpoint exposes the login service on a routable IP so users can SSH directly. A private endpoint has no public IP and requires a separate ingress mechanism such as Tailscale, which is configured by editing the YAML preview directly.
Configure shared storage
Set the size of the home directory mount and add additional mounts as needed. Shared storage is pooled across all users in the cluster, not per group; users typically create their own subpaths under each mount.
Submit
Review the YAML preview, optionally copy it for source control, and select Submit. The Console creates the CKS cluster (if requested), provisions NodePools, sets up SCIM and IAM, and submits the `SunkCluster` resource to the `tenant-slurm` namespace.

After submitting, the operator begins reconciling the cluster. Proceed to Verify the cluster is ready below to track progress.

Verify the cluster is ready
A new SUNK cluster typically takes around 40 minutes to come up; node provisioning accounts for most of that time. Inspect the `SunkCluster` resource and its status conditions:
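One way to inspect the resource, assuming the CR lives in the `tenant-slurm` namespace (replace `<cluster-name>` with your cluster's name):

```shell
# List SunkCluster resources and their high-level status.
kubectl get sunkclusters -n tenant-slurm

# Show the full set of status conditions for one cluster.
kubectl describe sunkcluster <cluster-name> -n tenant-slurm
```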
The cluster is ready when the aggregate Ready condition is True, which happens only when each of its dependent conditions is True.
While the cluster is coming up for the first time, the Ready condition reports reason: Bootstrapping and a message that lists the dependent conditions still pending (for example, Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable). Each dependent condition reports reason: InProgress while it is in progress, reason: Ready when satisfied, or reason: Error when it fails.
For full descriptions of each condition, the reason values, and the underlying SlurmCluster conditions that SlurmClusterAvailable aggregates, see the SunkCluster reference.
You can also inspect the underlying resources directly:
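For example, commands along these lines list the operator-managed resources (the plural resource names are assumptions; `kubectl api-resources` shows the names registered in your cluster):

```shell
# Operator-managed resources backing the SunkCluster.
kubectl get nodepools
kubectl get nodesets -n tenant-slurm
kubectl get slurmclusters -n tenant-slurm
```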
Troubleshoot
This section covers common errors you may encounter while creating or verifying a SUNK cluster.

kubectl returns Error from server (NotFound) for SunkCluster
The SUNK operator has not been deployed to the CKS cluster, so the SunkCluster CRD is missing. Verify with:
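A check along these lines confirms whether the CRD is installed (the exact CRD name is an assumption, so match on `sunk`):

```shell
# If this prints nothing, the SUNK operator and its CRDs
# have not been deployed to this CKS cluster.
kubectl get crd | grep -i sunk
```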
A user with group access gets Permission denied when they SSH
The user may be in a group that has not been given access to the cluster. See SUNK user provisioning for details about user and group permissions.
If the user has been assigned to a group with cluster access and still cannot SSH into the login pod, verify that the user has added an SSH public key to their CoreWeave profile. Slurm provisions the user account from group membership, but the login node only permits SSH from a key that is attached to the user’s profile. Ask the user to add a key on their Profile page under Update Slurm attributes, then retry the SSH connection.
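To see why the login node rejects the connection, a verbose SSH attempt can help. The hostname below is a placeholder for your cluster's login endpoint:

```shell
# -v prints which keys the client offers and how the server responds,
# which shows whether a public key was offered at all.
ssh -v <username>@<login-endpoint>
```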
The SunkCluster is stuck with Ready: False
Check the conditions on the resource for the first dependent condition that is not True. The reason and message on each condition identify the underlying problem. A common cause is insufficient quota holding NodePoolsAvailable in reason: InProgress.
If SlurmClusterAvailable is the blocker, inspect the SlurmCluster directly to see which subcomponent is pending:
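Assuming the SlurmCluster shares the SunkCluster's name and namespace, something like:

```shell
# Inspect the SlurmCluster's conditions to find the pending subcomponent.
kubectl describe slurmcluster <cluster-name> -n tenant-slurm
```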
Next steps
After your cluster is running, use the following resources to configure and manage it:

- SunkCluster reference for every supported field.
- SUNK and Slurm versions for supported version combinations.
- Connect to a Slurm login node to start running jobs.
- Configure individual login pods to tune login resources.
- Enable GPU straggler detection for long training runs.