# Create a SUNK cluster

> Create a managed self-service SUNK cluster on CKS using the Cloud Console or a SunkCluster custom resource

CoreWeave SUNK provides a managed, self-service path for running Slurm on Kubernetes. You declare a cluster once, and an operator provisions the compute, the Slurm control plane, login pods, shared storage, and user access for you. By the end of this guide, you have a running SUNK cluster that your users can connect to over SSH and submit Slurm jobs against.

You can create a managed, self-service SUNK cluster in two ways:

* **Cloud Console**: a guided form with a live YAML preview. The Console performs additional setup for you, including creating a CKS cluster if you don't already have one, enabling the SCIM API for user provisioning, and creating an IAM group.
* **`SunkCluster` custom resource (CR)**: apply a YAML manifest directly to your CKS cluster. This path requires extra, manual setup steps to provision user access.

Both paths submit the same `SunkCluster` resource. The operator reconciles it into the underlying `NodePools`, `NodeSets`, and `SlurmCluster` resources that make up a running cluster.

## Choose Console or `SunkCluster` CR

Before you begin, decide which path fits your workflow. The Console is the fastest way to a working cluster. The `SunkCluster` CR path gives you direct control over the manifest and fits GitOps workflows.

On the Console path, the Console performs the additional setup steps for you when you submit the form.

If you choose the `SunkCluster` CR path, you must enable the SCIM API, create the SCIM token and its Kubernetes secret, and create the IAM group yourself. See [SUNK user provisioning](/products/sunk/manage_sunk/manage_cluster_access/sunk_user_provisioning) and [nsscache reference](/products/sunk/reference/nsscache_reference) for the user-provisioning building blocks.

The following table shows whether each setup step is automatic or manual on each path:

| Setup step                                                                               | Cloud Console | `SunkCluster` CR     |
| ---------------------------------------------------------------------------------------- | ------------- | -------------------- |
| Create a CKS cluster (when you don't already have one)                                   | Automatic     | Manual               |
| Create `NodePools` for compute and control plane                                         | Automatic     | Automatic (operator) |
| Enable the SCIM API, and create a SCIM token and Kubernetes secret for user provisioning | Automatic     | Manual               |
| Create the IAM group and assign membership                                               | Automatic     | Manual               |
| Apply the `SunkCluster` resource                                                         | Automatic     | Manual               |

## Prerequisites

Before you create a SUNK cluster, verify that you have:

* Sufficient compute and CPU [quota](/products/cks/clusters/quotas) for the nodes and the Slurm control plane.
* An SSH public key on each user's CoreWeave profile. Without an SSH public key, a user added to a Slurm group can't connect to the login pod over SSH, and receives a generic "permission denied" error when they attempt the connection. See [Connect to a Slurm login node](/products/sunk/access_sunk/connect-to-slurm-login-node).

For the `SunkCluster` CR path only, you also need:

* An existing CKS cluster you can apply manifests against. See [Create a CKS cluster](/products/cks/clusters/create).
* The `tenant-slurm` namespace created in that cluster. The operator expects to find the `SunkCluster` resource in this namespace.
* The SCIM API and SUNK User Provisioning enabled, and a SCIM token stored as a Kubernetes secret in the `tenant-slurm` namespace. Unlike the Console path, you create the token and secret yourself. Inside the secret, store the token under the key `nsscache-scim-auth-token`. See [SUNK user provisioning](/products/sunk/manage_sunk/manage_cluster_access/sunk_user_provisioning) and the [nsscache reference](/products/sunk/reference/nsscache_reference).
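
As a rough sketch of the last prerequisite, the secret could be created with `kubectl create secret generic`. Only the `tenant-slurm` namespace and the `nsscache-scim-auth-token` key come from this guide. The secret name `scim-token`, the file name `scim-token.txt`, and the helper name `create_scim_secret` are placeholders chosen for illustration, so confirm the secret name your cluster expects in the references linked above before you use it.

```bash
# Sketch only: the secret name and token file are hypothetical placeholders.
create_scim_secret() {
  local secret_name="$1"   # for example: scim-token (assumed name)
  local token_file="$2"    # file that contains only the SCIM token

  # --from-file=KEY=PATH stores the file contents, byte for byte, under KEY.
  kubectl create secret generic "$secret_name" \
    --namespace tenant-slurm \
    --from-file=nsscache-scim-auth-token="$token_file"
}

# Usage (save the token to scim-token.txt without a trailing newline first):
#   create_scim_secret scim-token scim-token.txt
```

Reading the token from a file keeps it out of the command's arguments. Because the file is stored verbatim, a trailing newline in the file would become part of the stored value.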

## Create the cluster

Select the appropriate tab to view the steps for your chosen method.

<Tabs>
  <Tab title="Cloud Console">
    <Steps>
      <Step title="Open the SUNK page">
        Sign in to the [CoreWeave Cloud Console](https://console.coreweave.com/sunk) and go to the **SUNK** page. Select **Create cluster**.

        The form is on the left and a live YAML preview is on the right. Anything you change in the form updates the YAML, and any field you add directly to the YAML is included when you submit.
      </Step>

      <Step title="Choose a CKS cluster">
        Select **Create new** to provision a new CKS cluster, or pick an existing one from the dropdown. If you don't have any existing CKS clusters, the existing-cluster option is unavailable.
      </Step>

      <Step title="Set compute capacity">
        Each entry in the compute capacity list becomes one managed `NodePool` and one `NodeSet` in your cluster. The operator keeps the `NodeSet` count in sync with the `NodePool` count, so changing **Nodes** later scales the cluster automatically. Counts are measured in Nodes, not GPUs.

        By default, the form populates with all of your available quota. Reduce the count for any node type you don't want to use, or remove the row entirely. Removed types remain available for future use, up to your quota.
      </Step>

      <Step title="Provision access">
        Confirm that your SSH public key is attached to your profile. The Console reads the key from your [Profile page](https://console.coreweave.com/account/settings) under **Update Slurm attributes**.

        Choose the groups that should have access to the cluster:

        * **Slurm users** receive standard access.
        * **Sudo users** can run privileged commands.

        Select **Default** to have the Console create groups named `slurm-users` and `sudo-users` for you. To grant a person access after the cluster is up, add them to one of the selected groups in **Administration** > **Groups**.

        <Note>
          A user must add their own SSH public key to their CoreWeave profile before they can connect to the login pod over SSH. Adding them to a group provisions the user account inside Slurm, but doesn't place a key on the login node.
        </Note>
      </Step>

      <Step title="Size the control plane and login pods">
        Choose CPU resources for the control plane. The control plane runs the Slurm control plane workloads and the login pods.

        SUNK provides two kinds of login pod: **per-user pods** (the `userPods` field) give each user in a group their own login pod, while a **shared group pod** (the `groupPod` field) is shared by all users in a group. You configure both on the `SunkCluster` resource under `spec.login`, and can enable either or both per group and size them independently. For the full list of user and group settings, see the [`SunkCluster` parameter reference](/products/sunk/reference/sunkcluster-reference#logingroup).

        If the form reports that your CPU quota is insufficient, you can:

        * Disable `userPods` for selected groups.
        * Disable `groupPod` for selected groups.
        * Reduce the number of users with login pods.
        * Reduce the per-pod login resources.
        * Request a quota increase.
      </Step>

      <Step title="Choose a public or private endpoint">
        The endpoint type controls how users reach the Slurm login pod from outside the cluster. A **public** endpoint exposes the login service on a routable IP so users can connect directly over SSH. A **private** endpoint has no public IP and requires a separate ingress mechanism such as Tailscale, which you configure by editing the YAML preview directly.
      </Step>

      <Step title="Configure shared storage">
        Set the size of the home directory mount and add additional mounts as needed. Shared storage is pooled across all users in the cluster, not per group. Users typically create their own subpaths under each mount.
      </Step>

      <Step title="Submit">
        Review the YAML preview, optionally copy it for source control, and select **Submit**. The Console creates the CKS cluster (if requested), provisions `NodePools`, enables the SCIM API, creates a SCIM token and Kubernetes secret for user provisioning, creates the IAM group, and submits the `SunkCluster` resource to the `tenant-slurm` namespace. The Console names the SCIM token for you and sets it to expire after 364 days.

        After submitting, the operator begins reconciling the cluster. See [Verify the cluster is ready](#verify-the-cluster-is-ready) to track progress.
      </Step>
    </Steps>
  </Tab>

  <Tab title="SunkCluster CR">
    <Steps>
      <Step title="Confirm the operator is installed">
        Check that the `SunkCluster` CRD is installed in your CKS cluster:

        ```bash theme={"system"}
        kubectl get crd sunkclusters.sunk.coreweave.com
        ```

        If the CRD is missing, the SUNK operator hasn't been deployed to your cluster. Contact your CoreWeave Solutions Architect.
      </Step>

      <Step title="Create the namespace">
        If you haven't already created the `tenant-slurm` namespace, create it now:

        ```bash theme={"system"}
        kubectl create namespace tenant-slurm
        ```

        The operator reconciles `SunkCluster` resources from this namespace.
      </Step>

      <Step title="Author the SunkCluster manifest">
        Save the following manifest as `sunkcluster.yaml`, replacing the bracketed placeholders. See [SUNK and Slurm versions](/products/sunk/reference/sunk-slurm-versions) for supported versions.

        ```yaml theme={"system"}
        apiVersion: sunk.coreweave.com/v1alpha1
        kind: SunkCluster
        metadata:
          name: [CLUSTER-NAME]
          namespace: tenant-slurm
        spec:
          sunkVersion: "[SUNK-VERSION]"
          slurmVersion: "[SLURM-VERSION]"
          ubuntuVersion: "24.04"

          nodes:
            - name: control-plane
              instanceType: cd-gp-a192-genoa
              count: 2
              controlPlane: true
            - name: gpu
              instanceType: gd-8xh100ib-i128
              count: 4
              controlPlane: false

          storage:
            homeDir:
              path: /mnt/home
              size: 1Ti
            additionalMounts:
              - path: /mnt/data
                size: 1Ti

          login:
            groups:
              - name: slurm-users
                # userPods and groupPod default to true
              - name: sudo-users
                userPods: false
                groupPod: true
            sudoGroups:
              - sudo-users
            access:
              annotations:
                example.com/annotation: "value"
            groupPods:
              resources:
                memory: 8Gi
                cpu: 4
            userPods:
              resources:
                memory: 8Gi
                cpu: 4
        ```

        Each entry in `spec.nodes` becomes one managed `NodePool` and one `NodeSet`. The `name` and `instanceType` fields are immutable after creation. To replace a node type, add a new entry and remove the old one. The `count` field is mutable and scales the underlying `NodePool`. Set `controlPlane: true` on entries that should host the Slurm control plane.

        The `login` block controls login pods: `userPods` gives each user in a group their own login pod, and `groupPod` creates a pod shared by the group. List sudo-enabled groups under `sudoGroups`. Login pods are defined entirely on the `SunkCluster` resource; there is no separate login chart to deploy. See the [`SunkCluster` parameter reference](/products/sunk/reference/sunkcluster-reference#logingroup) for defaults and all login fields.

        For the full list of fields, see [SunkCluster reference](/products/sunk/reference/sunkcluster-reference).
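
        Because `count` is mutable, one way to scale a node type outside of a GitOps flow is a JSON patch against the `SunkCluster` resource. The following is a minimal sketch rather than a documented procedure: the helper name `scale_sunk_nodes` is hypothetical, and the list index assumes the entry order in the example manifest above, where `gpu` is the second entry (index 1). The `test` operation makes the patch fail, instead of changing the wrong entry, if the entry at that index has a different name.

        ```bash
        # Sketch only: scale_sunk_nodes is a hypothetical helper, not a SUNK command.
        scale_sunk_nodes() {
          local cluster="$1" index="$2" name="$3" count="$4"
          local patch
          # First verify the entry name at the index, then replace its count.
          patch="$(printf '[{"op":"test","path":"/spec/nodes/%s/name","value":"%s"},{"op":"replace","path":"/spec/nodes/%s/count","value":%s}]' \
            "$index" "$name" "$index" "$count")"
          kubectl patch sunkcluster "$cluster" --namespace tenant-slurm \
            --type=json --patch "$patch"
        }

        # Usage: set the gpu entry (index 1 in the example manifest) to 6 nodes.
        #   scale_sunk_nodes [CLUSTER-NAME] 1 gpu 6
        ```

        If the manifest lives in source control, changing `count` there and re-applying it is usually the better choice, because the repository stays the source of truth.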
      </Step>

      <Step title="Apply the manifest">
        ```bash theme={"system"}
        kubectl apply -f sunkcluster.yaml
        ```
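
        Before you run the apply command for real, it can help to catch unreplaced placeholders and schema errors. The sketch below is illustrative: `check_manifest` is a hypothetical helper, and it assumes that leftover placeholders follow the `[UPPER-CASE]` pattern used in the example manifest. It relies on `kubectl apply --dry-run=server`, which asks the API server to validate the object without persisting it.

        ```bash
        # Sketch only: check_manifest is a hypothetical helper.
        check_manifest() {
          local manifest="$1"
          # Stop early if any [UPPER-CASE] placeholder is still present.
          if grep -nE '\[[A-Z][A-Z-]*\]' "$manifest"; then
            echo "Replace the placeholders listed above before applying." >&2
            return 1
          fi
          # Validate against the API server without creating anything.
          kubectl apply --dry-run=server -f "$manifest"
        }

        # Usage:
        #   check_manifest sunkcluster.yaml && kubectl apply -f sunkcluster.yaml
        ```

        The placeholder check matters because YAML reads an unquoted value such as `[CLUSTER-NAME]` as a list rather than a string, which can produce a confusing validation error.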

        For GitOps, commit the manifest and point your continuous-delivery tool at the `tenant-slurm` namespace. See [Manage deployments with CI and GitOps](/products/sunk/deploy_sunk/manage-deployment-with-ci).

        After applying, the operator begins reconciling the `SunkCluster` resource into a running cluster. See [Verify the cluster is ready](#verify-the-cluster-is-ready) to track progress.
      </Step>
    </Steps>
  </Tab>
</Tabs>

## Verify the cluster is ready

After you submit, the operator works asynchronously to provision the cluster. Use the following steps to confirm that the cluster has finished reconciling and is ready for users.

A new SUNK cluster typically takes around 40 minutes to come up. Node provisioning accounts for most of that time.

To check progress, inspect the `SunkCluster` resource and its status conditions:

```bash theme={"system"}
kubectl get sunkcluster -n tenant-slurm -o yaml
```

The cluster is ready when the aggregate `Ready` condition is `True`, which happens only after each of its dependent conditions is `True`.

While the cluster is coming up for the first time, the `Ready` condition reports `reason: Bootstrapping` and a `message` that lists the dependent conditions still pending (for example, `Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable`). Each dependent condition reports `reason: InProgress` while it is in progress, `reason: Ready` when satisfied, or `reason: Error` when it fails.

For full descriptions of each condition, the `reason` values, and the underlying `SlurmCluster` conditions that `SlurmClusterAvailable` aggregates, see the [SunkCluster reference](/products/sunk/reference/sunkcluster-reference#status-conditions).
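
Rather than re-running the command by hand, you could block until the aggregate condition turns `True`. This is a sketch under assumptions: `wait_for_sunk_ready` is a hypothetical helper name, `[CLUSTER-NAME]` stands for your cluster's name, and the 60-minute default is an arbitrary margin above the typical 40 minutes, not a documented limit. `kubectl wait --for=condition=Ready` watches the `Ready` entry in the resource's status conditions.

```bash
# Sketch only: wait_for_sunk_ready is a hypothetical helper.
wait_for_sunk_ready() {
  local cluster="$1"
  local timeout="${2:-60m}"   # arbitrary default, not a documented limit
  # Exits 0 once Ready is True, or non-zero when the timeout expires.
  kubectl wait "sunkcluster/$cluster" \
    --namespace tenant-slurm \
    --for=condition=Ready \
    --timeout="$timeout"
}

# Usage:
#   wait_for_sunk_ready [CLUSTER-NAME]        # wait up to 60 minutes
#   wait_for_sunk_ready [CLUSTER-NAME] 90m    # allow more time
```

A non-zero exit only means that the timeout expired. The status conditions on the resource still carry the actual reason.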

You can also inspect the underlying resources directly:

```bash theme={"system"}
kubectl get slurmcluster -n tenant-slurm
kubectl get nodepools
kubectl get nodesets -n tenant-slurm
```

Once the cluster is ready, see [Connect to a Slurm login node](/products/sunk/access_sunk/connect-to-slurm-login-node) to log in and submit jobs.

## Troubleshoot

The following sections describe common errors you may encounter while creating or verifying a SUNK cluster.

### `kubectl` returns `Error from server (NotFound)` for `SunkCluster`

The SUNK operator hasn't been deployed to the CKS cluster, so the `SunkCluster` CRD is missing. Verify with:

```bash theme={"system"}
kubectl get crd sunkclusters.sunk.coreweave.com
```

If the CRD doesn't exist, contact your CoreWeave Solutions Architect to enable the operator on your cluster.

### A user with group access gets `Permission denied` when they connect over SSH

The user might be in a group that hasn't been given access to the cluster. See [SUNK user provisioning](/products/sunk/manage_sunk/manage_cluster_access/sunk_user_provisioning) for details about user and group permissions.

If the user has been assigned to a group with cluster access and still can't connect to the login pod over SSH, verify that the user has added an SSH public key to their CoreWeave profile. Slurm provisions the user account from group membership, but the login node only permits SSH from a key that is attached to the user's profile. Ask the user to add a key on their [Profile page](https://console.coreweave.com/account/settings) under **Update Slurm attributes**, then retry the SSH connection.
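
To rule out a local key problem, the user can confirm which public key they have and then retry with verbose SSH output. The following sketch makes a few assumptions: `show_public_key` is a hypothetical helper, the default path assumes an Ed25519 key in the standard OpenSSH location, and `[USERNAME]` and `[LOGIN-ADDRESS]` are placeholders for the values described in the connection guide.

```bash
# Print the fingerprint and the public key text for a local key pair, so the
# user can compare the key with the one saved on their CoreWeave profile.
show_public_key() {
  local pub_key="${1:-$HOME/.ssh/id_ed25519.pub}"   # assumed default path
  if [ ! -f "$pub_key" ]; then
    echo "No public key at $pub_key; create one with: ssh-keygen -t ed25519" >&2
    return 1
  fi
  ssh-keygen -lf "$pub_key"   # fingerprint of the key
  cat "$pub_key"              # full public key text
}

# Then retry with verbose output to see which keys the client offers:
#   ssh -v -i ~/.ssh/id_ed25519 [USERNAME]@[LOGIN-ADDRESS]
```

If the key that the client offers doesn't match the key on the profile, update the profile or pass the matching private key with `-i`.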

### The `SunkCluster` is stuck with `Ready: False`

Check the [conditions](/products/sunk/reference/sunkcluster-reference#status-conditions) on the resource for the first dependent condition that isn't `True`. The `reason` and `message` on each condition identify the underlying problem. A common cause is insufficient quota holding `NodePoolsAvailable` in `reason: InProgress`.
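
One way to surface the blocking conditions without reading the full YAML is to print only the conditions that aren't `True`. This is a sketch, not an official tool: both helper names are hypothetical, and the JSONPath expression assumes that the conditions are stored under `.status.conditions` with `type`, `status`, `reason`, and `message` fields, which is the usual Kubernetes convention.

```bash
# Print every status condition that is not True as "TYPE: REASON - MESSAGE".
# Reads tab-separated TYPE, STATUS, REASON, MESSAGE lines on standard input.
pending_conditions() {
  awk -F '\t' '$2 != "True" { print $1 ": " $3 " - " $4 }'
}

# Emit one tab-separated line per condition (assumes .status.conditions).
list_sunk_conditions() {
  kubectl get sunkcluster "$1" --namespace tenant-slurm \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}

# Usage:
#   list_sunk_conditions [CLUSTER-NAME] | pending_conditions
```

Keeping the filter separate from the `kubectl` call means the same filter also works on saved output, for example from a support bundle.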

If `SlurmClusterAvailable` is the blocker, inspect the `SlurmCluster` directly to see which subcomponent is pending:

```bash theme={"system"}
kubectl get slurmcluster -n tenant-slurm -o yaml
```

## Next steps

After your cluster is running, use the following resources to configure and manage it:

* [SunkCluster reference](/products/sunk/reference/sunkcluster-reference) for every supported field.
* [SUNK and Slurm versions](/products/sunk/reference/sunk-slurm-versions) for supported version combinations.
* [Connect to a Slurm login node](/products/sunk/access_sunk/connect-to-slurm-login-node) to start running jobs.
* [`SunkCluster` parameter reference: login settings](/products/sunk/reference/sunkcluster-reference#logingroup) to tune login pods and resources.
* [Enable GPU straggler detection](/products/sunk/optimize_workloads/enable-gpu-straggler-detection) for long training runs.