The SunkCluster custom resource declares the desired state of a managed SUNK cluster. The SUNK operator reconciles a SunkCluster into the underlying NodePools, NodeSets, and SlurmCluster resources that make up a running cluster. This page documents every supported field in the SunkCluster spec, along with the status conditions reported on the resource. To learn how to apply a SunkCluster, see Create a SUNK cluster.

Resource definition

| Field | Value |
| --- | --- |
| apiVersion | sunk.coreweave.com/v1alpha1 |
| kind | SunkCluster |
| namespace | tenant-slurm |

Top-level spec fields

The spec object configures the SUNK cluster. The following fields are supported:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| sunkVersion | string | Yes | None | SUNK release version. See SUNK and Slurm versions for supported values. |
| slurmVersion | string | Yes | None | Slurm version. Must match the version paired with sunkVersion in the version mapping table. |
| cudaVersion | string | No | 13.0 | CUDA version installed on compute nodes. Allowed values: 13.0, 13.1. |
| ubuntuVersion | string | No | 24.04 | Ubuntu base image version. Allowed values: 22.04, 24.04. |
| nodes | list of NodeSpec | No | None | Node configurations. Each entry becomes one NodePool and one NodeSet. A SunkCluster with no nodes won't provision a control plane or compute nodes and won't become Ready. |
| storage | StorageConfig | No | See StorageConfig | Home directory and additional shared storage mounts. |
| login | LoginConfig | No | None | Resources and access configuration for login Pods. |
| s6 | list of S6 | No | None | s6 service scripts to run on node initialization. |
| slurmConfig | map of string to string | No | None | Custom key-value pairs passed through to the Slurm configuration. |
| scheduler | SchedulerConfig | No | {enabled: false} | Scheduler configuration. |
| nvidiaDevicePlugin | NvidiaDevicePluginConfig | No | {enabled: true} | NVIDIA device plugin configuration. |
| certManager | CertManagerConfig | No | {enabled: true} | cert-manager deployment configuration. |
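
The slurmConfig map is not shown in the example manifest at the end of this page. As a minimal sketch of how key-value pairs are passed through to the Slurm configuration, the keys and values below are illustrative rather than recommended settings:
  slurmConfig:
    MaxJobCount: "50000"
    PriorityType: "priority/multifactor"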

NodeSpec

Each entry in spec.nodes defines one node group. The operator creates one NodePool and one NodeSet for each entry, and the count is kept in sync between them.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | None | Unique node-group name used as the NodePool resource name. Must be 1-63 characters and match ^[a-z0-9]([a-z0-9-]*[a-z0-9])?$. Immutable. |
| instanceType | string | Yes | None | CoreWeave instance type. See Allowed instance types for the full list. Immutable after creation. |
| count | integer | Yes | None | Number of nodes of this type. Minimum 0. Mutable; changing this value scales the underlying NodePool. |
| controlPlane | boolean | No | false | When true, this node group hosts the Slurm control plane. |
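
Because count is mutable while name and instanceType are immutable, an existing node group is scaled by changing count in place. As a sketch, assuming the SunkCluster CRD is addressable to kubectl as sunkcluster and that the group to scale is the second entry in spec.nodes (index 1), a JSON patch adjusts the count; the index and target value are illustrative:
kubectl patch sunkcluster [CLUSTER-NAME] -n tenant-slurm --type=json \
  -p '[{"op": "replace", "path": "/spec/nodes/1/count", "value": 8}]'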

Allowed instance types

The instanceType field uses the Instance ID listed on the available instances page. The field accepts one of the following values:
  • epyc
  • cd-hp-a96-genoa
  • cd-gp-a192-genoa
  • cd-hc-a384-genoa
  • turin-gp
  • turin-gp-l
  • cd-gp-i64-erapids
  • h100
  • gd-8xh100ib-i128
  • h200
  • gd-8xh200ib-i128
  • a100
  • gd-8xa100-i128
  • b200-8x
  • gb200
  • gb200-4x
  • gd-1xgh200
  • gd-8xl40-i128
  • gd-8xl40s-i128
  • rtxp6000-8x
  • gb300-4x
  • gb300-4x-e

StorageConfig

The spec.storage object configures shared storage for the cluster.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| homeDir | VolumeSpec | No | {path: /mnt/home, size: 2Ti} | Home directory storage configuration. |
| additionalMounts | list of VolumeSpec | No | None | Additional shared storage mounts. |

VolumeSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| path | string | Yes | None | Mount path inside the pod (for example, /mnt/data). |
| size | Kubernetes Quantity | Yes | None | Storage size as a Kubernetes quantity (for example, 2Ti, 100Gi). |

LoginConfig

The spec.login object configures the login Pods, the groups whose members can access them, and the resources allocated to per-user and per-group Pods. Each user with access receives an individual login Pod, and each access group receives one shared login Pod.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| groups | list of LoginGroup | No | None | Groups whose members can access the cluster. Each group name is also propagated to nsscache for POSIX user and group resolution. |
| sudoGroups | list of strings | No | None | Groups whose members receive sudo access on login Pods. |
| access | AccessConfig | No | None | Access annotations applied to login Pods. |
| userPods | LoginPodConfig | No | None | Configuration applied to per-user login Pods. |
| groupPods | LoginPodConfig | No | None | Configuration applied to per-group login Pods. |

LoginGroup

Each entry in login.groups configures login-Pod creation for a single group. The list uses name as a merge key.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | None | Group name to match in the nsscache group secret. Must be at least 1 character. |
| userPods | boolean | No | true | When true, individual per-user login Pods are created for members of this group. |
| groupPod | boolean | No | true | When true, a shared per-group login Pod is created for this group. Note the singular field name (groupPod) at this level. |

LoginPodConfig

The login.userPods and login.groupPods objects each accept a LoginPodConfig.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| resources | ResourceConfig | No | None | CPU and memory requests applied to this Pod type. |

ResourceConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| cpu | Kubernetes Quantity | No | None | CPU allocation (for example, 8). |
| memory | Kubernetes Quantity | No | None | Memory allocation (for example, 32Gi). |

AccessConfig

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| annotations | map of string to string | No | None | Additional annotations applied to login Pods for access control. |

Annotations may resemble the following:
  login:
    access:
      annotations:
        service.beta.kubernetes.io/external-hostname: sunk.<org-id>-<cluster-name>.coreweave.app
        service.beta.kubernetes.io/coreweave-load-balancer-ip-families: ipv4
        service.beta.kubernetes.io/coreweave-load-balancer-type: public

S6

Each entry in spec.s6 defines an s6 service script that runs on the targeted node types during initialization.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | Yes | None | Unique name for the script. |
| type | string | Yes | None | Script type. Allowed values: oneshot, longrun. |
| targets | list of strings | Yes | None | Node types the script runs on. Allowed values: login, compute. Must contain at least one entry. |
| script | string | No | None | Inline script content to execute. See validation rules below. |
| packages | list of strings | No | None | Packages to install. Each element may contain one or more whitespace-separated package names. Only valid with type: oneshot. |
| depends | list of strings | No | None | Names of other s6 scripts that must run before this one. |
| timeoutUp | integer | No | None | Startup timeout in milliseconds. Required for oneshot script entries. For longrun, set this or timeoutDown. |
| timeoutDown | integer | No | None | Shutdown timeout in milliseconds. Only valid for longrun entries. |

The following validation rules apply:
  • Each entry must set either script or packages, not both.
  • The packages field is only valid when type is oneshot.
  • For oneshot script entries, timeoutUp is required. For package installs, the operator applies an internal minimum timeout policy based on package count.
  • For longrun entries, set at least one of timeoutUp or timeoutDown.
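
The example manifest at the end of this page does not include s6 entries. The following sketch shows one oneshot package install and one longrun service that depends on it; the package names, script body, and timeout value are illustrative only:
  s6:
    - name: install-tools
      type: oneshot
      targets: ["login", "compute"]
      # Package install: no script, and the operator applies its minimum timeout policy.
      packages:
        - "htop tmux"
    - name: metrics-agent
      type: longrun
      targets: ["compute"]
      depends: ["install-tools"]
      timeoutUp: 30000
      script: |
        #!/bin/bash
        # Illustrative long-running process; replace with a real service command.
        exec /usr/local/bin/metrics-agent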

SchedulerConfig

The spec.scheduler object enables the SUNK scheduler.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | boolean | No | false | When true, the scheduler is enabled. |

NvidiaDevicePluginConfig

The spec.nvidiaDevicePlugin object configures the NVIDIA device plugin DaemonSet.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | boolean | No | true | When true, the NVIDIA device plugin is deployed. |

CertManagerConfig

The spec.certManager object configures the cert-manager deployment.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | boolean | No | true | When true, cert-manager is deployed. |
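
All three toggle objects share the same single-field shape. For example, to enable the SUNK scheduler while stating the other two defaults explicitly:
  scheduler:
    enabled: true
  nvidiaDevicePlugin:
    enabled: true
  certManager:
    enabled: true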

Status conditions

The operator reports the cluster state through a standardized set of status.conditions. Each status condition includes a reason and a message that further describes the condition. The aggregate Ready condition is True when all dependent conditions are True.

| Condition | Meaning |
| --- | --- |
| Ready | Aggregate condition. True when every other condition is True. |
| NodePoolsAvailable | All requested NodePools have reached their target node count. |
| NodeSetsAvailable | All NodeSets have ready Pods. |
| SlurmClusterAvailable | Aggregate of the managed SlurmCluster's own conditions. See SlurmCluster status conditions. |
| ControllerManagerAvailable | The SUNK controller manager Deployment is ready. |
| CertManagerAvailable | All cert-manager Deployments are ready, or cert-manager is disabled. |
| NvidiaDevicePluginAvailable | The NVIDIA device plugin DaemonSet is ready, or the device plugin is disabled. |

The status also reports lastReadyTime, which records the most recent time the Ready condition transitioned to True. This field distinguishes a cluster that is bootstrapping (never ready) from one that was previously ready and has regressed. The reason attached to each condition contains one of the following values:

| Reason | Meaning |
| --- | --- |
| Ready | The described condition has been satisfied. |
| InProgress | The described condition is in progress but not yet complete. |
| Bootstrapping | Set on the aggregate Ready condition while the cluster is coming up for the first time. |
| Error | The described condition has encountered an error. |

The message attached to a condition contains more detail. For dependent conditions, the message typically names the specific resources that are not yet ready (for example, NodePools not at target: [a192, gb200] or NodeSets not ready: [a192 (0/36 ready)]). For the aggregate Ready condition, the message lists the dependent conditions still pending (for example, Waiting for conditions: NodePoolsAvailable, NodeSetsAvailable, SlurmClusterAvailable).
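
To inspect these conditions directly, or to block until the cluster reports Ready, the standard kubectl patterns apply; this assumes the SunkCluster CRD is addressable as sunkcluster, analogous to the slurmcluster command shown in the next section:
kubectl get sunkcluster [CLUSTER-NAME] -n tenant-slurm -o jsonpath='{.status.conditions}'
kubectl wait sunkcluster/[CLUSTER-NAME] -n tenant-slurm --for=condition=Ready --timeout=30m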

SlurmCluster status conditions

The SlurmClusterAvailable condition on the SunkCluster aggregates the conditions reported by the underlying SlurmCluster resource. When SlurmClusterAvailable is not True, inspect the SlurmCluster itself to identify which subcomponent is pending or failing:
kubectl get slurmcluster -n tenant-slurm -o yaml
The SlurmCluster reports the following conditions, each of which must be True for SlurmClusterAvailable to aggregate to True:

| Condition | Meaning |
| --- | --- |
| Ready | Aggregate condition for the SlurmCluster. True when every other SlurmCluster condition is True. |
| SlurmctldAvailable | The Slurm controller (slurmctld) Deployment is ready. |
| LoginAvailable | The login Pods for each configured group are ready, or no login workloads are configured. |
| AccountingAvailable | The Slurm accounting (slurmdbd) workloads are ready. |
| DatabaseAvailable | The CWDBCluster backing Slurm accounting is ready. |
| SchedulerAvailable | The SUNK scheduler workloads are ready. |
| SyncerAvailable | The Slurm syncer workloads are ready. |
| NsscacheAvailable | The nsscache workloads that resolve POSIX users and groups are ready. |
| RestdAvailable | The Slurm REST daemon (slurmrestd) is ready, or slurmrestd is disabled. |
| CleanupCompletingAvailable | The cleanup-completing workload is ready, or the workload is disabled. |

Example manifest

Use this manifest as a starting point and adjust the field values to match your cluster requirements. The following manifest creates a cluster with two control-plane nodes and four H100 GPU compute nodes, a 1 Ti home directory, an additional 1 Ti shared mount, and standard user and sudo groups.
apiVersion: sunk.coreweave.com/v1alpha1
kind: SunkCluster
metadata:
  name: [CLUSTER-NAME]
  namespace: tenant-slurm
spec:
  sunkVersion: "[SUNK-VERSION]"
  slurmVersion: "[SLURM-VERSION]"
  ubuntuVersion: "24.04"

  nodes:
    - name: control-plane
      instanceType: cd-gp-a192-genoa
      count: 2
      controlPlane: true
    - name: gpu
      instanceType: gd-8xh100ib-i128
      count: 4
      controlPlane: false

  storage:
    homeDir:
      path: /mnt/home
      size: 1Ti
    additionalMounts:
      - path: /mnt/data
        size: 1Ti

  login:
    groups:
      - name: slurm-users
        # userPods and groupPod default to true
      - name: sudo-users
        userPods: false
        groupPod: true
    sudoGroups:
      - sudo-users
    access:
      annotations:
        example.com/annotation: "value"
    groupPods:
      resources:
        memory: 8Gi
        cpu: 4
    userPods:
      resources:
        memory: 8Gi
        cpu: 4
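
After replacing the placeholders, apply the manifest and watch the status conditions until Ready is True; the file name here is arbitrary:
kubectl apply -f sunkcluster.yaml
kubectl get sunkcluster -n tenant-slurm -w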
