Version: 0.1.0 Type: application

Requirements

Repository | Name | Version
file://../library | library | 0.1.0
file://../slurm-login | slurm-login | 0.1.0
oci://registry-1.docker.io/bitnamicharts | mysql | 9.19.1

Parameters

Key & Description | Type | Default
accounting.annotations
Additional annotations for accounting resources.
object
{}
accounting.config.ArchiveEvents
string
"yes"
accounting.config.ArchiveJobs
string
"yes"
accounting.config.ArchiveResvs
string
"yes"
accounting.config.ArchiveSteps
string
"no"
accounting.config.ArchiveSuspend
string
"no"
accounting.config.ArchiveTXN
string
"no"
accounting.config.ArchiveUsage
string
"no"
accounting.config.AuthAltParameters[0]
string
"jwt_key=/etc/jwt/jwt.key"
accounting.config.AuthAltTypes
string
"auth/jwt"
accounting.config.AuthType
string
"auth/munge"
accounting.config.DbdPort
int
6819
accounting.config.DebugLevel
string
"verbose"
accounting.config.LogFile
string
"/dev/null"
accounting.config.PidFile
string
"/var/run/slurmdbd.pid"
accounting.config.PurgeEventAfter
string
"1month"
accounting.config.PurgeJobAfter
string
"12month"
accounting.config.PurgeResvAfter
string
"1month"
accounting.config.PurgeStepAfter
string
"1month"
accounting.config.PurgeSuspendAfter
string
"1month"
accounting.config.PurgeTXNAfter
string
"12month"
accounting.config.PurgeUsageAfter
string
"24month"
accounting.config.SlurmUser
string
"slurm"
accounting.config.StoragePort
int
3306
accounting.config.StorageType
string
"accounting_storage/mysql"
accounting.enabled
Enable Slurm accounting (slurmdbd).
bool
true
accounting.external.enabled
Use an external accounting instance instead of deploying an internal one. This configuration also requires the underlying database for slurmdbd to be managed externally.
bool
false
accounting.external.host
The host of the external accounting instance: IP or hostname.
string
null
accounting.external.port
The port of the external accounting instance.
string
null
accounting.external.user
The user to use to authenticate to the external accounting instance.
string
null
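As a rough sketch of the accounting.external settings above, the values below point the chart at an accounting instance managed outside this release. The host name and user are placeholders rather than values from the chart. The port is quoted because the parameter is typed as a string, and 6819 simply mirrors the DbdPort default listed earlier.

```yaml
# Hypothetical values: use an externally managed slurmdbd instance.
accounting:
  external:
    enabled: true
    host: slurmdbd.example.internal   # placeholder: IP or hostname
    port: "6819"                      # string-typed; mirrors the DbdPort default
    user: slurm                       # placeholder user name
```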
accounting.externalDB.enabled
Configure Slurm Accounting (slurmdbd) with an external database.
bool
false
accounting.externalDB.existingSecret
Specify the name of the Kubernetes Secret that contains the password used by slurmdbd to access the Slurm accounting database. This Secret must contain a data key named db-password whose value is the actual database password. Important: The password value stored in the Secret cannot contain the hash (#) character.
string
null
accounting.externalDB.storageHost
The hostname of the server where the database resides.
string
null
accounting.externalDB.storageLoc
The name of the database used to store Slurm accounting records. Defaults to “slurm_acct_db”.
string
"slurm_acct_db"
accounting.externalDB.storageUser
The username slurmdbd uses for authentication and storing job accounting data.
string
null
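The accounting.externalDB settings above might be combined as in the following sketch. The Secret name, host name, and user name are placeholders invented for illustration; only the db-password key and the slurm_acct_db default come from the parameter descriptions.

```yaml
# Kubernetes Secret read by slurmdbd (the name is hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: slurmdbd-db-credentials
type: Opaque
stringData:
  db-password: "replace-me"  # placeholder; the real value must not contain '#'
---
# Helm values that reference the Secret above.
accounting:
  externalDB:
    enabled: true
    existingSecret: slurmdbd-db-credentials
    storageHost: mysql.example.internal   # placeholder host name
    storageLoc: slurm_acct_db             # documented default
    storageUser: slurm                    # placeholder user name
```

The first YAML document is a Kubernetes manifest and the second is a Helm values fragment, so they belong in separate files when applied.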
accounting.image
The image to use for slurmdbd deployment.
object
name: controller
repository:
tag:

accounting.labels
Additional labels for accounting resources.
object
{}
accounting.livenessProbe
The liveness probe for the slurmdbd container.
object
exec:
    command:
        - sacctmgr
        - ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1
timeoutSeconds: 60

accounting.priorityClassName
The priority class name for the accounting pod.
string
"sunk-control-plane"
accounting.readinessProbe
The readiness probe for the slurmdbd container.
object
exec:
    command:
        - sacctmgr
        - ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1
timeoutSeconds: 60

accounting.replicas
The number of replicas of the accounting instance to run.
int
1
accounting.resources
Resources for the accounting container.
object
limits:
    memory: 64Gi
requests:
    cpu: 16
    memory: 64Gi

accounting.securityContext.runAsGroup
The group to run as, must match the slurm GID from the container image.
int
401
accounting.securityContext.runAsUser
The user to run as, must match the slurm UID from the container image.
int
401
accounting.startupProbe
The startup probe for the slurmdbd container.
object
null
accounting.terminationGracePeriodSeconds
The termination grace period for the accounting pod.
int
30
accounting.useExistingSecret
Use an existing secret for the accounting instance instead of creating one.
The secret name is the same as mysql.auth.existingSecret.
bool
false
accounting.volumeMounts
Additional volume mounts to apply to the accounting pod.
list
[]
accounting.volumes
Additional volumes to mount to the accounting pod.
list
[]
cleanupCompleting.annotations
Additional annotations for cleanup-completing Job resources.
object
{}
cleanupCompleting.cronJobSchedule
The schedule for the cleanup-completing CronJob. It should be formatted according to the cron convention. Default runs every minute.
string
"* * * * *"
cleanupCompleting.deleteInvalidNodes
Enable deletion of nodes that are in INVALID_REG state after downing them. This allows nodes to cleanly re-register with Slurm.
bool
true
cleanupCompleting.dryRun
Enable dry run mode - shows what would be done without actually downing nodes.
bool
false
cleanupCompleting.enabled
Enable cleanup of nodes with jobs stuck in COMPLETING state.
bool
true
cleanupCompleting.labels
Additional labels for cleanup-completing Job resources.
object
{}
cleanupCompleting.nodeSelector.affinity
The affinity for the cleanup-completing Job. This overrides the value of global.nodeSelector.affinity.
object
null
cleanupCompleting.priorityClassName
The priority class name for the cleanup-completing Job pod.
string
"sunk-control-plane"
cleanupCompleting.resources
Resources for the cleanup-completing Job container.
object
limits:
    memory: 256Mi
requests:
    cpu: 100m
    memory: 64Mi

cleanupCompleting.timeoutSeconds
Timeout in seconds for jobs in COMPLETING state before downing nodes. Jobs that have been completing longer than this threshold will trigger node downing (if no other jobs are present on the node). If not specified, defaults to 2x KillWait from slurm.conf.
int
null
cleanupCompleting.tolerations
The tolerations for the cleanup-completing Job
list
null
cleanupCompleting.verbose
Enable verbose logging for debugging.
bool
true
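One cautious way to roll out the cleanupCompleting settings above is to start in dry-run mode with an explicit timeout. The schedule and timeout below are illustrative choices, not recommendations from the chart.

```yaml
# Hypothetical values: observe what the cleanup Job would do before letting it act.
cleanupCompleting:
  enabled: true
  cronJobSchedule: "*/5 * * * *"   # every five minutes instead of every minute
  dryRun: true                     # only report; do not down nodes
  timeoutSeconds: 600              # illustrative; if unset, 2x KillWait is used
  deleteInvalidNodes: true
  verbose: true
```

Once the logged actions look correct, setting dryRun back to false lets the Job down nodes.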
compute.annotations
Additional annotations for compute services only. Use compute.nodes.custom-definition.annotations to add annotations to specific node definitions instead.
object
{}
compute.autoPartition.config
The following are intended for the customer to update. These values will be applied to each auto-generated partition. The partition name will be the same as the node definition name.
config:
  OverSubscribe: "YES"
  MaxTime: "12:00:00"
  QoS: "NORMAL"
object
null
compute.autoPartition.enabled
Enable the auto partition.
bool
true
compute.cacheDropper.enabled
An option to enable or disable the cache-dropper sidecar container across all slurmd pods.
bool
true
compute.cacheDropper.resources
Resources for the cache-dropper sidecar container.
object
limits:
    memory: 32Mi
requests:
    cpu: 500m
    memory: 32Mi

compute.epilogConfigMap
The name, or a list of names, of the ConfigMaps containing epilog scripts.
string | list
[]
compute.externalClusterName
The name of an external cluster to join.
This is used when the control plane is deployed separately.
string
null
compute.generateTopology
Enable topology generation for the compute nodes in the cluster.
bool
true
compute.gpusd
Configuration for GPUSD (GPU Straggler Detection) metrics collection.
object
See individual settings below.
compute.gpusd.enabled
Enable GPUSD package installation, metrics collection, and VMPodScrape resource.
bool
false
compute.gpusd.version
GPUSD version to install.
string
"1.0.0"
compute.initialState
The initial state for the nodes when they join the slurm cluster.
This is generally drain or idle. May also be set per node definition.
string
"idle"
compute.initialStateReason
The reason for setting the initial state of the nodes to down, drained, or fail.
May also be set per node definition.
string
"Node added to the cluster for the first time"
compute.labels
Additional labels for compute services only. Use compute.nodes.custom-definition.labels to add labels to specific node definitions instead.
object
{}
compute.livenessProbe
The liveness probe for the compute slurmd container.
object
{}
compute.maxUnavailable
The maximum number of compute nodes that can be unavailable during a rolling update.
Can be a percentage or a number.
string
"10%"
compute.nodes
Multiple node definitions can be declared, but only one may be enabled: true. Node definitions can reference other definitions to include or overlay values. See the example below or the Compute Node Definitions documentation for more details.
compute:
  nodes:
    # A custom definition to be referenced by other nodes
    custom-dns:
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 127.0.0.1
    # A simple CPU-only Node that uses the custom-dns definition above
    simple-cpu:
      enabled: true
      # Partition creation for the nodeset. The nodeset will still be assigned to the `all`
      # partition if creation is disabled
      createPartition: true
      # The following Partition configurations are for the customer to update.
      # These values will be applied to the autogenerated partition and will override the defaults.
      partitionConfig:
        OverSubscribe: "YES"
        MaxTime: "12:00:00"
        QoS: "NORMAL"
      replicas: 1
      definitions:
        # Use the custom-dns definition
        - custom-dns
      staticFeatures:
        - cpu
      dynamicFeatures:
        node.coreweave.cloud/class: {}
      image:
        name: controller-extras
      gresGpu: null
      config:
        weight: 1
      # Create a small node with 1 CPU and 1Gi of memory
      resources:
        limits:
          memory: 1Gi
          cpu: 1
        requests:
          memory: 1Gi
          cpu: 1
      tolerations:
        - key: is_cpu_compute
          operator: Exists
      volumeMounts:
        - name: ramtmp
          mountPath: /tmp
      volumes:
        - name: ramtmp
          emptyDir:
            medium: Memory
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
object
See Compute Node Definitions.
compute.partitionBaseConfig
Default configuration for partitions in the cluster. These values can be overridden per partition in the autoPartition section.
partitionBaseConfig:
  MaxTime: "INFINITE"
  State: "UP"
object
{
  "MaxTime": "INFINITE",
  "State": "UP"
}
compute.partitions
Partitions to add to the cluster. The key is the partition name and the value is the partition configuration.
object
all: Nodes=ALL Default=YES
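To illustrate the key/value form described for compute.partitions, the sketch below keeps the default all partition and adds a second one. The debug partition name and its one-hour limit are hypothetical; Nodes and MaxTime are standard Slurm partition options.

```yaml
# Hypothetical values: the key is the partition name, the value is its configuration.
compute:
  partitions:
    all: Nodes=ALL Default=YES
    debug: Nodes=ALL MaxTime=01:00:00   # illustrative short-job partition
```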

compute.plugstackConfig
Additional plugin stack configuration entries for the plugstack.conf file. Configuration options: https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION
list
[]
compute.ports
Additional ports to expose on the compute nodes.
ports:
  - containerPort: 10400
    protocol: TCP
  - containerPort: 10401
    protocol: TCP
  - containerPort: 10402
    protocol: TCP
  - containerPort: 10403
    protocol: TCP
  - containerPort: 10404
    protocol: TCP
  - containerPort: 10405
    protocol: TCP
  - containerPort: 10406
    protocol: TCP
  - containerPort: 10407
    protocol: TCP
list
[]
compute.prologConfigMap
The name, or a list of names, of the ConfigMaps containing prolog scripts.
string | list
[]
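Because compute.prologConfigMap and compute.epilogConfigMap are typed string | list, either form should be accepted. The sketch below shows both; the ConfigMap names are hypothetical and must refer to ConfigMaps that already exist in the release namespace.

```yaml
# Hypothetical values: list form for prolog scripts, string form for epilog scripts.
compute:
  prologConfigMap:
    - slurm-prolog-checks     # placeholder ConfigMap name
    - slurm-prolog-mounts     # placeholder ConfigMap name
  epilogConfigMap: slurm-epilog-cleanup   # placeholder ConfigMap name
```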
compute.pyxis.appArmorProfile
The AppArmor profile to use for the pyxis container.
string
"localhost/enroot"
compute.pyxis.enabled
Enable the pyxis container.
bool
true
compute.pyxis.enrootConfig
Configuration options for enroot.
object
ENROOT_RUNTIME_PATH: /run/enroot/user-$(id -u)
ENROOT_CACHE_PATH: /opt/sunk/tmp/enroot-cache/user-$(id -u)
ENROOT_DATA_PATH: /opt/sunk/tmp/enroot-data/user-$(id -u)
# Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory.
ENROOT_MOUNT_HOME: y
# Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user.
ENROOT_REMAP_ROOT: y
ENROOT_RESTRICT_DEV: n
ENROOT_ROOTFS_WRITABLE: y

compute.pyxis.plugstackOptions
Additional arguments for the pyxis plugin entry in the plugstack.conf file. Configuration options: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
list
[
  "container_scope=global"
]
compute.pyxis.podSecurityContext
Security context for the pyxis container.
object
{
  "seccompProfile": {
    "localhostProfile": "profiles/enroot",
    "type": "Localhost"
  }
}
compute.pyxis.podSecurityContext.seccompProfile
The seccomp profile to use for the pyxis container.
object
{
  "localhostProfile": "profiles/enroot",
  "type": "Localhost"
}
compute.readinessProbe
The readiness probe for the compute slurmd container.
object
exec:
    command:
        - scontrol
        - show
        - slurmd
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5

compute.reservedMemory
The amount of memory to reserve when calculating the DefMemPerCPU setting for slurm.conf.
string
"4Gi"
compute.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information.
s6:
  packages:
    type: oneshot
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      apt -y update
      apt -y install nginx
  nginx:
    type: longrun
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      nginx -g "daemon off;"
object
{}
compute.securityContext.capabilities.add
Add capabilities to the slurmd container.
“SYS_ADMIN” is required if using pyxis.
compute:
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]
list
[
  "SYS_NICE",
  "SYS_ADMIN",
  "SYS_PTRACE",
  "SYSLOG"
]
compute.ssh.enabled
Enable ssh to the compute nodes.
bool
true
compute.startupProbe
The startup probe for the compute slurmd container.
object
{}
compute.volumeMounts
Additional volume mounts to add to all compute pods. These are also added to login pods.
list
[]
compute.volumes
Additional volumes to mount to all compute pods. These are also added to login pods.
list
[]
controlPlane.enabled
Enable the Slurm control plane.
This should be enabled unless the deployment is split.
bool
true
controller.annotations
Additional annotations for controller resources.
object
{}
controller.enabled
Enable the controller deployment.
This should be enabled unless a more complicated deployment is required (for example, splitting the deployment).
bool
true
controller.etcConfigMap
The ConfigMap(s) with keys mapping to files in /etc/slurm on the controller only.
This ConfigMap must not contain:
  • slurm.conf
  • plugstack.conf
  • gres.conf
  • cgroup.conf
  • topology.conf
string | list
null
controller.image
The image to use for the controller.
object
name: controller
repository:
tag:

controller.labels
Additional labels for controller resources.
object
{}
controller.livenessProbe
The liveness probe for the controller.
object
exec:
    command:
        - scontrol
        - ping
failureThreshold: 6
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 60

controller.priorityClassName
The priority class name for the controller.
string
"sunk-control-plane"
controller.readinessProbe
The readiness probe for the controller.
object
{}
controller.replicas
The number of controller replicas to run. Currently, this should be left at 1.
int
1
controller.resources
Resources for the controller container.
object
limits:
    memory: 64Gi
requests:
    cpu: 16
    memory: 64Gi

controller.securityContext.runAsGroup
The group to run as, must match the slurm GID from the container image.
int
401
controller.securityContext.runAsUser
The user to run as, must match the slurm UID from the container image.
int
401
controller.startupProbe
The startup probe for the controller.
object
{}
controller.stateVolume.size
The size of the persistent volume claim.
string
"32Gi"
controller.stateVolume.storageClassName
The storage class name to use for the volume.
string
"shared-vast"
controller.terminationGracePeriodSeconds
The termination grace period for the controller.
int
30
controller.volumeMounts
Additional volume mounts to apply to the controller pod.
list
[]
controller.volumes
Additional volumes to mount to the controller pod.
list
[]
controller.watch.enabled
Enable watching the Slurm configuration and triggering a reconfigure when there are changes.
bool
true
controller.watch.interval
The interval in seconds to check for changes in the Slurm configuration.
int
60
controller.watch.livenessProbe
The liveness probe for the watch container.
object
null
controller.watch.readinessProbe
The readiness probe for the watch container.
object
null
controller.watch.startupProbe
The startup probe for the watch container.
object
null
directoryService.debugLevel
A bit mask of what SSSD debug levels to enable.
int
0x01F0
directoryService.directories
The directory services to configure. Example configurations:
directories:
  - name: google-example.com
    enabled: true
    ldapUri: ldaps://ldap.google.com:636
    user:
      canary: user@google-example.com
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    overrideHomeDir: /mnt/nvme/home/%u
    ldapsCert: google-ldaps-cert
    schema: rfc2307bis
directories:
  - name: coreweave.cloud
    enabled: true
    ldapUri: ldap://openldap
    user:
      bindDn: cn=admin,dc=coreweave,dc=cloud
      searchBase: dc=coreweave,dc=cloud
      existingSecret: bind-user-sssd-config
      canary: admin
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    schema: rfc2307
directories:
  - name: coreweave.cloud
    enabled: true
    ldapUri: ldap://authentik-outpost-ldap-outpost
    user:
      bindDn: cn=ldapsvc,dc=coreweave,dc=cloud
      searchBase: dc=coreweave,dc=cloud
      existingSecret: bind-user-sssd-config
      canary: ldapsvc
    startTLS: true
    userObjectClass: user
    groupObjectClass: group
    userNameAttr: cn
    groupNameAttr: cn
    schema: rfc2307bis
directories:
  - name: contoso.com
    enabled: true
    ldapUri: ldap://domaincontroller.tenant-my-tenant.coreweave.cloud
    user:
      bindDn: CN=binduser,CN=Users,DC=contoso,DC=com
      searchBase: DC=contoso,DC=com
      existingSecret: bind-user-sssd-config
      canary: binduser
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    schema: AD
list 
directoryService.directories[0].additionalConfig
Multi-line string of additional arbitrary config per domain for sssd.
additionalConfig: |
  ldap_foo = bar
string
null
directoryService.directories[0].defaultShell
The default user shell.
string
"/bin/bash"
directoryService.directories[0].enabled
Enable the directory service.
bool
false
directoryService.directories[0].fallbackHomeDir
The fallback user home directory.
string
"/home/%u"
directoryService.directories[0].ignoreGroupMembers
Overrides the SSSD configuration option of the same name.
If set to true, SSSD retrieves information only about the group objects themselves and not their members, which provides a significant performance boost.
If omitted, defaults to true.
bool
null
directoryService.directories[0].ldapUri
The LDAP URI to use for the directory service.
Example: ldap://YOUR_LDAP_IP
For Google Secure LDAP, use: ldaps://ldap.google.com:636
string
null
directoryService.directories[0].ldapsCert
Name of existing TLS certificate for LDAP-S.
kubectl create secret tls google-ldaps-cert \
        --cert=Google_2025_08_24_55726.crt \
        --key=Google_2025_08_24_55726.key
string
null
directoryService.directories[0].name
Name of the directory service.
The primary domain should always be named: default
string
"default"
directoryService.directories[0].overrideGidAttr
Override the default schema LDAP attribute that corresponds to the user’s primary group ID.
Example: posixGid
string
null
directoryService.directories[0].overrideHomeDir
Override the homeDirectory attribute from LDAP with a provided path.
Example: /mnt/nvme/home/%u
string
null
directoryService.directories[0].overrideUidAttr
Override the default schema LDAP attribute that corresponds to the user’s ID.
Example: posixUid
string
null
directoryService.directories[0].overrideUserNameAttr
Override the default schema LDAP attribute that corresponds to the user’s login name.
Example: employeeNumber
string
null
directoryService.directories[0].schema
The desired LDAP schema for the directory service. Valid values:
  • AD
  • POSIX
  • rfc2307bis
Note: For Google Secure LDAP, use rfc2307bis.
string
"AD"
directoryService.directories[0].user.bindDn
The LDAP bind DN to use for the directory service.
Where bindDn is not required (e.g. Google Secure LDAP), only supply user.canary.
Example: cn=Admin,ou=Users,ou=CORP,dc=corp,dc=example,dc=com
string
null
directoryService.directories[0].user.canary
The username to lookup to confirm LDAP is working.
string
null
directoryService.directories[0].user.existingSecret
Name of an existing secret containing an SSSD configuration snippet with the ldap_default_authtok set for this domain.
string
null
directoryService.directories[0].user.existingSecretFileName
The name of the file in the existing secret that contains the ldap passwords.
string
"ldap-password.conf"
directoryService.directories[0].user.groupSearchBase
The LDAP group search base to use for the directory service.
Example: ou=groups,dc=example,dc=com
string
null
directoryService.directories[0].user.password
The password to use for the directory service lookups.
string
null
directoryService.directories[0].user.searchBase
The LDAP search base to use for the directory service.
Example: dc=corp,dc=example,dc=com
string
null
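The user.existingSecret and user.existingSecretFileName settings above can be sketched as a Secret plus a matching values fragment. The Secret name and bind details are borrowed from the examples under directoryService.directories, and the file name is the documented default. The [domain/default] section header is an assumption about how the SSSD snippet is scoped to a domain, and the password is a placeholder.

```yaml
# Kubernetes Secret holding an SSSD snippet with the bind password.
apiVersion: v1
kind: Secret
metadata:
  name: bind-user-sssd-config
type: Opaque
stringData:
  ldap-password.conf: |
    # Assumption: the section name matches the directory name ("default").
    [domain/default]
    ldap_default_authtok = replace-me
---
# Helm values that reference the Secret above (other settings omitted).
directoryService:
  directories:
    - name: default
      enabled: true
      ldapUri: ldap://openldap
      user:
        bindDn: cn=admin,dc=coreweave,dc=cloud
        searchBase: dc=coreweave,dc=cloud
        existingSecret: bind-user-sssd-config
        existingSecretFileName: ldap-password.conf
        canary: admin
```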
directoryService.negativeCacheTimeout
Negative caching value (in seconds).
Determines how long an invalid entry will be cached before asking LDAP again. This improves the directory listing time when a primary gid cannot be found.
string
"600"
directoryService.sudoGroups
List of Unix groups from all directories with sudo privileges.
Group names are fully qualified for additional directories. Group names are not fully qualified for the default directory (for example, “group1” instead of “group1@domain.com”).
list
[]
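The naming rule for directoryService.sudoGroups can be shown with one group from each kind of directory. Both group names are hypothetical; contoso.com stands in for an additional directory, as in the examples above.

```yaml
# Hypothetical values: sudo groups from the default and an additional directory.
directoryService:
  sudoGroups:
    - slurm-admins              # default directory: not fully qualified
    - hpc-admins@contoso.com    # additional directory: fully qualified
```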
directoryService.watchInterval
The interval in seconds to check for changes in sssd configuration.
int
60
global.annotations
Additional annotations to apply to all resources.
object
{}
global.cks
Enable CoreWeave Kubernetes Services (CKS) integration.
bool
true
global.dnsConfig.additionalSearches
A list of namespaces to add to the list of DNS searches. These additional searches extend hostname lookup in the control-plane, compute, and login pods. Default DNS searches:
  • name-compute.namespace.svc.cluster.local
  • slurm_cluster_name-controller.namespace.svc.cluster.local
list
[]
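A minimal sketch of global.dnsConfig.additionalSearches, assuming two extra namespaces whose services should resolve by short name from the Slurm pods. The namespace names are placeholders.

```yaml
# Hypothetical values: extend DNS searches with two additional namespaces.
global:
  dnsConfig:
    additionalSearches:
      - monitoring        # placeholder namespace
      - shared-services   # placeholder namespace
```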
global.imagePullPolicy
The image pull policy for all containers.
string
"IfNotPresent"
global.labels
Additional labels to apply to all resources.
object
{}
global.nodeSelector.affinity
The affinity for the Slurm control-plane components.
object
nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
            - matchExpressions:
                - key: node.coreweave.cloud/class
                  operator: In
                  values:
                    - cpu

global.volumeMounts
The list of volume mounts to apply to all compute, controller, accounting, and login pods
list
[]
global.volumes
The list of volumes to mount to all compute, controller, accounting, and login pods
list
[]
imagePullSecrets
The list of secrets used to access images in a private registry.
list
[]
jwt.existingSecret
The name of an existing secret containing the JWT private key, otherwise the chart will generate one.
string
null
login.annotations
Additional annotations.
object
{}
login.automountServiceAccountToken
Automatically mount the service account token into the login pod.
bool
false
login.containers
Additional sidecar containers to add to the login pod.
list
[]
login.enabled
Enable the login nodes
bool
true
login.env
Additional environment variables to pass to the sshd container.
list
[]
login.hostAliases
Provides Pod-level override of hostname resolution when DNS and other options are not applicable in login pods. See Adding entries to Pod /etc/hosts with HostAliases for more information.
list
[]
login.image
The image to use for the login node.
object
name: controller-extras
repository:
tag:

login.individualResources
Resources for the slurm-login pod sshd container.
object
limits:
    memory: 2Gi
requests:
    cpu: 500m
    memory: 300Mi

login.labels
Additional labels.
object
null
login.nodeSelector.affinity
The affinity for the login nodes. This overrides the value of global.nodeSelector.affinity.
object
null
login.priorityClassName
The priority class name for the login pod.
string
null
login.replicas
The number of replicas of the login node.
When running more than one, a pod-specific service is created for each one in addition to the main service.
int
1
login.resources.limits.memory
string
"8Gi"
login.resources.requests.cpu
int
4
login.resources.requests.memory
string
"8Gi"
login.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information.
s6:
  packages:
    type: oneshot
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      apt -y update
      apt -y install nginx
  nginx:
    type: longrun
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      nginx -g "daemon off;"
object
{}
login.service.additionalPorts
Additional port definitions to expose.
additionalPorts:
  - name: eternal-shell
    port: 2022
    targetPort: 20222 # optional
    protocol: TCP # optional
list
[]
login.service.enabled
Enable the creation of service(s) for login pods.
bool
true
login.service.externalTrafficPolicy
The external traffic policy.
string
"Local"
login.service.loadBalancerClass
The load balancer class to use for the login services.
string
null
login.service.metadata.0.annotations
Additional annotations to apply to the first login service (0).
object
{}
login.service.metadata.0.labels
Additional labels to apply to the first login service (0).
object
{}
login.service.metadata.common.annotations
Additional annotations to apply to the common login service.
object
{}
login.service.metadata.common.labels
Additional labels to apply to the common login service.
object
{}
login.service.metadata.global.annotations
Additional annotations to apply to all login services.
object
null
login.service.metadata.global.labels
Additional labels to apply to all login services.
object
{}
login.service.type
The type of service to create. This defaults to LoadBalancer for cloud deployments. For development and test systems without an external load balancer to handle the service routing, such as when deploying on kind (Kubernetes IN Docker), this may be set to ClusterIP.
string
"LoadBalancer"
login.serviceAccountName
The service account name to use for the login pod.
string
"default"
login.sshKeyVolume.accessModes
The access mode for the storage. If scaling login beyond 1 replica, this must be ReadWriteMany. In a development setting with a volume provider that doesn’t support ReadWriteMany, such as kind (Kubernetes IN Docker), this may be set to ReadWriteOnce.
string
[
  "ReadWriteMany"
]
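For the development case mentioned under login.service.type and login.sshKeyVolume.accessModes, the overrides might look like the sketch below. It assumes a single login replica on kind; the standard storage class is an assumption about the kind environment, not a chart default.

```yaml
# Hypothetical values for a local kind cluster with one login replica.
login:
  replicas: 1
  service:
    type: ClusterIP             # no external load balancer available
  sshKeyVolume:
    enabled: true
    accessModes:
      - ReadWriteOnce           # provider does not support ReadWriteMany
    storageClassName: standard  # assumption: the storage class offered by kind
```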
login.sshKeyVolume.enabled
Enable the ssh key volume, to allow keys to be mounted and persisted in the login pod.
If this is disabled the host keys for the login pod will be regenerated on each container restart.
bool
true
login.sshKeyVolume.size
The size of the persistent volume claim.
string
"1Gi"
login.sshKeyVolume.storageClassName
The storage class name to use for the volume.
string
"shared-vast"
login.sshdConfig
Additional sshd configuration to add to the login pod.
sshdConfig:
  PasswordAuthentication: "no"
string
null
login.sshdLivenessProbe.config
The liveness probe for the login sshd container.
object
failureThreshold: 10
initialDelaySeconds: 10
periodSeconds: 5
tcpSocket:
    port: 22

login.sshdLivenessProbe.enabled
Whether the liveness probe for the login sshd container is enabled.
bool
false
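Since the sshd liveness probe is disabled by default, turning it on takes an explicit override. The sketch below reuses the documented default probe configuration and only flips the enabled flag.

```yaml
# Values sketch: enable the sshd liveness probe with the documented defaults.
login:
  sshdLivenessProbe:
    enabled: true
    config:
      tcpSocket:
        port: 22
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 10
```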
login.sshdReadinessProbe.config
The readiness probe for the login sshd container.
object
{}
login.sshdReadinessProbe.enabled
Whether the readiness probe for the login sshd container is enabled.
bool
false
login.sshdStartupProbe.config
The startup probe for the login sshd container.
object
{}
login.sshdStartupProbe.enabled
Whether the startup probe for the login sshd container is enabled.
bool
false
login.terminationGracePeriodSeconds
The termination grace period for the login pod.
int
30
login.updateStrategy
The update strategy for the login node. The default type is RollingUpdate.
object
{}
login.volumeMounts
Additional volume mounts to apply to the login pod.
list
[]
login.volumes
Additional volumes to add to the login pod.
volumes:
  - name: cache-vol
    emptyDir:
      medium: Memory
list
[]
moco
Options for MOCO MySQL used for Slurm job accounting.
object
See individual settings below.
moco.enabled
Enable MOCO.
bool
false
moco.migration.enabled
When enabled is true, a Kubernetes Job is created to migrate the Slurm accounting database to MOCO MySQL. This Job runs once and then completes. Any existing Slurm accounting database in the Bitnami MySQL database is migrated to the MOCO MySQL database. Set this to true only for the initial migration, then set it back to false to return the cluster to normal operation. During this automated migration, the Slurm cluster is not in a functional state.
bool
false
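The two-phase use of moco.migration.enabled described above can be sketched as two successive values overrides. This is only an outline of the toggle; it assumes each YAML document below is applied in its own upgrade of the release.

```yaml
# Phase 1: one-time migration. The Slurm cluster is not functional while this runs.
moco:
  enabled: true
  migration:
    enabled: true
---
# Phase 2: after the migration Job completes, return to normal operation.
moco:
  enabled: true
  migration:
    enabled: false
```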
moco.mysqlCluster.affinity
Optional pod affinity configuration
object
{}
moco.mysqlCluster.auth.existingSecret
Optional. If not set, the randomly generated MOCO WRITABLE_PASSWORD is used. Specify the name of the Kubernetes Secret that contains the password used by slurmdbd to access the MOCO MySQL database. This Secret must contain a data key named WRITABLE_PASSWORD whose value is the actual database password.
string
null
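If a fixed password is preferred over the generated one, the Secret for moco.mysqlCluster.auth.existingSecret could look like the sketch below. The Secret name and password are placeholders; only the WRITABLE_PASSWORD key comes from the description above.

```yaml
# Kubernetes Secret with the MOCO database password (the name is hypothetical).
apiVersion: v1
kind: Secret
metadata:
  name: moco-slurm-credentials
type: Opaque
stringData:
  WRITABLE_PASSWORD: "replace-me"   # placeholder password
---
# Helm values that reference the Secret above.
moco:
  enabled: true
  mysqlCluster:
    auth:
      existingSecret: moco-slurm-credentials
```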
moco.mysqlCluster.auth.storageHost
The hostname of the server where the database resides.
string
"moco-{{ .Release.Name }}"
moco.mysqlCluster.auth.storageLoc
The name of the database used to store Slurm accounting records. Defaults to “slurm_acct_db”.
string
"slurm_acct_db"
moco.mysqlCluster.auth.storageUser
The username slurmdbd uses for authentication and storing job accounting data.
string
"moco-writable"
moco.mysqlCluster.config
Additional MySQL configuration to add to the mysqlCluster. This is placed in a ConfigMap and referenced by the mysqlCluster. The contents of the ConfigMap are rendered as a template, so Helm expressions can be used. The result must render as valid YAML. See the MySQL option file documentation.
config: |
  max_connections: 1000
  wait_timeout: 28800
string
null
moco.mysqlCluster.image
The image to use for mysql.
object
repository: ghcr.io/cybozu-go/moco/mysql
tag: 8.4.6

moco.mysqlCluster.inodeLockFixer.enabled
Configure init container to fix inode locking issues. This init container copies, moves, and replaces the MySQL data directory to prevent known inode locking issues that can occur in certain storage environments.
bool
true
moco.mysqlCluster.inodeLockFixer.image
The image to use for the inode lock fixer init container.
object
repository: alpine
tag: 3.20.0

moco.mysqlCluster.persistence
The volume settings to use for mysql.
object
storageClassName:
size: 128Gi
accessModes:
    - ReadWriteOnce

moco.mysqlCluster.resources
Resources for the moco mysqld container.
object
requests:
    memory: 64Gi
    cpu: "16"
limits:
    memory: 64Gi

moco.priorityClassName
The priority class name for moco.
string
"sunk-control-plane"
munge.args
The additional arguments to pass to the munge container. The defaults run Munge with 10 threads instead of 2.
list
[
  "--num-threads",
  "10"
]
munge.livenessProbe
Liveness probe for the munge container. When munged hangs (e.g. thread deadlock), munge -n blocks on the Unix socket and the probe times out, causing Kubernetes to restart only the munged container.
object
exec:
    command: ["munge", "-n"]
initialDelaySeconds: 60
periodSeconds: 180
timeoutSeconds: 180
failureThreshold: 5
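With the defaults above, a hung munged is restarted only after five consecutive probe failures, which is on the order of 15 minutes (about 5 x 180s). As a rough sketch, a tighter override might look like this; the specific numbers are illustrative assumptions, not recommendations from the chart.

```yaml
munge:
  livenessProbe:
    exec:
      command: ["munge", "-n"]
    initialDelaySeconds: 60
    # Illustrative values: probe every 60s and allow 30s per attempt.
    periodSeconds: 60
    timeoutSeconds: 30
    # Restart after 3 consecutive failures, roughly 3 x 60s = 180s of sustained hang.
    failureThreshold: 3
```

Shorter timeouts react faster to a deadlock but raise the chance of restarting a munged that is only briefly slow under load.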

munge.readinessProbe
The readiness probe for the munge container.
object
map[]
munge.resources
Resources for the munge container.
object
limits:
    memory: 2Gi
requests:
    cpu: 1
    memory: 2Gi

munge.securityContext.runAsGroup
The group to run as, must match the munge GID from the container image.
int
400
munge.securityContext.runAsUser
The user to run as, must match the munge UID from the container image.
int
400
munge.startupProbe
The startup probe for the munge container.
object
map[]
mysql
Options for Bitnami MySQL chart, uses Bitnami default values. There is an added option here: vmPodScrape.enabled which can be used as an alternative to serviceMonitor.enabled.
object
See Bitnami default values.
nsscache.annotations
Additional annotations for nsscache Job resources.
object
{}
nsscache.cronJobSchedule
The schedule for the nsscache update CronJob. It should be formatted according to the cron convention.
string
"* * * * *"
nsscache.enabled
Enable nsscache.
bool
true
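For `nsscache.cronJobSchedule`, the default `* * * * *` runs the update every minute. A sketch of a less frequent schedule, assuming a five-minute refresh is acceptable for your directory:

```yaml
nsscache:
  enabled: true
  # Fields: minute hour day-of-month month day-of-week
  cronJobSchedule: "*/5 * * * *"   # illustrative: every 5 minutes
```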
nsscache.existingSecret
Name of an existing secret containing the LDAP password for this domain. This secret should contain a key named nsscache-ldap-password which contains the password to use for the LDAP bind DN. For SCIM, this secret should contain a key named nsscache-scim-auth-token which contains the token to use for the SCIM server.
string
null
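A minimal sketch of such a Secret and the matching values entry. The Secret name `nsscache-credentials` and the placeholder values are hypothetical; the key names are the ones given in the description above.

```yaml
# Hypothetical Secret manifest, applied separately from the chart.
apiVersion: v1
kind: Secret
metadata:
  name: nsscache-credentials   # hypothetical name
type: Opaque
stringData:
  # Use this key for an LDAP source:
  nsscache-ldap-password: "replace-with-bind-password"
  # Or this key for a SCIM source:
  # nsscache-scim-auth-token: "replace-with-scim-token"
---
# values.yaml fragment referencing the Secret above.
nsscache:
  existingSecret: nsscache-credentials
```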
nsscache.labels
Additional labels for nsscache Job resources.
object
{}
nsscache.nodeSelector.affinity
The affinity for the nsscache Job. This overrides the value of global.nodeSelector.affinity.
object
null
nsscache.nsscacheConfig
Options for defining nsscache.conf. The examples below show an LDAP source followed by a SCIM source.
nsscacheConfig:
  default:
    source: ldap
    ldap_uri: ldap://authentik-outpost-ldap-outpost
    ldap_base: dc=coreweave,dc=cloud
    ldap_bind_dn: cn=ldapsvc,dc=coreweave,dc=cloud
    ldap_bind_password:
    ldap_rfc2307bis: 1
    ldap_default_shell: /bin/bash
    ldap_scope: sub
    ldap_uidattr: cn
  passwd:
    ldap_filter: (objectClass=user)
    ldap_override_home_dir: /mnt/home/%%u
  group:
    ldap_filter: (objectClass=group)
  shadow:
    ldap_filter: (objectClass=user)
  sshkey:
    ldap_filter: (objectClass=user)
nsscacheConfig:
  default:
    source: scim
    scim_base_url: https://api.coreweave.com/scim/abc123
    scim_users_parameters: filter=active eq "true"&groups=slurm-users,slurm-admins
    scim_groups_parameters: excludeInactiveUsers=true&includeVirtualUserGroups=slurm-users,slurm-admins
object
See the nsscache.conf documentation.
nsscache.nsscacheConfig.default.cache
The means by which the cache data is stored.
string
"files"
nsscache.nsscacheConfig.default.files_cache_filename_suffix
A suffix appended to the cache filename to differentiate it from, say, system NSS databases.
string
"cache"
nsscache.nsscacheConfig.default.files_dir
Directory location to store the plain text files in.
string
"/etc/nsscache"
nsscache.nsscacheConfig.default.ldap_base
The base to perform LDAP searches under.
Example: dc=coreweave,dc=cloud
string
null
nsscache.nsscacheConfig.default.ldap_bind_dn
The bind DN to use when connecting to LDAP. Empty string is an anonymous bind.
Example: cn=ldapsvc,dc=coreweave,dc=cloud
string
null
nsscache.nsscacheConfig.default.ldap_bind_password
The password to use for the LDAP bind DN. We strongly recommend using a Kubernetes secret to store this password and reference it using the nsscache.existingSecret value.
string
null
nsscache.nsscacheConfig.default.ldap_default_shell
This will be the default shell for all users. You can specify a different shell by setting the loginShell value in the user attributes in the source directory configuration.
Example: /bin/bash
string
null
nsscache.nsscacheConfig.default.ldap_rfc2307bis
Example: 1
int
null
nsscache.nsscacheConfig.default.ldap_scope
The search scope to use for LDAP. Example: sub
string
null
nsscache.nsscacheConfig.default.ldap_uidattr
The uid-like attribute in your LDAP directory. Example: cn
string
null
nsscache.nsscacheConfig.default.ldap_uri
The LDAP URI to connect to.
Example: ldap://authentik-outpost-ldap-outpost
string
null
nsscache.nsscacheConfig.default.maps
The recommended defaults below are useful for standard nsscache operation in many environments.
list
[
  "passwd",
  "shadow",
  "group",
  "sshkey"
]
nsscache.nsscacheConfig.default.scim_base_url
The base URL for the SCIM server.
Example: https://api.coreweave.com/scim/<org>
string
null
nsscache.nsscacheConfig.default.scim_groups_endpoint
The endpoint for the SCIM groups API.
string
"CoreWeaveGroups"
nsscache.nsscacheConfig.default.scim_groups_parameters
URL parameters for the groups endpoint. Special characters (spaces, quotes, and so on) are URL-encoded automatically. The custom parameter includeVirtualUserGroups takes a comma-separated list of groups; for the members of each selected group, it creates an entry in the groups map for the user’s GID. This parameter should typically match any group filtering in scim_users_parameters. The default excludes inactive users.
Example: excludeInactiveUsers=true&includeVirtualUserGroups=slurm-users,slurm-admins
string
"excludeInactiveUsers=true"
nsscache.nsscacheConfig.default.scim_users_endpoint
The endpoint for the SCIM users API.
string
"Users"
nsscache.nsscacheConfig.default.scim_users_parameters
URL parameters for the users endpoint. Special characters (spaces, quotes, and so on) are URL-encoded automatically. The custom parameter groups filters users by a comma-separated list of groups. The default excludes inactive users.
Example: filter=active eq "true"&groups=slurm-users,slurm-admins
string
filter=active eq "true"
nsscache.nsscacheConfig.default.source
Specify the data source to use. Supported options are scim and ldap.
string
"scim"
nsscache.nsscacheConfig.default.timestamp_dir
The location of the timestamps used for incremental updates.
string
"/var/lib/nsscache"
nsscache.nsscacheConfig.group.scim_path_gid
The SCIM path for the GID attribute.
string
"sunkPosixGroupId"
nsscache.nsscacheConfig.group.scim_path_groupname
The SCIM path for the group name attribute. Used when the SCIM server provides a custom field for group names. If not specified or the path returns no value, nsscache will fall back to using displayName, name, or id from the SCIM group resource.
string
"sunkPosixGroupName"
nsscache.nsscacheConfig.group.scim_path_username
The SCIM path for the username attribute of group members.
string
"members/sunkPosixUsername"
nsscache.nsscacheConfig.passwd.ldap_filter
The search filter to use when querying.
Example: (objectClass=user)
string
null
nsscache.nsscacheConfig.passwd.ldap_override_home_dir
Overrides the home directory for all users. %%u is replaced with the username. This should match the mount found in compute.VolumeMounts.
Example: /mnt/home/%%u
string
null
nsscache.nsscacheConfig.passwd.scim_default_shell
This will be the default shell for all users.
string
"/bin/bash"
nsscache.nsscacheConfig.passwd.scim_override_home_directory
Overrides the home directory for all users. %%u is replaced with the username. This should match the mount found in compute.VolumeMounts.
Example: /mnt/home/%%u
string
"/mnt/home/%%u"
nsscache.nsscacheConfig.passwd.scim_path_gid
The SCIM path for the GID attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPosixGroupId"
nsscache.nsscacheConfig.passwd.scim_path_home_directory
The SCIM path for the home directory attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPreferredHomeDirectory"
nsscache.nsscacheConfig.passwd.scim_path_login_shell
The SCIM path for the login shell attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkLoginShell"
nsscache.nsscacheConfig.passwd.scim_path_uid
The SCIM path for the UID attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPosixUserId"
nsscache.nsscacheConfig.passwd.scim_path_username
The SCIM path for the username attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPosixUsername"
nsscache.nsscacheConfig.shadow.ldap_filter
The search filter to use when querying.
Example: (objectClass=user)
string
null
nsscache.nsscacheConfig.shadow.scim_path_username
The SCIM path for the username attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPosixUsername"
nsscache.nsscacheConfig.sshkey.ldap_filter
The search filter to use when querying.
Example: (objectClass=user)
string
null
nsscache.nsscacheConfig.sshkey.scim_path_ssh_keys
The SCIM path for the SSH keys attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkSshKeys"
nsscache.nsscacheConfig.sshkey.scim_path_username
The SCIM path for the username attribute.
string
"urn:coreweave:params:scim:schemas:extension:coreweave:2.0:CoreWeaveUser/sunkPosixUsername"
nsscache.nsswitchConfig
Options for defining nsswitch.conf.
objectSee the nsswitch.conf documentation.
nsscache.nsswitchConfig.aliases
Mail aliases, used by getaliasent(3) and related functions.
list
[]
nsscache.nsswitchConfig.ethers
Ethernet numbers.
list
[]
nsscache.nsswitchConfig.group
Groups of users, used by getgrent(3) and related functions.
list
[
  "files",
  "cache"
]
nsscache.nsswitchConfig.hosts
Host names and numbers, used by gethostbyname(3) and related functions.
list
[]
nsscache.nsswitchConfig.initgroups
Supplementary group access list, used by getgrouplist(3) function.
list
[]
nsscache.nsswitchConfig.netgroup
Network-wide list of hosts and users, used for access rules. C libraries before glibc 2.1 supported netgroups only over NIS.
list
[]
nsscache.nsswitchConfig.networks
Network names and numbers, used by getnetent(3) and related functions.
list
[]
nsscache.nsswitchConfig.passwd
User passwords, used by getpwent(3) and related functions.
list
[
  "files",
  "cache"
]
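A sketch of an nsswitchConfig override, assuming each list is rendered in order as the sources for the matching nsswitch.conf database (for example `passwd: files cache`). That rendering detail is an assumption inferred from the defaults for `passwd` and `group`, and adding `cache` to `shadow` is illustrative only.

```yaml
nsscache:
  nsswitchConfig:
    passwd: ["files", "cache"]   # chart default
    group: ["files", "cache"]    # chart default
    shadow: ["files", "cache"]   # illustrative; the chart default is an empty list
```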
nsscache.nsswitchConfig.protocols
Network protocols, used by getprotoent(3) and related functions.
list
[]
nsscache.nsswitchConfig.publickey
Public and secret keys for Secure_RPC used by NFS and NIS+.
list
[]
nsscache.nsswitchConfig.rpc
Remote procedure call names and numbers, used by getrpcbyname(3) and related functions.
list
[]
nsscache.nsswitchConfig.services
Network services, used by getservent(3) and related functions.
list
[]
nsscache.nsswitchConfig.shadow
Shadow user passwords, used by getspnam(3) and related functions.
list
[]
nsscache.priorityClassName
The priority class name for the nsscache Job pod.
string
"sunk-control-plane"
nsscache.resources
Resources for the nsscache Job container.
object
limits:
    memory: 500Mi
requests:
    cpu: 200m
    memory: 100Mi

nsscache.slurmUserProvisioning.defaultSlurmAccount
The default Slurm account for automated provisioning of users.
string
"cw-sup"
nsscache.slurmUserProvisioning.dryRun
Enable dry run mode - shows what would be done without making changes.
bool
false
nsscache.slurmUserProvisioning.enabled
Enable slurmUserProvisioning.
bool
true
nsscache.slurmUserProvisioning.interval
The interval in seconds between user sync runs.
int
60
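A sketch that combines the slurmUserProvisioning settings for a first trial run. The account name `research` and the 300-second interval are hypothetical choices.

```yaml
nsscache:
  slurmUserProvisioning:
    enabled: true
    dryRun: true                    # report intended changes without applying them
    defaultSlurmAccount: research   # hypothetical account name
    interval: 300                   # seconds between user sync runs
```

Setting `dryRun` back to `false` applies the changes once the reported actions look right.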
nsscache.sudoGroups
List of Unix groups with sudo privileges.
list
[]
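For example, to grant sudo to a single group (the group name is hypothetical and must exist in your directory source):

```yaml
nsscache:
  sudoGroups:
    - slurm-admins   # hypothetical Unix group
```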
nsscache.tolerations
The tolerations for the nsscache Job.
list
null
rest.annotations
Additional annotations for REST API resources.
object
{}
rest.args
The additional arguments to pass to the rest container. The defaults enable debug logging and load only the most recent OpenAPI plugins.
list
[
  "-vv",
  "-sslurmdbd,slurmctld",
  "-dv0.0.40"
]
rest.containers
Additional sidecar containers to add to the restd pod.
list
[]
rest.enabled
Enable the REST API deployment.
This is optional and should be disabled for most use cases.
bool
false
rest.env
The additional environment variables to pass to the rest container.
list
[
  {
    "name": "SLURMRESTD_JSON",
    "value": "compact"
  }
]
rest.image
The image to use for the REST API deployment.
object
name: controller
repository:
tag:

rest.labels
Additional labels for REST API resources.
object
{}
rest.livenessProbe
The liveness probe for the rest container.
object
tcpSocket:
    port: slurmrestd
failureThreshold: 2
periodSeconds: 10

rest.priorityClassName
The priority class name for the rest pod.
string
null
rest.readinessProbe
The readiness probe for the rest container.
object
tcpSocket:
    port: slurmrestd
periodSeconds: 5
failureThreshold: 1

rest.replicas
The number of replicas of the rest pod to run.
In most production environments this should be set to a minimum of 2 to provide HA.
int
1
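A minimal sketch of enabling the REST API with two replicas, following the HA guidance above:

```yaml
rest:
  enabled: true
  replicas: 2   # minimum suggested above for production HA
```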
rest.resources
Resources for the slurmrestd container.
These defaults are appropriate for small and medium-sized clusters.
object
limits:
    memory: 64Gi
requests:
    cpu: 2
    memory: 8Gi

rest.securityContext.runAsGroup
The group to run as, GID must exist in the container image.
int
65534
rest.securityContext.runAsUser
The user to run as, UID must exist in the container image.
int
65534
rest.service.additionalPorts
Additional port definitions to expose.
additionalPorts:
  - name: proxy
    port: 8080
    targetPort: 8080 # optional
    protocol: TCP # optional
list
[]
rest.service.annotations
Additional annotations to apply to the rest service.
object
{}
rest.service.clusterIP
string
"None"
rest.service.enabled
Enable the creation of service for rest pods.
bool
true
rest.service.externalTrafficPolicy
The external traffic policy.
string
null
rest.service.labels
Additional labels to apply to the rest service.
object
{}
rest.service.loadBalancerClass
The load balancer class to use for the rest services.
string
null
rest.service.type
The type of service to create. This defaults to ClusterIP.
string
"ClusterIP"
rest.startupProbe
The startup probe for the rest container.
object
tcpSocket:
    port: slurmrestd
failureThreshold: 20
periodSeconds: 2

rest.terminationGracePeriodSeconds
The termination grace period for the rest pod.
int
5
rest.volumeMounts
Additional volume mounts to apply to the rest pod.
list
[]
rest.volumes
Additional volumes to add to the restd pod.
volumes:
  - name: cache-vol
    emptyDir:
      medium: Memory
list
[]
scheduler.annotations
Additional annotations for scheduler resources.
object
{}
scheduler.config.scheduler.gpuTypes
Mapping of Kubernetes GPU types to Slurm GPU types. The keys are the GPU types requested during scheduling through node affinity on the key “gpu.nvidia.com/class”, and the values are the corresponding GRES GPU types in Slurm. This gets added to a job’s description.
map
{
  "A100_NVLINK_80GB": "a100",
  "H100_NVLINK_80GB": "h100"
}
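To illustrate the mapping, the fragment below keeps the two default entries and adds a third. The `H200` class and `h200` GRES type are hypothetical and shown only as an example of adding an entry; use the class values and GRES types that exist in your cluster.

```yaml
scheduler:
  config:
    scheduler:
      gpuTypes:
        A100_NVLINK_80GB: a100   # chart default
        H100_NVLINK_80GB: h100   # chart default
        H200: h200               # hypothetical addition
```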
scheduler.config.scheduler.pollInterval
The polling interval for the Slurm API.
string
"10s"
scheduler.config.scheduler.terminationOffset
Offset applied to the termination grace period to account for communication delays and similar overhead.
string
"5s"
scheduler.config.slurm.poolSize
The number of connections to be maintained in the connection pool.
int
10
scheduler.config.slurm.protocolVersion
The protocol version to use for communication with the Slurm controller.
string
"25_05"
scheduler.config.slurm.usePersistentConnection
Use Slurm’s persistent connections for connection reuse.
bool
true
scheduler.controllerAddress
The address of the Slurm controller to connect to.
This should be the service address of the controller in host:port format.
string
""
scheduler.enabled
Enable the scheduler.
To schedule k8s pods on the Slurm cluster nodes, this must be enabled.
bool
false
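A sketch of enabling the scheduler together with `scheduler.controllerAddress`. The hostname is hypothetical; the port 6817 matches the default `slurmConfig.SlurmctldPort` listed in this reference.

```yaml
scheduler:
  enabled: true
  # host:port of the Slurm controller Service (hypothetical hostname).
  controllerAddress: "slurm-controller.tenant-slurm.svc.cluster.local:6817"
```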
scheduler.hooksAPI
Configuration for the webhooks.
object
{
  "waitForPodDeletionInterval": "1s"
}
scheduler.hooksAPI.waitForPodDeletionInterval
The polling interval when checking for pod deletion.
string
"1s"
scheduler.image
The image to use for the scheduler.
object
repository: registry.gitlab.com/coreweave/sunk/operator
tag:

scheduler.labels
Additional labels for scheduler resources.
object
{}
scheduler.livenessProbe
The liveness probe for the scheduler container.
object
httpGet:
    path: /healthz
    port: 8081
initialDelaySeconds: 15
periodSeconds: 20

scheduler.logLevel
The log level.
Uses integers or zap log level strings:
  • debug
  • info
  • warn
  • error
  • dpanic
  • panic
  • fatal
string
"info"
scheduler.maxConcurrentReconciles
The maximum concurrent reconciles.
This should be adjusted based on the volume of pods using the scheduler, so that bursts of operations are handled quickly. The size of both the Slurm and Kubernetes clusters affects this, but less than it affects the syncer. The driving factor tends to be the pod volume and associated Slurm jobs more than anything else. Using the same value as the syncer is a fairly conservative starting point in many use cases.
int
50
scheduler.name
The name of the scheduler used to select the scheduler during pod creation.
By default the name is based on the namespace and release name <namespace>-<release>-scheduler when not set.
string
null
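A sketch of a pod that selects this scheduler, assuming the chart is installed as release `slurm` in namespace `tenant-slurm` (both hypothetical), which would give the default name `tenant-slurm-slurm-scheduler`. The image and command are placeholders, and any other fields the scheduler may require are omitted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
  namespace: tenant-slurm                        # hypothetical namespace
spec:
  schedulerName: tenant-slurm-slurm-scheduler   # <namespace>-<release>-scheduler
  containers:
    - name: main
      image: busybox:1.36                        # placeholder image
      command: ["sleep", "3600"]
```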
scheduler.priorityClassName
The priority class name for the scheduler pod.
string
"sunk-control-plane"
scheduler.readinessProbe
The readiness probe for the scheduler container.
object
httpGet:
    path: /readyz
    port: 8081
initialDelaySeconds: 5
periodSeconds: 10

scheduler.resources
Resources for the scheduler container.
object
limits:
    memory: 24Gi
    cpu: 16
requests:
    cpu: 4
    memory: 24Gi

scheduler.scope.namespaces
The list of the namespaces to scope the scheduler to.
Only used when scope.type is set to namespace. Namespaces other than the release namespace will need role bindings created.
list
[.Release.Namespace]
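A sketch of scoping the scheduler to two namespaces. Both namespace names are hypothetical; per the note above, the second one would need its own role bindings.

```yaml
scheduler:
  scope:
    type: namespace
    namespaces:
      - tenant-slurm   # hypothetical: the release namespace
      - tenant-batch   # hypothetical: requires separately created role bindings
```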
scheduler.scope.type
The type can be cluster or namespace.
string
"namespace"
scheduler.startupProbe
The startup probe for the scheduler container.
object
map[]
secretJob.annotations
Additional annotations for secret Job resources.
object
{}
secretJob.labels
Additional labels for secret Job resources.
object
{}
secretJob.nodeSelector.affinity
The affinity for the secret job. This overrides the value of global.nodeSelector.affinity.
object
null
secretJob.priorityClassName
The priority class name for the secret job pod.
string
"sunk-control-plane"
secretJob.resources
Resources for the secret job container.
object
limits:
    memory: 500Mi
requests:
    cpu: 200m
    memory: 100Mi

secretJob.tolerations
The tolerations for the secret job.
list
null
slurm-login
Configure individual login nodes via slurm-login subchart. Below is an example showing some of the key parameters of the subchart, see subchart docs for all parameters.
slurm-login:
  enable: true
  directoryCache:
    # select users from two groups
    selectGroups: ["slurm-researchers", "slurm-ops"]
    # poll every minute (default 90s)
    interval: 1m
    # enable nsscache source
    source: nsscache

    # SSSD options: directoryService is not needed when using nsscache; use only with SSSD.
    # Google Secure LDAP
    directoryService:
      directories:
        - name: google-example.com
          enabled: true
          ldapUri: ldaps://ldap.google.com:636
          user:
            canary: user@google-example.com
          defaultShell: "/bin/bash"
          fallbackHomeDir: "/home/%u"
          overrideHomeDir: /mnt/nvme/home/%u
          ldapsCert: google-ldaps-cert
          schema: rfc2307bis
object
See default values in slurm-login subchart.
slurmConfig.AccountingStorageEnforce
Controls what level of association-based enforcement to impose on job submissions.
Multiple values allowed. Valid options are any combination of:
  • associations
  • limits
  • nojobs
  • nosteps
  • qos
  • safe
  • wckeys
Use all to impose everything except nojobs and nosteps, which must be requested separately. See the Slurm documentation for more details.
list
[
  "qos",
  "limits"
]
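A sketch of a stricter setting than the default. Which combination is appropriate depends on how associations and QoS are managed in your accounting database; this selection is illustrative.

```yaml
slurmConfig:
  AccountingStorageEnforce:
    - associations
    - limits
    - qos
    - safe
```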
slurmConfig.AccountingStorageTRES
string
"gres/gpu"
slurmConfig.AccountingStorageType
string
"accounting_storage/slurmdbd"
slurmConfig.AuthAltParameters[0]
string
"jwt_key=/etc/jwt/jwt.key"
slurmConfig.AuthAltTypes[0]
string
"auth/jwt"
slurmConfig.BatchStartTimeout
int
120
slurmConfig.CommunicationParameters
The list of communication parameters to pass to slurmctld. See the Slurm documentation for possible values.
list
- KeepAliveTime=60
- keepaliveinterval=10
- keepaliveprobes=3

slurmConfig.DebugFlags
Comma-separated debug flags for logging.
string
"NO_CONF_HASH"
slurmConfig.DefMemPerCPU
The default memory per CPU in megabytes.
Sets the slurm.conf parameter of the default real memory size available per usable allocated CPU in megabytes. This value is used when the --mem-per-cpu option is not specified on the srun command line.
int
4096
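As a worked illustration with a hypothetical lower value: the default memory for a job scales with the number of allocated CPUs.

```yaml
slurmConfig:
  # Illustrative: a job allocated 8 CPUs with no explicit memory option
  # would default to 8 x 2048 MB = 16384 MB.
  DefMemPerCPU: 2048
```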
slurmConfig.Epilog
string
"/usr/share/sunk/bin/epilog.sh"
slurmConfig.GresTypes[0]
string
"gpu"
slurmConfig.InactiveLimit
Terminate job allocation commands, such as srun or salloc, that are unresponsive longer than this interval in seconds. See the slurm.conf reference for more details.
int
0
slurmConfig.JobAcctGatherFrequency
int
30
slurmConfig.JobAcctGatherType
string
"jobacct_gather/cgroup"
slurmConfig.JobCompType
string
"jobcomp/none"
slurmConfig.JobSubmitPlugins
The job submit plugins to use.
list
[]
slurmConfig.KillWait
The interval in seconds between the SIGTERM and SIGKILL signals given to a job’s processes upon reaching its time limit. See the slurm.conf reference for more details.
int
30
slurmConfig.MaxNodeCount
int
3072
slurmConfig.MessageTimeout
int
100
slurmConfig.MinJobAge
int
300
slurmConfig.MpiDefault
string
"pmix"
slurmConfig.ProctrackType
The plugin to be used for process tracking on a job step basis. See the Slurm documentation for more details.
Valid values:
  • proctrack/linuxproc
  • proctrack/cgroup
string
"proctrack/cgroup"
slurmConfig.Prolog
string
"/usr/share/sunk/bin/prolog.sh"
slurmConfig.PrologFlags[0]
string
"Alloc"
slurmConfig.PrologFlags[1]
string
"Serial"
slurmConfig.RebootProgram
string
"/usr/share/sunk/bin/reboot.sh"
slurmConfig.ReturnToService
int
2
slurmConfig.SUNKJobDashboardURL
string
null
slurmConfig.SUNKNodeDashboardURL
string
null
slurmConfig.SchedulerParameters[0]
string
"nohold_on_prolog_fail"
slurmConfig.SchedulerParameters[1]
string
"max_rpc_cnt=256"
slurmConfig.SchedulerType
string
"sched/backfill"
slurmConfig.SelectType
string
"select/cons_tres"
slurmConfig.SelectTypeParameters
The values to use for the parameters of the select/cons_tres plugin.
Allowed values depend on the configured value of SelectType. See the slurm.conf reference for more details.
string
"CR_CPU_MEMORY"
slurmConfig.SlurmSchedLogFile
string
"/dev/null"
slurmConfig.SlurmSchedLogLevel
int
1
slurmConfig.SlurmUser
string
"slurm"
slurmConfig.SlurmctldDebug
string
"verbose"
slurmConfig.SlurmctldLogFile
string
"/dev/null"
slurmConfig.SlurmctldParameters
The list of additional parameters to pass to slurmctld. See the Slurm documentation for possible values.
list
- idle_on_node_suspend
- node_reg_mem_percent=95

slurmConfig.SlurmctldPidFile
string
"/var/run/slurmctld.pid"
slurmConfig.SlurmctldPort
int
6817
slurmConfig.SlurmctldTimeout
The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. Slurm’s built-in default is 120 seconds; this chart sets 60. May not exceed 65533.
int
60
slurmConfig.SlurmdDebug
string
"verbose"
slurmConfig.SlurmdLogFile
string
"/proc/1/fd/1"
slurmConfig.SlurmdPidFile
string
"/var/run/slurmd.pid"
slurmConfig.SlurmdPort
int
6818
slurmConfig.SlurmdSpoolDir
string
"/var/spool/slurmd"
slurmConfig.SlurmdTimeout
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node’s state to DOWN.
int
60
slurmConfig.StateSaveLocation
string
"/var/spool/slurmctld/save"
slurmConfig.SuspendTime
Nodes which remain idle or down for this number of seconds will be placed into power save mode by SuspendProgram.
string
"INFINITE"
slurmConfig.SwitchType
string
"switch/none"
slurmConfig.TCPTimeout
int
15
slurmConfig.TaskPlugin
The task plugin to use. See the Slurm documentation for more details.
Multiple comma-separated values allowed. Valid values:
  • task/affinity
  • task/cgroup
  • task/none
string
"task/cgroup,task/affinity"
slurmConfig.TaskPluginParam
Optional parameters for the task plugin. See the Slurm documentation for more details.
string
"SlurmdSpecOverride"
slurmConfig.TopologyParam[0]
string
"TopoOptional"
slurmConfig.TopologyPlugin
string
"topology/tree"
slurmConfig.TreeWidth
int
65533
slurmConfig.UnkillableStepProgram
string
"/usr/share/sunk/bin/unkillable-step.sh"
slurmConfig.UnkillableStepTimeout
int
900
slurmConfig.WaitTime
Specifies how many seconds the srun command should wait after the first task terminates before terminating all remaining tasks.
Using the --wait option on the srun command line overrides this value. The default value is 0, which disables this feature. See the slurm.conf reference for more details.
int
0
slurmConfig.cgroupConfig
The cgroup.conf value.
This is only used when ProctrackType is set to proctrack/cgroup. Note: cgroup/v2 should be used over autodetect on systems using cgroup v2.
object
CgroupPlugin: autodetect
IgnoreSystemd: yes
ConstrainCores: yes
ConstrainDevices: yes
ConstrainRAMSpace: yes
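Following the note above, a sketch of pinning the plugin on hosts that use cgroup v2. Only the overridden key is listed, on the assumption that Helm merges it with the remaining defaults.

```yaml
slurmConfig:
  cgroupConfig:
    CgroupPlugin: cgroup/v2   # instead of the default autodetect
```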

sssdContainer.enabled
Enable the sssd sidecar container.
bool
false
sssdContainer.livenessProbe
The liveness probe for the sssd container.
object
map[]
sssdContainer.readinessProbe
The readiness probe for the sssd container.
object
map[]
sssdContainer.startupProbe
The startup probe for the sssd container.
object
map[]
syncer.annotations
Additional annotations for syncer resources.
object
{}
syncer.config.slurm.poolSize
The number of connections to be maintained in the connection pool.
int
10
syncer.config.slurm.protocolVersion
The protocol version to use for communication with the Slurm controller.
string
"25_05"
syncer.config.slurm.usePersistentConnection
Use Slurm’s persistent connections for connection reuse.
bool
true
syncer.config.syncer.nodesetUpdateJobPreemption
Configuration for job preemption support. More details can be found in the changelog.
object
{
  "enabled": false,
  "method": null
}
syncer.config.syncer.nodesetUpdateJobPreemption.enabled
Enable job preemption support.
bool
false
syncer.config.syncer.nodesetUpdateJobPreemption.method
Job preemption strategy during rolling upgrades, can be set to one of the following methods:
  • partition: Preempt jobs in specific partitions. A comma-separated list of partition names can be specified in partitions.
  • qos: Preempt jobs based on their QoS. A comma-separated list of QoS names can be specified in qos.
  • time: Preempt jobs once the rolling delete condition has been on the pod for longer than the configured time limit. The limit, in seconds, can be set in timeLimit.
string
null
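A sketch of partition-based preemption. The partition names are hypothetical, and placing `partitions` alongside `method` is an assumption based on the description above; check the changelog for the exact layout.

```yaml
syncer:
  config:
    syncer:
      nodesetUpdateJobPreemption:
        enabled: true
        method: partition
        # Assumed location of the comma-separated list; names are hypothetical.
        partitions: "low-priority,scavenger"
```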
syncer.config.syncer.orphanedPodDelay
The delay to wait before deleting a pod that is no longer associated with a Slurm node.
string
"120s"
syncer.config.syncer.pollInterval
The polling interval for the Slurm API.
string
"10s"
syncer.config.syncer.qosInterruptable
The externally defined label that indicates whether a pod is interruptible.
string
"qos.coreweave.cloud/interruptable"
syncer.config.syncer.reconfigureRateLimit
The rate limit, in seconds, for Slurm reconfigure requests based on additions to NodeSlices. The value must be above 0 seconds to enable this feature. Warning: if this value is too low, scontrol reconfigure may be executed too often, especially during periods when several nodes are newly added.
string
"3600s"
syncer.config.syncer.slurmNodeCleanUp
Removes lingering Slurm nodes from the cluster after they have been removed from their associated SUNK NodeSets.
bool
true
syncer.controllerAddress
The address of the Slurm controller to connect to.
This should be the service address of the controller in host:port format.
string
""
syncer.enabled
Enable the syncer.
This is required for most functionality and should only be disabled for troubleshooting.
bool
true
syncer.hooksAPI
Configuration for the webhooks.
object
{
  "nodeRebootCondition": "PhaseState",
  "nodeRebootReason": "production-powerreset",
  "safeNodeRebootCondition": "PendingPhaseState",
  "safeNodeRebootReason": "production-powerreset",
  "waitForNodeLockedInterval": "1s",
  "waitForNodeLockedTimeout": "120s"
}
syncer.hooksAPI.nodeRebootCondition
Condition to indicate node should be rebooted.
string
"PhaseState"
syncer.hooksAPI.nodeRebootReason
The target NLCC lifecycle state associated with the nodeRebootCondition.
string
"production-powerreset"
syncer.hooksAPI.safeNodeRebootCondition
Condition to indicate node should be rebooted safely.
string
"PendingPhaseState"
syncer.hooksAPI.safeNodeRebootReason
The target NLCC lifecycle state associated with the safeNodeRebootCondition.
string
"production-powerreset"
syncer.hooksAPI.waitForNodeLockedInterval
The polling interval when checking for node locked state.
string
"1s"
syncer.hooksAPI.waitForNodeLockedTimeout
The timeout for checking node locked state.
string
"120s"
syncer.image
The image to use for the syncer.
object
repository: registry.gitlab.com/coreweave/sunk/operator
tag:

syncer.labels
Additional labels for syncer resources.
object
{}
syncer.livenessProbe
The liveness probe for the syncer container.
object
httpGet:
    path: /healthz
    port: 8081
initialDelaySeconds: 15
periodSeconds: 20

syncer.logLevel
The log level.
Uses integers or zap log level strings:
  • debug
  • info
  • warn
  • error
  • dpanic
  • panic
  • fatal
string
"info"
syncer.maxConcurrentReconciles
The maximum concurrent reconciles.
This should be adjusted based on the number of nodes and the size of jobs launched in the Slurm cluster, so that bursts of operations are handled quickly. A value of 1/10th the number of nodes in the cluster is a good starting point for small clusters. As cluster size increases, this value can be a smaller fraction of the total number of nodes in most cases. For instance, a value of 50 seems to handle a 2000-node cluster well. Being too aggressive here will bottleneck on other components such as the Kubernetes API server and the Slurm controller, which in some cases may cause errors.
int
50
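Applying the sizing guidance above to a hypothetical 200-node cluster:

```yaml
syncer:
  # Illustrative: about 1/10th of a 200-node cluster.
  maxConcurrentReconciles: 20
```

For much larger clusters the same guidance suggests a smaller fraction, such as the default of 50 for around 2000 nodes.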
syncer.nodePermissions.enabled
Enable node operations on the syncer. Currently, this allows the syncer to restart nodes.
bool
true
syncer.priorityClassName
The priority class name for the syncer pod.
string
"sunk-control-plane"
syncer.readinessProbe
The readiness probe for the syncer container.
object
httpGet:
    path: /readyz
    port: 8081
initialDelaySeconds: 5
periodSeconds: 10

syncer.resources
Resources for the syncer container.
object
limits:
    memory: 24Gi
    cpu: 16
requests:
    cpu: 4
    memory: 24Gi

syncer.startupProbe
The startup probe for the syncer container.
object
map[]
syncer.watchAllNodeSets
Watch all NodeSets in the namespace.
This overrides default behavior of only watching the NodeSets deployed with this chart release.
bool
false
syncer.watchNodeSets
The list of NodeSets to watch.
This overrides the default behavior of watching the NodeSets deployed with this chart release to instead watch this specific list. This is not used if watchAllNodeSets is set to true.
list
[]
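A sketch of watching an explicit list of NodeSets. The NodeSet names are hypothetical.

```yaml
syncer:
  watchAllNodeSets: false   # must remain false for watchNodeSets to take effect
  watchNodeSets:
    - gpu-h100   # hypothetical NodeSet name
    - cpu-only   # hypothetical NodeSet name
```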
userLookupContainer.livenessProbe
The liveness probe for the user-lookup container.
object
map[]
userLookupContainer.readinessProbe
The readiness probe for the user-lookup container.
object
map[]
userLookupContainer.startupProbe
The startup probe for the user-lookup container.
object
map[]
Last modified on April 20, 2026