Slurm Parameter Reference

Version: 0.1.0 Type: application AppVersion: 24.11

Requirements

Repository                                 Name          Version
file://../slurm-login                      slurm-login   0.1.0
oci://registry-1.docker.io/bitnamicharts   mysql         9.19.1

Parameters

Each entry lists the key and description, followed by the value type and default.

accounting.annotations
Additional annotations for accounting resources.

object
{}

accounting.config.slurmdbdExtraConfig
Multi-line string of additional slurmdbd.conf file config.

string
accounting.config.slurmdbdExtraConfig: |
  ArchiveEvents=yes
  ArchiveJobs=yes
  ArchiveResvs=yes
  ArchiveSteps=no
  ArchiveSuspend=no
  ArchiveTXN=no
  ArchiveUsage=no
  PurgeEventAfter=1month
  PurgeJobAfter=12month
  PurgeResvAfter=1month
  PurgeStepAfter=1month
  PurgeSuspendAfter=1month
  PurgeTXNAfter=12month
  PurgeUsageAfter=24month

accounting.enabled
Enable accounting.

bool
true

accounting.external.enabled
Enable external accounting instead of deploying an internal accounting instance.

bool
false

accounting.external.host
The host of the external accounting instance: IP or hostname.

string
null

accounting.external.port
The port of the external accounting instance.

string
null

accounting.external.user
The user to use to authenticate to the external accounting instance.

string
null
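
For reference, a minimal sketch of pointing the chart at an external slurmdbd rather than the built-in instance; the host and user values are illustrative, and 6819 is the conventional slurmdbd port:

Example:
accounting:
  enabled: true
  external:
    enabled: true
    host: slurmdbd.example.internal
    port: "6819"
    user: slurm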

accounting.image
The image to use for the slurmdbd deployment.

object
repository: registry.gitlab.com/coreweave/sunk/controller
tag:

accounting.labels
Additional labels for accounting resources.

object
{}

accounting.livenessProbe
The liveness probe for the slurmdbd container.

object
exec:
  command:
    - sacctmgr
    - ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1

accounting.priorityClassName
The priority class name for the accounting pod.

string
null

accounting.readinessProbe
The readiness probe for the slurmdbd container.

object
exec:
  command:
    - sacctmgr
    - ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1

accounting.replicas
The number of replicas of the accounting instance to run.

int
1

accounting.resources
Resources for the accounting container.

object
limits:
  cpu: 4
  memory: 16Gi
requests:
  cpu: 4
  memory: 16Gi

accounting.securityContext.runAsGroup
The group to run as, must match the slurm GID from the container image.

int
401

accounting.securityContext.runAsUser
The user to run as, must match the slurm UID from the container image.

int
401

accounting.startupProbe
The startup probe for the slurmdbd container.

object
null

accounting.terminationGracePeriodSeconds
The termination grace period for the accounting pod.

int
30

accounting.useExistingSecret
Use an existing secret for the accounting instance instead of creating one.
The secret name is the same as mysql.auth.existingSecret.

bool
false

accounting.volumeMounts
Additional volume mounts to apply to the accounting pod.

list
[]

accounting.volumes
Additional volumes to mount to the accounting pod.

list
[]

compute.annotations
Additional annotations for compute services only. Use compute.nodes.custom-definition.annotations to add annotations to specific node definitions instead.

object
{}

compute.autoPartition.enabled
Enable the auto partition feature.

bool
true

compute.externalClusterName
The name of an external cluster to join.
This is used when the control plane is deployed separately.

string
null

compute.generateTopology
Enable topology generation for the compute nodes in the cluster.

bool
true

compute.initialState
The initial state for the nodes when they join the slurm cluster.
This is generally drain or idle. May also be set per node definition.

string
"idle"

compute.initialStateReason
The reason for setting the initial state of the nodes to down, drained, or fail.
May also be set per node definition.

string
"Node added to the cluster for the first time"

compute.labels
Additional labels for compute services only. Use compute.nodes.custom-definition.labels to add labels to specific node definitions instead.

object
{}

compute.livenessProbe
The liveness probe for the compute slurmd container.

object
{}

compute.maxUnavailable
The maximum number of compute nodes that may be unavailable during a rolling update.
Can be a percentage or a number.

string
"10%"

compute.nodes
Multiple node definitions can be declared, but only one may set enabled: true. Node definitions can reference other definitions to include or overlay values. See the example below or the Compute Node Definitions documentation for more details.

Example:
compute:
  nodes:
    # A custom definition to be referenced by other nodes
    custom-dns:
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 127.0.0.1
    # A simple CPU-only Node that uses the custom-dns definition above
    simple-cpu:
      enabled: true
      replicas: 1
      definitions:
        # Use the custom-dns definition
        - custom-dns
      staticFeatures:
        - cpu
      dynamicFeatures:
        node.coreweave.cloud/class: {}
      image:
        repository: registry.gitlab.com/coreweave/sunk/controller-extras
      gresGpu: null
      config:
        weight: 1
      # Create a small node with 1cpu and 1g memory
      resources:
        limits:
          memory: 1Gi
          cpu: 1
        requests:
          memory: 1Gi
          cpu: 1
      tolerations:
        - key: is_cpu_compute
          operator: Exists
      volumeMounts:
        - name: ramtmp
          mountPath: /tmp
      volumes:
        - name: ramtmp
          emptyDir:
            medium: Memory
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
object
See Compute Node Definitions.

compute.partitions
Partitions to add to the cluster.

string
compute.partitions: |
  PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP

compute.plugstackConfig
Additional plug-in stack configuration items for the plugstack.conf file. Config options: https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION

list
[]

compute.pyxis.enabled
Enable the pyxis container.

bool
false

compute.pyxis.mountHome
Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory.

bool
true

compute.pyxis.plugstackOptions
Additional arguments for the pyxis plugin in plugstack.conf file config. Config Options: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration

list
[]

compute.pyxis.remapRoot
Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user.

bool
true
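
Putting the pyxis options together, a sketch that enables pyxis along with the SYS_ADMIN capability it requires (see compute.securityContext.capabilities.add below):

Example:
compute:
  pyxis:
    enabled: true
    mountHome: true
    remapRoot: true
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]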

compute.readinessProbe
The readiness probe for the compute slurmd container.

object
exec:
  command:
    - scontrol
    - show
    - slurmd
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5

compute.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information.

Example:
s6:
  packages:
    type: oneshot
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      apt -y install nginx
  nginx:
    type: longrun
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      nginx -g "daemon off;"
object
{}

compute.securityContext.capabilities.add
Add capabilities to the slurmd container.
"SYS_ADMIN" is required if using pyxis.

Example:
compute:
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]
list
[]

compute.ssh.enabled
Enable ssh to the compute nodes.

bool
false

compute.startupProbe
The startup probe for the compute slurmd container.

object
{}

compute.volumeMounts
Additional volume mounts to add to all the compute pods, also added to login pods.

list
[]

compute.volumes
Additional volumes to mount to all the compute pods, also added to login pods.

list
[]

controlPlane.enabled
Enable the Slurm control plane.
Unless splitting the deployment, this should be enabled.

bool
true

controller.annotations
Additional annotations for controller resources.

object
{}

controller.enabled
Enable the controller deployment.
This should be enabled unless a more complicated deployment is required (splitting the deployment).

bool
true

controller.image
The image to use for the controller.

object
repository: registry.gitlab.com/coreweave/sunk/controller
tag:

controller.jobSkipIds
The controller skips processing this list of Slurm JobIds.

list
[]

controller.labels
Additional labels for controller resources.

object
{}

controller.livenessProbe
The liveness probe for the controller.

object
exec:
  command:
    - scontrol
    - ping
failureThreshold: 6
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10

controller.priorityClassName
The priority class name for the controller.

string
null

controller.readinessProbe
The readiness probe for the controller.

object
{}

controller.replicas
The number of replicas of the controller to run. Currently this should be left at 1.

int
1

controller.resources
Resources for the controller container.

object
limits:
  cpu: 4
  memory: 16Gi
requests:
  cpu: 4
  memory: 16Gi

controller.securityContext.runAsGroup
The group to run as, must match the slurm GID from the container image.

int
401

controller.securityContext.runAsUser
The user to run as, must match the slurm UID from the container image.

int
401

controller.startupProbe
The startup probe for the controller.

object
{}

controller.stateVolume.size
The size of the persistent volume claim.

string
"32Gi"

controller.stateVolume.storageClassName
The storage class name to use for the volume.

string
null

controller.terminationGracePeriodSeconds
The termination grace period for the controller.

int
30

controller.volumeMounts
Additional volume mounts to apply to the controller pod.

list
[]

controller.volumes
Additional volumes to mount to the controller pod.

list
[]

controller.watch.enabled
Enable watching the Slurm configuration and triggering a reconfigure when there are changes.

bool
true

controller.watch.interval
The interval in seconds to check for changes in the Slurm configuration.

int
60

controller.watch.livenessProbe
The liveness probe for the watch container.

object
null

controller.watch.readinessProbe
The readiness probe for the watch container.

object
null

controller.watch.startupProbe
The startup probe for the watch container.

object
null

controller.watch.topologyFileInterval
The interval in seconds to check for changes specifically in the topology.conf file. Warning: if this value is too low, scontrol reconfigure may be executed too often, especially during periods when several nodes are newly added.

int
3600
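
A sketch combining the watch settings above at their chart defaults; the warning about scontrol reconfigure applies if the last value is lowered:

Example:
controller:
  watch:
    enabled: true
    interval: 60
    topologyFileInterval: 3600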

directoryService.debugLevel
A bit mask of SSSD debug levels to enable.

int
0x01F0

directoryService.directories
The directory services to configure. Examples for several directory providers follow.

Google Secure LDAP example:
directories:
  - name: google-example.com
    enabled: true
    ldapUri: ldaps://ldap.google.com:636
    user:
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    overrideHomeDir: /mnt/nvme/home/%u
    ldapsCert: google-ldaps-cert
    schema: rfc2307bis

CoreWeave LDAP example:
directories:
  - name: coreweave.cloud
    enabled: true
    ldapUri: ldap://openldap
    user:
      bindDn: cn=admin,dc=coreweave,dc=cloud
      searchBase: dc=coreweave,dc=cloud
      existingSecret: bind-user-sssd-config
      canary: admin
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    schema: rfc2307

Authentik example:
directories:
  - name: coreweave.cloud
    enabled: true
    ldapUri: ldap://authentik-outpost-ldap-outpost
    user:
      bindDn: cn=ldapsvc,dc=coreweave,dc=cloud
      searchBase: dc=coreweave,dc=cloud
      existingSecret: bind-user-sssd-config
      canary: ldapsvc
    startTLS: true
    userObjectClass: user
    groupObjectClass: group
    userNameAttr: cn
    groupNameAttr: cn
    schema: rfc2307bis

Active Directory example:
directories:
  - name: contoso.com
    enabled: true
    ldapUri: ldap://domaincontroller.tenant-my-tenant.coreweave.cloud
    user:
      bindDn: CN=binduser,CN=Users,DC=contoso,DC=com
      searchBase: DC=contoso,DC=com
      existingSecret: bind-user-sssd-config
      canary: binduser
    defaultShell: "/bin/bash"
    fallbackHomeDir: "/home/%u"
    schema: AD

list

directoryService.directories[0].additionalConfig
Multi-line string of additional arbitrary configuration per domain for SSSD.

Example:
additionalConfig: |
  ldap_foo = bar
string
null

directoryService.directories[0].defaultShell
The default user shell.

string
"/bin/bash"

directoryService.directories[0].enabled
Enable the directory service.

bool
false

directoryService.directories[0].fallbackHomeDir
The fallback user home directory.

string
"/home/%u"

directoryService.directories[0].ignoreGroupMembers
This overrides the SSSD option of the same name.
If set to true, SSSD only retrieves information about the group objects
themselves, not their members, providing a significant performance boost.
If omitted, defaults to true.

bool
null

directoryService.directories[0].ldapUri
The LDAP URI to use for the directory service.
Example: ldap://YOUR_LDAP_IP
For Google Secure LDAP, use: ldaps://ldap.google.com:636

string
null

directoryService.directories[0].ldapsCert
Name of existing TLS certificate for LDAP-S.

Example:
kubectl create secret tls google-ldaps-cert \
  --cert=Google_2025_08_24_55726.crt \
  --key=Google_2025_08_24_55726.key
string
null

directoryService.directories[0].name
Name of the directory service.
The primary domain should always be named: default

string
"default"

directoryService.directories[0].overrideGidAttr
Override the default schema LDAP attribute that corresponds to the user's primary group id.
Example: posixGid

string
null

directoryService.directories[0].overrideHomeDir
Override the homeDirectory attribute from LDAP with a provided path.
Example: /mnt/nvme/home/%u

string
null

directoryService.directories[0].overrideUidAttr
Override the default schema LDAP attribute that corresponds to the user's id.
Example: posixUid

string
null

directoryService.directories[0].overrideUserNameAttr
Override the default schema LDAP attribute that corresponds to the user's login name.
Example: employeeNumber

string
null

directoryService.directories[0].schema
The desired LDAP schema for the directory service. Valid values:

  • AD
  • POSIX
  • rfc2307bis
Note: For Google Secure LDAP, use rfc2307bis.

string
"AD"

directoryService.directories[0].user.bindDn
The LDAP bind DN to use for the directory service.
Where bindDn is not required (e.g. Google Secure LDAP), only supply user.canary.
Example: cn=Admin,ou=Users,ou=CORP,dc=corp,dc=example,dc=com

string
null

directoryService.directories[0].user.canary
The username to lookup to confirm LDAP is working.

string
null

directoryService.directories[0].user.existingSecret
Name of an existing secret containing an SSSD configuration snippet with the ldap_default_authtok set for this domain.

string
null

directoryService.directories[0].user.existingSecretFileName
The name of the file in the existing secret that contains the ldap passwords.

string
"ldap-password.conf"

directoryService.directories[0].user.groupSearchBase
The LDAP group search base to use for the directory service.
Example: ou=groups,dc=example,dc=com

string
null

directoryService.directories[0].user.password
The password to use for the directory service lookups.

string
null

directoryService.directories[0].user.searchBase
The LDAP search base to use for the directory service.
Example: dc=corp,dc=example,dc=com

string
null

directoryService.negativeCacheTimeout
Negative caching value (in seconds).
Determines how long an invalid entry will be cached before asking LDAP again. This improves the directory listing time when a primary GID cannot be found.

string
"600"

directoryService.sudoGroups
List of Unix groups from all directories with sudo privileges.
Group names are fully-qualified for additional directories, but not for the default directory (e.g. "group1" instead of "group1@<directory domain>").

list
[]
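
A sketch with one unqualified group from the default directory and one fully-qualified group from an additional directory; both group names are illustrative:

Example:
directoryService:
  sudoGroups:
    - group1
    - slurm-ops@corp.example.com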

directoryService.watchInterval
The interval in seconds to check for changes in the SSSD configuration.

int
60

global.annotations
Additional annotations to apply to all resources.

object
{}

global.dnsConfig.additionalSearches
A list of namespaces to add to the list of DNS searches. These additional searches extend hostname lookup in the control-plane, compute, and login pods. Default DNS searches:

  • name-compute.namespace.svc.cluster.local
  • slurm_cluster_name-controller.namespace.svc.cluster.local

list
[]
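
A sketch adding one extra search domain; the namespace name is illustrative:

Example:
global:
  dnsConfig:
    additionalSearches:
      - tenant-example.svc.cluster.local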

global.imagePullPolicy
The image pull policy for all containers.

string
"IfNotPresent"

global.labels
Additional labels to apply to all resources.

object
{}

global.nodeSelector.affinity
Sets a global affinity for all Slurm node pods. This can be overridden for specific Slurm nodes in their configuration.

object
null

global.volumeMounts
The list of volume mounts to apply to all compute, controller, accounting, and login pods.

list
[]

global.volumes
The list of volumes to mount to all compute, controller, accounting, and login pods.

list
[]

imagePullSecrets
The list of secrets used to access images in a private registry.

list
[]
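
A sketch assuming the standard Kubernetes list-of-names format; the secret name is illustrative:

Example:
imagePullSecrets:
  - name: my-registry-credentials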

jwt.existingSecret
The name of an existing secret containing the JWT private key, otherwise the chart will generate one.

string
null

login.annotations
Additional annotations.

object
{}

login.automountServiceAccountToken
Automatically mount the service account token into the login pod.

bool
false

login.containers
Additional sidecar containers to add to the login pod.

list
[]

login.enabled
Enable the login nodes

bool
true

login.env
Additional environment variables to pass to the sshd container.

list
[]

login.hostAliases
Provides Pod-level override of hostname resolution when DNS and other options are not applicable in login pods. See Adding entries to Pod /etc/hosts with HostAliases for more information.

list
[]

login.image
The image to use for the login node.

object
repository: registry.gitlab.com/coreweave/sunk/controller-extras
tag:

login.individualResources
Resources for the slurm-login pod sshd container.

object
limits:
  memory: 2Gi
requests:
  cpu: 500m
  memory: 300Mi

login.labels
Additional labels.

object
{}

login.nodeSelector.affinity
The affinity for the login nodes. This overrides the value of global.nodeSelector.affinity.

object
null

login.priorityClassName
The priority class name for the login pod.

string
null

login.replicas
The number of replicas of the login node.
When running more than one, a pod-specific service is created for each one in addition to the main service.

int
1

login.resources
Resources for the login node sshd container.

object
limits:
  memory: 8Gi
requests:
  cpu: 4
  memory: 8Gi

login.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information.

Example:
s6:
  packages:
    type: oneshot
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      apt -y install nginx
  nginx:
    type: longrun
    timeoutUp: 0
    timeoutDown: 0
    script: |
      #!/usr/bin/env bash
      nginx -g "daemon off;"
object
{}

login.service.additionalPorts
Additional port definitions to expose.

Example:
additionalPorts:
  - name: eternal-shell
    port: 2022
    targetPort: 20222 # optional
    protocol: TCP # optional
list
[]

login.service.enabled
Enable the creation of service(s) for login pods.

bool
true

login.service.externalTrafficPolicy
The external traffic policy.

string
"Local"

login.service.loadBalancerClass
The load balancer class to use for the login services.

string
null

login.service.metadata.0.annotations
Additional annotations to apply to the first login service (0).

object
{}

login.service.metadata.0.labels
Additional labels to apply to the first login service (0).

object
{}

login.service.metadata.common.annotations
Additional annotations to apply to the common login service.

object
{}

login.service.metadata.common.labels
Additional labels to apply to the common login service.

object
{}

login.service.metadata.global.annotations
Additional annotations to apply to all login services.

object
{}

login.service.metadata.global.labels
Additional labels to apply to all login services.

object
{}

login.service.type
The type of service to create. This defaults to LoadBalancer for cloud deployments. For development and test systems without an external load balancer to handle the service routing, such as when deploying on kind (Kubernetes IN Docker), this may be set to ClusterIP.

string
"LoadBalancer"

login.serviceAccountName
The service account name to use for the login pod.

string
"default"

login.sshKeyVolume.accessModes
The access mode for the storage. If scaling login beyond 1 replica, this must be ReadWriteMany. In a development setting with a volume provider that doesn't support ReadWriteMany, such as kind (Kubernetes IN Docker), this may be set to ReadWriteOnce.

list
[
  "ReadWriteMany"
]
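
A sketch for a development cluster such as kind, combining the ClusterIP service type described under login.service.type above with the ReadWriteOnce access mode mentioned in this entry:

Example:
login:
  service:
    type: ClusterIP
  sshKeyVolume:
    accessModes:
      - ReadWriteOnce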

login.sshKeyVolume.enabled
Enable the ssh key volume to allow keys to be mounted and persisted in the login pod.
If this is disabled, the host keys for the login pod are regenerated on each container restart.

bool
true

login.sshKeyVolume.size
The size of the persistent volume claim.

string
"1Gi"

login.sshKeyVolume.storageClassName
The storage class name to use for the volume.

string
null

login.sshdConfig
Additional sshd configuration to add to the login pod.

Example:
sshdConfig: |
  PasswordAuthentication no
string
null

login.sshdLivenessProbe.config
The liveness probe for the login sshd container.

object
failureThreshold: 10
initialDelaySeconds: 10
periodSeconds: 5
tcpSocket:
  port: 22

login.sshdLivenessProbe.enabled
Whether the liveness probe for the login sshd container is enabled.

bool
false

login.sshdReadinessProbe.config
The readiness probe for the login sshd container.

object
{}

login.sshdReadinessProbe.enabled
Whether the readiness probe for the login sshd container is enabled.

bool
false

login.sshdStartupProbe.config
The startup probe for the login sshd container.

object
{}

login.sshdStartupProbe.enabled
Whether the startup probe for the login sshd container is enabled.

bool
false

login.terminationGracePeriodSeconds
The termination grace period for the login pod.

int
30

login.updateStrategy
The update strategy for the login node. The default type is RollingUpdate.

object
{}

login.volumeMounts
Additional volume mounts to apply to the login pod.

list
[]

login.volumes
Additional volumes to add to the login pod.

Example:
volumes:
  - name: cache-vol
    emptyDir:
      medium: Memory
list
[]

munge.args
The additional arguments to pass to the munge container. The defaults run Munge with 10 threads instead of 2.

list
[
  "--num-threads",
  "10"
]

munge.livenessProbe
The liveness probe for the munge container.

object
{}

munge.readinessProbe
The readiness probe for the munge container.

object
{}

munge.resources
Resources for the munge container.

object
limits:
  cpu: 1
  memory: 2Gi

munge.securityContext.runAsGroup
The group to run as, must match the munge GID from the container image.

int
400

munge.securityContext.runAsUser
The user to run as, must match the munge UID from the container image.

int
400

munge.startupProbe
The startup probe for the munge container.

object
{}

mysql
Options for the Bitnami MySQL chart; uses Bitnami default values.

object
See Bitnami default values.

rest.annotations
Additional annotations for REST API resources.

object
{}

rest.args
The additional arguments to pass to the rest container. Defaults enable debug logging and load only the most recent OpenAPI plugins.

list
[
  "-vv",
  "-sslurmdbd,slurmctld",
  "-dv0.0.40"
]

rest.enabled
Enable the REST API deployment.
This is optional and should be disabled for most use cases.

bool
false

rest.env
The additional environment variables to pass to the rest container.

list
[
  {
    "name": "SLURMRESTD_JSON",
    "value": "compact"
  }
]

rest.image
The image to use for the REST API deployment.

object
repository: registry.gitlab.com/coreweave/sunk/controller
tag:

rest.labels
Additional labels for REST API resources.

object
{}

rest.livenessProbe
The liveness probe for the rest container.

object
tcpSocket:
  port: slurmrestd
failureThreshold: 2
periodSeconds: 10

rest.priorityClassName
The priority class name for the rest pod.

string
null

rest.readinessProbe
The readiness probe for the rest container.

object
tcpSocket:
  port: slurmrestd
periodSeconds: 5
failureThreshold: 1

rest.replicas
The number of replicas of the rest pod to run.
In most production environments this should be set to a minimum of 2 to provide HA.

int
1

rest.resources
Resources for the slurmrestd container.
These defaults are appropriate for small and medium-sized clusters.

object
limits:
  cpu: 2
  memory: 8Gi
requests:
  cpu: 2
  memory: 8Gi

rest.securityContext.runAsGroup
The group to run as, must match the slurm GID from the container image.

int
401

rest.securityContext.runAsUser
The user to run as, must match the slurm UID from the container image.

int
401

rest.startupProbe
The startup probe for the rest container.

object
tcpSocket:
  port: slurmrestd
failureThreshold: 20
periodSeconds: 2

rest.terminationGracePeriodSeconds
The termination grace period for the rest pod.

int
5

scheduler.annotations
Additional annotations for scheduler resources.

object
{}

scheduler.config.scheduler.gpuTypes
Mapping of Kubernetes GPU types to Slurm GPU types. The keys are the GPU types requested during scheduling through the node affinity key "gpu.nvidia.com/class", and the values are the corresponding gres GPU types in Slurm. This gets added to a job's description.

map
{
  "A100_NVLINK_80GB": "a100",
  "H100_NVLINK_80GB": "h100"
}
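
A sketch extending the default mapping; the L40S entry is hypothetical and shown only to illustrate adding a class:

Example:
scheduler:
  config:
    scheduler:
      gpuTypes:
        A100_NVLINK_80GB: a100
        H100_NVLINK_80GB: h100
        L40S: l40s # hypothetical additional mapping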

scheduler.config.scheduler.pollInterval
The polling interval for the Slurm API.

string
"10s"

scheduler.config.scheduler.terminationOffset
Offset for the termination grace period, to account for communication delays, etc.

string
"5s"

scheduler.controllerAddress
The address of the Slurm controller to connect to.
This should be the service address of the controller in host:port format.

string
""

scheduler.enabled
Enable the scheduler.
To schedule Kubernetes pods on the Slurm cluster nodes, this must be enabled.

bool
false

scheduler.hooksAPI
Configuration for the webhooks.

object
{
  "waitForPodDeletionInterval": "1s"
}

scheduler.hooksAPI.waitForPodDeletionInterval
The polling interval when checking for pod deletion.

string
"1s"

scheduler.image
The image to use for the scheduler.

object
repository: registry.gitlab.com/coreweave/sunk/operator
tag:

scheduler.labels
Additional labels for scheduler resources.

object
{}

scheduler.livenessProbe
The liveness probe for the scheduler container.

object
httpGet:
  path: /healthz
  port: 8081
initialDelaySeconds: 15
periodSeconds: 20

scheduler.logLevel
The log level.
Uses integers or zap log level strings:

  • debug
  • info
  • warn
  • error
  • dpanic
  • panic
  • fatal

string
"info"

scheduler.maxConcurrentReconciles
The maximum concurrent reconciles.
This should be adjusted based on the volume of pods using the scheduler, to handle bursts of operations quickly. The size of both the Slurm and Kubernetes clusters will impact this, but less so than for the syncer. The driving factor here tends to be the pod volume and associated Slurm jobs more than anything else. Using the same value as the syncer is a rather conservative starting point in many use cases.

int
5

scheduler.name
The name of the scheduler used to select the scheduler during pod creation.
When not set, the name defaults to <namespace>-<release>-scheduler, based on the namespace and release name.

string
null

scheduler.priorityClassName
The priority class name for the scheduler pod.

string
null

scheduler.readinessProbe
The readiness probe for the scheduler container.

object
httpGet:
  path: /readyz
  port: 8081
initialDelaySeconds: 5
periodSeconds: 10

scheduler.resources
Resources for the scheduler container.

object
limits:
  cpu: "1"
  memory: 1Gi
requests:
  cpu: 200m
  memory: 1Gi

scheduler.scope.namespaces
The list of namespaces to scope the scheduler to.
Only used when scope.type is set to namespace. Namespaces other than the release namespace will need role bindings created.

list
[.Release.Namespace]

scheduler.scope.type
The type can be cluster or namespace.

string
"namespace"

scheduler.startupProbe
The startup probe for the scheduler container.

object
{}

secretJob.annotations
Additional annotations for secret Job resources.

object
{}

secretJob.labels
Additional labels for secret Job resources.

object
{}

secretJob.nodeSelector.affinity
The affinity for the secret job. This overrides the value of global.nodeSelector.affinity.

object
null

secretJob.priorityClassName
The priority class name for the secret job pod.

string
null

secretJob.resources
Resources for the secret job container.

object
limits:
  cpu: 500m
  memory: 500Mi
requests:
  cpu: 200m
  memory: 100Mi

secretJob.tolerations
The tolerations for the secret job.

list
null

slurm-login
Configure individual login nodes via the slurm-login subchart. Below is an example showing some of the key parameters of the subchart; see the subchart docs for all parameters.

Example:
slurm-login:
  enable: true
  directoryCache:
    # select users from two groups
    selectGroups: ["slurm-researchers", "slurm-ops"]
    # poll every minute (default 90s)
    interval: 1m
  # Google Secure LDAP
  directoryService:
    directories:
      - name: google-example.com
        enabled: true
        ldapUri: ldaps://ldap.google.com:636
        user:
        defaultShell: "/bin/bash"
        fallbackHomeDir: "/home/%u"
        overrideHomeDir: /mnt/nvme/home/%u
        ldapsCert: google-ldaps-cert
        schema: rfc2307bis
object
See default values in the slurm-login subchart.

slurmConfig.cgroupConfig
The cgroup.conf value.
This is only used when procTrackType is set to proctrack/cgroup.

string
slurmConfig.cgroupConfig: |
  CgroupPlugin=autodetect
  IgnoreSystemd=yes
  ConstrainCores=yes
  ConstrainDevices=yes
  ConstrainRAMSpace=yes

slurmConfig.defMemPerCPU
The default memory per CPU in megabytes.
Sets the slurm.conf parameter of the default real memory size available per usable allocated CPU in megabytes. This value is used when the --mem-per-cpu option is not specified on the srun command line.

int
4096

slurmConfig.extraConfig
Multi-line string of free text configuration to be appended to slurm.conf.

Example:
extraConfig: |
  # Config to be appended to slurm.conf
  # Can be multiple lines
string
null

slurmConfig.inactiveLimit
Terminate job allocation commands, such as srun or salloc, that are unresponsive longer than this interval in seconds. See the slurm.conf reference for more details.

int
0

slurmConfig.killWait
The interval in seconds between the SIGTERM and SIGKILL signals given to a job's processes upon reaching its time limit. See the slurm.conf reference for more details.

int
30

slurmConfig.poolSize
The number of connections to be maintained in the connection pool.

int
10

slurmConfig.protocolVersion
The protocol version to use for communication with the Slurm controller.

string
"24_11"

slurmConfig.selectTypeParameters
The values to use for the parameters of the select/cons_tres plugin.
Allowed values depend on the configured value of SelectType. See the slurm.conf reference for more details.

string
"CR_Core"

slurmConfig.slurmCtld.accountingStorageEnforce
Controls what level of association-based enforcement to impose on job submissions.
Multiple comma-separated values allowed. Valid options are any combination of:

  • associations
  • limits
  • nojobs
  • nosteps
  • qos
  • safe
  • wckeys
Use all to impose everything except nojobs and nosteps, which must be requested separately. See the Slurm documentation for more details.

string
"qos,limits"

slurmConfig.slurmCtld.additionalParameters
The list of additional parameters to pass to slurmCtld. See the Slurm documentation for possible values.

list
- idle_on_node_suspend
- node_reg_mem_percent=95

slurmConfig.slurmCtld.etcConfigMap
A ConfigMap with keys mapping to files in /etc/slurm on the controller only.
This ConfigMap must not contain:

  • slurm.conf
  • plugstack.conf
  • gres.conf
  • cgroup.conf
  • topology.conf

string
null

slurmConfig.slurmCtld.jobSubmitPlugins
The job submit plugins to use.
Multiple comma-separated values allowed.

string
null

slurmConfig.slurmCtld.procTrackType
The plugin to be used for process tracking on a job step basis. See the Slurm documentation for more details.
Valid values:

  • proctrack/linuxproc
  • proctrack/cgroup

string
"proctrack/cgroup"

slurmConfig.slurmCtld.taskPlugin
The task plugin to use. See the Slurm documentation for more details.
Multiple comma-separated values allowed. Valid values:

  • task/affinity
  • task/cgroup
  • task/none

string
"task/none"

slurmConfig.slurmCtld.timeout
The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533.

int
60

slurmConfig.slurmd.epilogConfigMap
The name, or a list of names, of configmaps containing epilog scripts.

string | list
[]

slurmConfig.slurmd.prologConfigMap
The name, or a list of names, of configmaps containing prolog scripts.

string | list
[]
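
A sketch of both accepted forms, using a single name for the prolog and a list for the epilog; the configmap names are illustrative:

Example:
slurmConfig:
  slurmd:
    prologConfigMap: my-prolog-scripts
    epilogConfigMap:
      - epilog-scripts-a
      - epilog-scripts-b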

slurmConfig.slurmd.suspendTime
Nodes which remain idle or down for this number of seconds will be placed into power save mode by SuspendProgram.

string
"INFINITE"

slurmConfig.slurmd.timeout
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN.

int
60

slurmConfig.usePersistentConnection
Use Slurm's persistent connections for connection reuse.

bool
true

slurmConfig.waitTime
Specifies how many seconds the srun command should wait after the first task terminates before terminating all remaining tasks.
Using the --wait option on the srun command line overrides this value. The default value is 0, which disables this feature. See the slurm.conf reference for more details.

int
0

sssdContainer.livenessProbe
The liveness probe for the sssd container.

object
{}

sssdContainer.readinessProbe
The readiness probe for the sssd container.

object
{}

sssdContainer.startupProbe
The startup probe for the sssd container.

object
{}

syncer.annotations
Additional annotations for syncer resources.

object
{}

syncer.config.syncer.orphanedPodDelay
The delay to wait before deleting a pod that is no longer associated with a Slurm node.

string
"120s"

syncer.config.syncer.pollInterval
The polling interval for the Slurm API.

string
"10s"

syncer.config.syncer.qosInterruptable
The externally defined label indicating whether a pod is interruptible.

string
"qos.coreweave.cloud/interruptable"

syncer.config.syncer.slurmNodeCleanUp
Enable cleanup of lingering Slurm nodes.

bool
false

syncer.controllerAddress
The address of the Slurm controller to connect to.
This should be the service address of the controller in host:port format.

string
""

syncer.enabled
Enable the syncer.
This is required for most functionality and should only be disabled for troubleshooting.

bool
true

syncer.hooksAPI
Configuration for the webhooks.

object
{
  "nodeRebootCondition": "PhaseState",
  "nodeRebootReason": "production-powerreset",
  "safeNodeRebootCondition": "PendingPhaseState",
  "safeNodeRebootReason": "production-powerreset",
  "waitForNodeLockedInterval": "1s",
  "waitForNodeLockedTimeout": "120s"
}

syncer.hooksAPI.nodeRebootCondition
Condition to indicate node should be rebooted.

string
"PhaseState"

syncer.hooksAPI.nodeRebootReason
The target NLCC lifecycle state associated with the nodeRebootCondition.

string
"production-powerreset"

syncer.hooksAPI.safeNodeRebootCondition
Condition to indicate node should be rebooted safely.

string
"PendingPhaseState"

syncer.hooksAPI.safeNodeRebootReason
The target NLCC lifecycle state associated with the safeNodeRebootCondition.

string
"production-powerreset"

syncer.hooksAPI.waitForNodeLockedInterval
The polling interval when checking for node locked state.

string
"1s"

syncer.hooksAPI.waitForNodeLockedTimeout
The timeout for checking node locked state.

string
"120s"

syncer.image
The image to use for the syncer.

object
repository: registry.gitlab.com/coreweave/sunk/operator
tag:

syncer.labels
Additional labels for syncer resources.

object
{}

syncer.livenessProbe
The liveness probe for the syncer container.

object
httpGet:
  path: /healthz
  port: 8081
initialDelaySeconds: 15
periodSeconds: 20

syncer.logLevel
The log level.
Uses integers or zap log level strings:

  • debug
  • info
  • warn
  • error
  • dpanic
  • panic
  • fatal

string
"info"

syncer.maxConcurrentReconciles
The maximum concurrent reconciles.
This should be adjusted based on the number of nodes and the size of jobs launched in the Slurm cluster, to handle bursts of operations quickly. A value 1/10th the number of nodes in the cluster is a good starting point for small clusters. As cluster size increases, this value can be a smaller fraction of the total number of nodes in most cases; for instance, a value of 50 handles a 2000-node cluster well. Being too aggressive here will bottleneck on other components, such as the Kubernetes API server and the Slurm controller, which in some cases may cause errors.

int
10

syncer.nodePermissions.enabled
Enable node operations on the syncer; currently this allows restarting nodes.

bool
true

syncer.priorityClassName
The priority class name for the syncer pod.

string
null

syncer.readinessProbe
The readiness probe for the syncer container.

object
httpGet:
  path: /readyz
  port: 8081
initialDelaySeconds: 5
periodSeconds: 10

syncer.resources
Resources for the syncer container.

object
limits:
  cpu: "2"
  memory: 1Gi
requests:
  cpu: 200m
  memory: 1Gi

syncer.startupProbe
The startup probe for the syncer container.

object
{}

syncer.watchAllNodeSets
Watch all NodeSets in the namespace.
This overrides the default behavior of only watching the NodeSets deployed with this chart release.

bool
false

syncer.watchNodeSets
The list of NodeSets to watch.
This overrides the default behavior of watching the NodeSets deployed with this chart release to instead watch this specific list. This is not used if watchAllNodeSets is set to true.

list
[]
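
A sketch watching two specific NodeSets; simple-cpu matches the node definition example earlier, and gpu-workers is illustrative:

Example:
syncer:
  watchAllNodeSets: false
  watchNodeSets:
    - simple-cpu
    - gpu-workers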

userLookupContainer.livenessProbe
The liveness probe for the user-lookup container.

object
{}

userLookupContainer.readinessProbe
The readiness probe for the user-lookup container.

object
{}

userLookupContainer.startupProbe
The startup probe for the user-lookup container.

object
{}