accounting.annotations
Additional annotations for accounting resources. | object | |
accounting.config.slurmdbdExtraConfig
Extra configuration for slurmdbd.conf. | string | accounting.config.slurmdbdExtraConfig: |
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
|
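For reference, a values.yaml override of these defaults might look like the sketch below (the shortened retention periods are illustrative, not chart defaults):
accounting:
  config:
    slurmdbdExtraConfig: |
      ArchiveJobs=yes
      PurgeJobAfter=6month
      PurgeUsageAfter=12month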
accounting.enabled
Enable accounting. | bool | |
accounting.external.enabled
Enable external accounting instead of deploying an internal accounting instance. | bool | |
accounting.external.host
The host of the external accounting instance: IP or hostname. | string | |
accounting.external.port
The port of the external accounting instance. | string | |
accounting.external.user
The user to use to authenticate to the external accounting instance. | string | |
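Taken together, a minimal sketch of pointing the chart at an external accounting instance (the host, port, and user values are hypothetical; 6819 is the conventional slurmdbd port):
accounting:
  enabled: true
  external:
    enabled: true
    host: slurmdbd.example.internal
    port: "6819"
    user: slurm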
accounting.image
The image to use for the slurmdbd deployment. | object | repository: registry.gitlab.com/coreweave/sunk/controller
tag:
|
accounting.labels
Additional labels for accounting resources. | object | |
accounting.livenessProbe
The liveness probe for the slurmdbd container. | object | exec:
command:
- sacctmgr
- ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1
|
accounting.priorityClassName
The priority class name for the accounting pod. | string | |
accounting.readinessProbe
The readiness probe for the slurmdbd container. | object | exec:
command:
- sacctmgr
- ping
initialDelaySeconds: 15
periodSeconds: 10
failureThreshold: 5
successThreshold: 1
|
accounting.replicas
The number of replicas of the accounting instance to run. | int | |
accounting.resources
Resources for the accounting container. | object | limits:
cpu: 4
memory: 16Gi
requests:
cpu: 4
memory: 16Gi
|
accounting.securityContext.runAsGroup
The group to run as; must match the slurm GID from the container image. | int | |
accounting.securityContext.runAsUser
The user to run as; must match the slurm UID from the container image. | int | |
accounting.startupProbe
The startup probe for the slurmdbd container. | object | |
accounting.terminationGracePeriodSeconds
The termination grace period for the accounting pod. | int | |
accounting.useExistingSecret
Use an existing secret for the accounting instance instead of creating one. The secret name is the same as mysql.auth.existingSecret. | bool | |
accounting.volumeMounts
Additional volume mounts to apply to the accounting pod. | list | |
accounting.volumes
Additional volumes to mount to the accounting pod. | list | |
compute.annotations
Additional annotations for compute services only. Use compute.nodes.custom-definition.annotations to add annotations to specific node definitions instead. | object | |
compute.autoPartition.enabled
Enable the auto-partition feature. | bool | |
compute.externalClusterName
The name of an external cluster to join. This is used when the control plane is deployed separately. | string | |
compute.generateTopology
Enable topology generation for the compute nodes in the cluster. | bool | |
compute.initialState
The initial state for the nodes when they join the Slurm cluster. This is generally drain or idle. May also be set per node definition. | string | |
compute.initialStateReason
The reason for setting the initial state of the nodes to down, drained, or fail. May also be set per node definition. | string | "Node added to the cluster for the first time"
|
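A sketch of setting the initial state in values.yaml (the reason text is illustrative):
compute:
  initialState: drain
  initialStateReason: "Awaiting burn-in validation"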
compute.labels
Additional labels for compute services only. Use compute.nodes.custom-definition.labels to add labels to specific node definitions instead. | object | |
compute.livenessProbe
The liveness probe for the compute slurmd container. | object | |
compute.maxUnavailable
The maximum unavailability of the compute nodes during a rolling update. Can be a percentage or a number. | string | |
compute.nodes
Multiple node definitions can be declared, but only one may have enabled: true. Node definitions can reference other definitions to include or overlay values. See the example below or the Compute Node Definitions documentation for more details. Example: compute:
nodes:
# A custom definition to be referenced by other nodes
custom-dns:
dnsPolicy: "None"
dnsConfig:
nameservers:
- 127.0.0.1
# A simple CPU-only Node that uses the custom-dns definition above
simple-cpu:
enabled: true
replicas: 1
definitions:
# Use the custom-dns definition
- custom-dns
staticFeatures:
- cpu
dynamicFeatures:
node.coreweave.cloud/class: {}
image:
repository: registry.gitlab.com/coreweave/sunk/controller-extras
gresGpu: null
config:
weight: 1
# Create a small node with 1cpu and 1g memory
resources:
limits:
memory: 1Gi
cpu: 1
requests:
memory: 1Gi
cpu: 1
tolerations:
- key: is_cpu_compute
operator: Exists
volumeMounts:
- name: ramtmp
mountPath: /tmp
volumes:
- name: ramtmp
emptyDir:
medium: Memory
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
| object | See Compute Node Definitions. |
compute.partitions
Partitions to add to the cluster. | string | compute.partitions: |
PartitionName=all Nodes=ALL Default=YES MaxTime=INFINITE State=UP
|
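A sketch of declaring multiple partitions (partition and node names are hypothetical; each line follows standard slurm.conf PartitionName syntax):
compute:
  partitions: |
    PartitionName=cpu Nodes=cpu-[001-004] Default=YES MaxTime=INFINITE State=UP
    PartitionName=gpu Nodes=gpu-[001-008] MaxTime=INFINITE State=UP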
compute.plugstackConfig
Additional plug-in stack configuration items for the plugstack.conf file. Configuration options: https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION | list | |
compute.pyxis.enabled
Enable the pyxis container. | bool | |
compute.pyxis.mountHome
Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory. | bool | |
compute.pyxis.plugstackOptions
Additional arguments for the pyxis plugin in the plugstack.conf file. Configuration options: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration | list | |
compute.pyxis.remapRoot
Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user. | bool | |
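Putting the pyxis options together, a values.yaml sketch (note that the SYS_ADMIN capability described under compute.securityContext.capabilities.add is required for pyxis):
compute:
  pyxis:
    enabled: true
    mountHome: true
    remapRoot: false
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]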
compute.readinessProbe
The readiness probe for the compute slurmd container. | object | exec:
command:
- scontrol
- show
- slurmd
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
|
compute.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information. Example: s6:
packages:
type: oneshot
timeoutUp: 0
timeoutDown: 0
script: |
#!/usr/bin/env bash
apt -y install nginx
nginx:
type: longrun
timeoutUp: 0
timeoutDown: 0
script: |
#!/usr/bin/env bash
nginx -g "daemon off;"
| object | |
compute.securityContext.capabilities.add
Add capabilities to the slurmd container. "SYS_ADMIN" is required if using pyxis. Example: compute:
securityContext:
capabilities:
add: ["SYS_ADMIN"]
| list | |
compute.ssh.enabled
Enable SSH to the compute nodes. | bool | |
compute.startupProbe
The startup probe for the compute slurmd container. | object | |
compute.volumeMounts
Additional volume mounts to add to all the compute pods, also added to login pods. | list | |
compute.volumes
Additional volumes to mount to all the compute pods, also added to login pods. | list | |
controlPlane.enabled
Enable the Slurm control plane. Unless splitting the deployment, this should be enabled. | bool | |
controller.annotations
Additional annotations for controller resources. | object | |
controller.enabled
Enable the controller deployment. This should be enabled unless a more complicated deployment is required (splitting the deployment). | bool | |
controller.image
The image to use for the controller. | object | repository: registry.gitlab.com/coreweave/sunk/controller
tag:
|
controller.jobSkipIds
The controller skips processing this list of Slurm JobIds. | list | |
controller.labels
Additional labels for controller resources. | object | |
controller.livenessProbe
The liveness probe for the controller. | object | exec:
command:
- scontrol
- ping
failureThreshold: 6
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
|
controller.priorityClassName
The priority class name for the controller. | string | |
controller.readinessProbe
The readiness probe for the controller. | object | |
controller.replicas
The number of replicas of the controller to run; currently this should be left at 1. | int | |
controller.resources
Resources for the controller container. | object | limits:
cpu: 4
memory: 16Gi
requests:
cpu: 4
memory: 16Gi
|
controller.securityContext.runAsGroup
The group to run as; must match the slurm GID from the container image. | int | |
controller.securityContext.runAsUser
The user to run as; must match the slurm UID from the container image. | int | |
controller.startupProbe
The startup probe for the controller. | object | |
controller.stateVolume.size
The size of the persistent volume claim. | string | |
controller.stateVolume.storageClassName
The storage class name to use for the volume. | string | |
controller.terminationGracePeriodSeconds
The termination grace period for the controller. | int | |
controller.volumeMounts
Additional volume mounts to apply to the controller pod. | list | |
controller.volumes
Additional volumes to mount to the controller pod. | list | |
controller.watch.enabled
Enable watching the Slurm configuration and triggering a reconfigure when there are changes. | bool | |
controller.watch.interval
The interval in seconds to check for changes in the Slurm configuration. | int | |
controller.watch.livenessProbe
The liveness probe for the watch container. | object | |
controller.watch.readinessProbe
The readiness probe for the watch container. | object | |
controller.watch.startupProbe
The startup probe for the watch container. | object | |
controller.watch.topologyFileInterval
The interval in seconds to check for changes specifically in the topology.conf file. Warning: if this value is too low, scontrol reconfigure may be executed too often, especially during periods when several nodes are newly added. | int | |
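A sketch of the watch settings (the interval values are illustrative and chosen conservatively, per the warning above):
controller:
  watch:
    enabled: true
    interval: 30
    topologyFileInterval: 300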
directoryService.debugLevel
A bit mask of the SSSD debug levels to enable. | int | |
directoryService.directories
The directory services to configure. Examples:
Google Secure LDAP:
directories:
- name: google-example.com
enabled: true
ldapUri: ldaps://ldap.google.com:636
user:
defaultShell: "/bin/bash"
fallbackHomeDir: "/home/%u"
overrideHomeDir: /mnt/nvme/home/%u
ldapsCert: google-ldaps-cert
schema: rfc2307bis
CoreWeave LDAP:
directories:
- name: coreweave.cloud
enabled: true
ldapUri: ldap://openldap
user:
bindDn: cn=admin,dc=coreweave,dc=cloud
searchBase: dc=coreweave,dc=cloud
existingSecret: bind-user-sssd-config
canary: admin
defaultShell: "/bin/bash"
fallbackHomeDir: "/home/%u"
schema: rfc2307
Authentik:
directories:
- name: coreweave.cloud
enabled: true
ldapUri: ldap://authentik-outpost-ldap-outpost
user:
bindDn: cn=ldapsvc,dc=coreweave,dc=cloud
searchBase: dc=coreweave,dc=cloud
existingSecret: bind-user-sssd-config
canary: ldapsvc
startTLS: true
userObjectClass: user
groupObjectClass: group
userNameAttr: cn
groupNameAttr: cn
schema: rfc2307bis
Active Directory:
directories:
- name: contoso.com
enabled: true
ldapUri: ldap://domaincontroller.tenant-my-tenant.coreweave.cloud
user:
bindDn: CN=binduser,CN=Users,DC=contoso,DC=com
searchBase: DC=contoso,DC=com
existingSecret: bind-user-sssd-config
canary: binduser
defaultShell: "/bin/bash"
fallbackHomeDir: "/home/%u"
schema: AD
| list | |
directoryService.directories[0].additionalConfig
Multi-line string of additional arbitrary config per domain for sssd.
Example: additionalConfig: |
ldap_foo = bar
| string | |
directoryService.directories[0].defaultShell
The default user shell. | string | |
directoryService.directories[0].enabled
Enable the directory service. | bool | |
directoryService.directories[0].fallbackHomeDir
The fallback user home directory. | string | |
directoryService.directories[0].ignoreGroupMembers
This overrides the SSSD configuration option of the same name. If set to true, SSSD only retrieves information about the group objects themselves and not their members, providing a significant performance boost. If omitted, defaults to true. | bool | |
directoryService.directories[0].ldapUri
The LDAP URI to use for the directory service. Example: ldap://YOUR_LDAP_IP. For Google Secure LDAP, use: ldaps://ldap.google.com:636
| string | |
directoryService.directories[0].ldapsCert
Name of an existing TLS certificate secret for LDAPS.
Example: kubectl create secret tls google-ldaps-cert \
--cert=Google_2025_08_24_55726.crt \
--key=Google_2025_08_24_55726.key
| string | |
directoryService.directories[0].name
Name of the directory service. The primary domain should always be named default. | string | |
directoryService.directories[0].overrideGidAttr
Override the default schema LDAP attribute that corresponds to the user's primary group id. Example: posixGid | string | |
directoryService.directories[0].overrideHomeDir
Override the homeDirectory attribute from LDAP with a provided path. Example: /mnt/nvme/home/%u | string | |
directoryService.directories[0].overrideUidAttr
Override the default schema LDAP attribute that corresponds to the user's id. Example: posixUid | string | |
directoryService.directories[0].overrideUserNameAttr
Override the default schema LDAP attribute that corresponds to the user's login name. Example: employeeNumber | string | |
directoryService.directories[0].schema
The desired LDAP schema for the directory service. Valid values include rfc2307, rfc2307bis, and AD (see the examples above). Note: For Google Secure LDAP, use rfc2307bis. | string | |
directoryService.directories[0].user.bindDn
The LDAP bind DN to use for the directory service. Where bindDn is not required (e.g., Google Secure LDAP), only supply user.canary. Example: cn=Admin,ou=Users,ou=CORP,dc=corp,dc=example,dc=com | string | |
directoryService.directories[0].user.canary
The username to look up to confirm LDAP is working. | string | |
directoryService.directories[0].user.existingSecret
Name of an existing secret containing an SSSD configuration snippet with the ldap_default_authtok set for this domain. | string | |
directoryService.directories[0].user.existingSecretFileName
The name of the file in the existing secret that contains the ldap passwords. | string | |
directoryService.directories[0].user.groupSearchBase
The LDAP group search base to use for the directory service. Example: ou=groups,dc=example,dc=com | string | |
directoryService.directories[0].user.password
The password to use for the directory service lookups. | string | |
directoryService.directories[0].user.searchBase
The LDAP search base to use for the directory service. Example: dc=corp,dc=example,dc=com | string | |
directoryService.negativeCacheTimeout
Negative caching value (in seconds). Determines how long an invalid entry is cached before asking LDAP again. This improves directory listing time when a primary GID cannot be found. | string | |
directoryService.sudoGroups
List of Unix groups from all directories with sudo privileges. Group names are fully-qualified for additional directories, but not for the default directory (e.g., "group1" rather than a fully-qualified "group1@<domain>"). | list | |
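A sketch of granting sudo to one group from the default directory and one from an additional directory (group and domain names are hypothetical):
directoryService:
  sudoGroups:
    - slurm-admins           # default directory: not fully-qualified
    - ops@corp.example.com   # additional directory: fully-qualified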
directoryService.watchInterval
The interval in seconds to check for changes in sssd configuration. | int | |
global.annotations
Additional annotations to apply to all resources. | object | |
global.dnsConfig.additionalSearches
A list of namespaces to add to the list of DNS searches. These additional searches extend hostname lookup in the control-plane, compute, and login pods. Default DNS searches:
- name-compute.namespace.svc.cluster.local
- slurm_cluster_name-controller.namespace.svc.cluster.local
| list | |
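A sketch of extending the DNS search list (the namespace name is hypothetical):
global:
  dnsConfig:
    additionalSearches:
      - shared-tools.svc.cluster.local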
global.imagePullPolicy
The image pull policy for all containers. | string | |
global.labels
Additional labels to apply to all resources. | object | |
global.nodeSelector.affinity
Sets a global affinity for all Slurm node pods. This can be overridden for specific Slurm nodes in their configuration. | object | |
global.volumeMounts
The list of volume mounts to apply to all compute, controller, accounting, and login pods. | list | |
global.volumes
The list of volumes to mount to all compute, controller, accounting, and login pods. | list | |
imagePullSecrets
The list of secrets used to access images in a private registry. | list | |
jwt.existingSecret
The name of an existing secret containing the JWT private key, otherwise the chart will generate one. | string | |
login.annotations
Additional annotations. | object | |
login.automountServiceAccountToken
Automatically mount the service account token into the login pod. | bool | |
login.containers
Additional sidecar containers to add to the login pod. | list | |
login.enabled
Enable the login nodes. | bool | |
login.env
Additional environment variables to pass to the sshd container. | list | |
login.hostAliases
Provides Pod-level override of hostname resolution when DNS and other options are not applicable in login pods. See Adding entries to Pod /etc/hosts with HostAliases for more information. | list | |
login.image
The image to use for the login node. | object | repository: registry.gitlab.com/coreweave/sunk/controller-extras
tag:
|
login.individualResources
Resources for the slurm-login pod sshd container. | object | limits:
memory: 2Gi
requests:
cpu: 500m
memory: 300Mi
|
login.labels
Additional labels. | object | |
login.nodeSelector.affinity
The affinity for the login nodes. This overrides the value of global.nodeSelector.affinity . | object | |
login.priorityClassName
The priority class name for the login pod. | string | |
login.replicas
The number of replicas of the login node. When running more than one, a pod-specific service is created for each one in addition to the main service. | int | |
login.resources
Resources for the login node sshd container. | object | limits:
memory: 8Gi
requests:
cpu: 4
memory: 8Gi
|
login.s6
oneshot and longrun jobs are supported. See Running Scripts with S6 for more information. Example: s6:
packages:
type: oneshot
timeoutUp: 0
timeoutDown: 0
script: |
#!/usr/bin/env bash
apt -y install nginx
nginx:
type: longrun
timeoutUp: 0
timeoutDown: 0
script: |
#!/usr/bin/env bash
nginx -g "daemon off;"
| object | |
login.service.additionalPorts
Additional port definitions to expose. Example: additionalPorts:
- name: eternal-shell
port: 2022
targetPort: 20222 # optional
protocol: TCP # optional
| list | |
login.service.enabled
Enable the creation of service(s) for login pods. | bool | |
login.service.externalTrafficPolicy
The external traffic policy. | string | |
login.service.loadBalancerClass
The load balancer class to use for the login services. | string | |
login.service.metadata.0.annotations
Additional annotations to apply to the first login service (0). | object | |
login.service.metadata.0.labels
Additional labels to apply to the first login service (0). | object | |
login.service.metadata.common.annotations
Additional annotations to apply to the common login service. | object | |
login.service.metadata.common.labels
Additional labels to apply to the common login service. | object | |
login.service.metadata.global.annotations
Additional annotations to apply to all login services. | object | |
login.service.metadata.global.labels
Additional labels to apply to all login services. | object | |
login.service.type
The type of service to create. This defaults to LoadBalancer for cloud deployments. For development and test systems without an external load balancer to handle the service routing, such as when deploying on kind (Kubernetes IN Docker), this may be set to ClusterIP. | string | |
login.serviceAccountName
The service account name to use for the login pod. | string | |
login.sshKeyVolume.accessModes
The access mode for the storage. If scaling login beyond 1 replica, this must be ReadWriteMany. In a development setting with a volume provider that doesn't support ReadWriteMany, such as kind (Kubernetes IN Docker), this may be set to ReadWriteOnce. | string | |
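Combining this with login.service.type, a development override for kind might look like the sketch below (assuming the chart accepts a plain string here, per the type above; not a production configuration):
login:
  service:
    type: ClusterIP
  sshKeyVolume:
    accessModes: ReadWriteOnce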
login.sshKeyVolume.enabled
Enable the SSH key volume, to allow keys to be mounted and persisted in the login pod. If this is disabled, the host keys for the login pod will be regenerated on each container restart. | bool | |
login.sshKeyVolume.size
The size of the persistent volume claim. | string | |
login.sshKeyVolume.storageClassName
The storage class name to use for the volume. | string | |
login.sshdConfig
Additional sshd configuration to add to the login pod. Example: sshdConfig: |
PasswordAuthentication no
| string | |
login.sshdLivenessProbe.config
The liveness probe for the login sshd container. | object | failureThreshold: 10
initialDelaySeconds: 10
periodSeconds: 5
tcpSocket:
port: 22
|
login.sshdLivenessProbe.enabled
Whether the liveness probe for the login sshd container is enabled. | bool | |
login.sshdReadinessProbe.config
The readiness probe for the login sshd container. | object | |
login.sshdReadinessProbe.enabled
Whether the readiness probe for the login sshd container is enabled. | bool | |
login.sshdStartupProbe.config
The startup probe for the login sshd container. | object | |
login.sshdStartupProbe.enabled
Whether the startup probe for the login sshd container is enabled. | bool | |
login.terminationGracePeriodSeconds
The termination grace period for the login pod. | int | |
login.updateStrategy
The update strategy for the login node. The default type is RollingUpdate. | object | |
login.volumeMounts
Additional volume mounts to apply to the login pod. | list | |
login.volumes
Additional volumes to add to the login pod. Example: volumes:
- name: cache-vol
emptyDir:
medium: Memory
| list | |
munge.args
The additional arguments to pass to the munge container. The defaults run Munge with 10 threads instead of 2. | list | [
"--num-threads",
"10"
]
|
munge.livenessProbe
The liveness probe for the munge container. | object | |
munge.readinessProbe
The readiness probe for the munge container. | object | |
munge.resources
Resources for the munge container. | object | limits:
cpu: 1
memory: 2Gi
|
munge.securityContext.runAsGroup
The group to run as; must match the munge GID from the container image. | int | |
munge.securityContext.runAsUser
The user to run as; must match the munge UID from the container image. | int | |
munge.startupProbe
The startup probe for the munge container. | object | |
mysql
Options for the Bitnami MySQL chart; uses Bitnami default values. | object | See Bitnami default values. |
rest.annotations
Additional annotations for REST API resources. | object | |
rest.args
The additional arguments to pass to the rest container. The defaults enable debug logging and load only the most recent OpenAPI plugins. | list | [
"-vv",
"-sslurmdbd,slurmctld",
"-dv0.0.40"
]
|
rest.enabled
Enable the REST API deployment. This is optional and should be disabled for most use cases. | bool | |
rest.env
The additional environment variables to pass to the rest container. | list | [
{
"name": "SLURMRESTD_JSON",
"value": "compact"
}
]
|
rest.image
The image to use for the REST API deployment. | object | repository: registry.gitlab.com/coreweave/sunk/controller
tag:
|
rest.labels
Additional labels for REST API resources. | object | |
rest.livenessProbe
The liveness probe for the rest container. | object | tcpSocket:
port: slurmrestd
failureThreshold: 2
periodSeconds: 10
|
rest.priorityClassName
The priority class name for the rest pod. | string | |
rest.readinessProbe
The readiness probe for the rest container. | object | tcpSocket:
port: slurmrestd
periodSeconds: 5
failureThreshold: 1
|
rest.replicas
The number of replicas of the rest pod to run. In most production environments this should be set to a minimum of 2 to provide HA. | int | |
rest.resources
Resources for the slurmrestd container. These defaults are appropriate for small and medium-sized clusters. | object | limits:
cpu: 2
memory: 8Gi
requests:
cpu: 2
memory: 8Gi
|
rest.securityContext.runAsGroup
The group to run as; must match the slurm GID from the container image. | int | |
rest.securityContext.runAsUser
The user to run as; must match the slurm UID from the container image. | int | |
rest.startupProbe
The startup probe for the rest container. | object | tcpSocket:
port: slurmrestd
failureThreshold: 20
periodSeconds: 2
|
rest.terminationGracePeriodSeconds
The termination grace period for the rest pod. | int | |
scheduler.annotations
Additional annotations for scheduler resources. | object | |
scheduler.config.scheduler.gpuTypes
Mapping of Kubernetes GPU types to Slurm GPU types. The keys represent GPU types required during scheduling via the node affinity key "gpu.nvidia.com/class", and the values represent the gres GPU type in Slurm. This gets added to a job's description. | map | {
"A100_NVLINK_80GB": "a100",
"H100_NVLINK_80GB": "h100"
}
|
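A sketch of extending the mapping (the RTX_A6000 entry is hypothetical; keys match the gpu.nvidia.com/class affinity values and values match Slurm gres GPU types):
scheduler:
  config:
    scheduler:
      gpuTypes:
        A100_NVLINK_80GB: a100
        H100_NVLINK_80GB: h100
        RTX_A6000: a6000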
scheduler.config.scheduler.pollInterval
The polling interval for the Slurm API. | string | |
scheduler.config.scheduler.terminationOffset
Offset for the termination grace period, to account for communication delays, etc. | string | |
scheduler.controllerAddress
The address of the Slurm controller to connect to. This should be the service address of the controller in host:port format. | string | |
scheduler.enabled
Enable the scheduler. To schedule k8s pods on the Slurm cluster nodes, this must be enabled. | bool | |
scheduler.hooksAPI
Configuration for the webhooks. | object | {
"waitForPodDeletionInterval": "1s"
}
|
scheduler.hooksAPI.waitForPodDeletionInterval
The polling interval when checking for pod deletion. | string | |
scheduler.image
The image to use for the scheduler. | object | repository: registry.gitlab.com/coreweave/sunk/operator
tag:
|
scheduler.labels
Additional labels for scheduler resources. | object | |
scheduler.livenessProbe
The liveness probe for the scheduler container. | object | httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
periodSeconds: 20
|
scheduler.logLevel
The log level. Uses integers or zap log level strings: debug, info, warn, error, dpanic, panic, fatal. | string | |
scheduler.maxConcurrentReconciles
The maximum number of concurrent reconciles. This should be adjusted based on the volume of pods using the scheduler, so that burst operations are handled quickly. The size of both the Slurm and Kubernetes clusters will impact this, but less than for the syncer; the driving factor tends to be the pod volume and associated Slurm jobs more than anything else. Using the same value as the syncer is a rather conservative starting point in many use cases. | int | |
scheduler.name
The name of the scheduler, used to select the scheduler during pod creation. When not set, the name defaults to <namespace>-<release>-scheduler. | string | |
scheduler.priorityClassName
The priority class name for the scheduler pod. | string | |
scheduler.readinessProbe
The readiness probe for the scheduler container. | object | httpGet:
path: /readyz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10
|
scheduler.resources
Resources for the scheduler container. | object | limits:
cpu: "1"
memory: 1Gi
requests:
cpu: 200m
memory: 1Gi
|
scheduler.scope.namespaces
The list of the namespaces to scope the scheduler to. Only used when scope.type is set to namespace . Namespaces other than the release namespace will need role bindings created. | list | |
scheduler.scope.type
The type can be cluster or namespace. | string | |
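A sketch of scoping the scheduler to specific namespaces (the namespace names are hypothetical; note the role-binding requirement above):
scheduler:
  scope:
    type: namespace
    namespaces:
      - tenant-a
      - tenant-b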
scheduler.startupProbe
The startup probe for the scheduler container. | object | |
secretJob.annotations
Additional annotations for secret Job resources. | object | |
secretJob.labels
Additional labels for secret Job resources. | object | |
secretJob.nodeSelector.affinity
The affinity for the secret job. This overrides the value of global.nodeSelector.affinity . | object | |
secretJob.priorityClassName
The priority class name for the secret job pod. | string | |
secretJob.resources
Resources for the secret job container. | object | limits:
cpu: 500m
memory: 500Mi
requests:
cpu: 200m
memory: 100Mi
|
secretJob.tolerations
The tolerations for the secret job. | list | |
slurm-login
Configure individual login nodes via the slurm-login subchart. Below is an example showing some of the key parameters of the subchart; see the subchart docs for all parameters. Example: slurm-login:
enabled: true
directoryCache:
# select users from two groups
selectGroups: ["slurm-researchers", "slurm-ops"]
# poll every minute (default 90s)
interval: 1m
# Google Secure LDAP
directoryService:
directories:
- name: google-example.com
enabled: true
ldapUri: ldaps://ldap.google.com:636
user:
defaultShell: "/bin/bash"
fallbackHomeDir: "/home/%u"
overrideHomeDir: /mnt/nvme/home/%u
ldapsCert: google-ldaps-cert
schema: rfc2307bis
| object | See default values in slurm-login subchart. |
slurmConfig.cgroupConfig
The cgroup.conf value. This is only used when procTrackType is set to proctrack/cgroup. | string | slurmConfig.cgroupConfig: |
CgroupPlugin=autodetect
IgnoreSystemd=yes
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
|
slurmConfig.defMemPerCPU
The default memory per CPU in megabytes. Sets the slurm.conf parameter of the default real memory size available per usable allocated CPU in megabytes. This value is used when the --mem-per-cpu option is not specified on the srun command line. | int | |
slurmConfig.inactiveLimit
Terminate job allocation commands, such as srun or salloc, that are unresponsive for longer than this interval in seconds. See the slurm.conf reference for more details. | int | |
slurmConfig.killWait
The interval in seconds between the SIGTERM and SIGKILL signals given to a job's processes upon reaching its time limit. See the slurm.conf reference for more details. | int | |
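A sketch of these timing knobs in values.yaml (the numbers are illustrative, not chart defaults):
slurmConfig:
  inactiveLimit: 120   # seconds before unresponsive srun/salloc commands are terminated
  killWait: 30         # seconds between SIGTERM and SIGKILL at a job's time limit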
slurmConfig.poolSize
The number of connections to be maintained in the connection pool. | int | |
slurmConfig.protocolVersion
The protocol version to use for communication with the Slurm controller. | string | |
slurmConfig.selectTypeParameters
The values to use for the parameters of the select/cons_tres plugin. Allowed values depend on the configured value of SelectType. See the slurm.conf reference for more details. | string | |
slurmConfig.slurmCtld.accountingStorageEnforce
Controls what level of association-based enforcement to impose on job submissions. Multiple comma-separated values are allowed. Valid options are any combination of: associations, limits, nojobs, nosteps, qos, safe, wckeys. Use all to impose everything except nojobs and nosteps, which must be requested separately. See the Slurm documentation for more details. | string | |
slurmConfig.slurmCtld.additionalParameters
The list of additional parameters to pass to slurmCtld. See the Slurm documentation for possible values. | list | - idle_on_node_suspend
- node_reg_mem_percent=95
|
slurmConfig.slurmCtld.etcConfigMap
A ConfigMap with keys mapping to files in /etc/slurm on the controller only. This ConfigMap must not contain: slurm.conf, plugstack.conf, gres.conf, cgroup.conf, topology.conf. | string | |
slurmConfig.slurmCtld.jobSubmitPlugins
The job submit plugins to use. Multiple comma-separated values are allowed. | string | |
slurmConfig.slurmCtld.procTrackType
The plugin to be used for process tracking on a job step basis. See the Slurm documentation for more details. Valid values: proctrack/linuxproc, proctrack/cgroup. | string | |
slurmConfig.slurmCtld.taskPlugin
The task plugin to use. See the Slurm documentation for more details. Multiple comma-separated values are allowed. Valid values: task/affinity, task/cgroup, task/none. | string | |
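A sketch of enabling cgroup-based process tracking (values drawn from the valid options above; pairing task/cgroup with proctrack/cgroup is a common Slurm configuration, not a chart requirement):
slurmConfig:
  slurmCtld:
    procTrackType: proctrack/cgroup
    taskPlugin: task/cgroup,task/affinity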
slurmConfig.slurmCtld.timeout
The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533. | int | |
slurmConfig.slurmd.epilogConfigMap
The name or list of ConfigMap names containing epilog scripts. | string or list | |
slurmConfig.slurmd.prologConfigMap
The name or list of ConfigMap names containing prolog scripts. | string or list | |
slurmConfig.slurmd.suspendTime
Nodes which remain idle or down for this number of seconds will be placed into power save mode by SuspendProgram. | string | |
slurmConfig.slurmd.timeout
The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN. | int | |
slurmConfig.usePersistentConnection
Use Slurm's persistent connections for connection reuse. | bool | |
slurmConfig.waitTime
Specifies how many seconds the srun command should wait after the first task terminates before terminating all remaining tasks. Using the --wait option on the srun command line overrides this value. The default value is 0, which disables this feature. See the slurm.conf reference for more details. | int | |
sssdContainer.livenessProbe
The liveness probe for the sssd container. | object | |
sssdContainer.readinessProbe
The readiness probe for the sssd container. | object | |
sssdContainer.startupProbe
The startup probe for the sssd container. | object | |
syncer.annotations
Additional annotations for syncer resources. | object | |
syncer.config.syncer.orphanedPodDelay
The delay to wait before deleting a pod that is no longer associated with a Slurm node. | string | |
syncer.config.syncer.pollInterval
The polling interval for the Slurm API. | string | |
syncer.config.syncer.qosInterruptable
The externally defined label indicating whether a pod is interruptable. | string | "qos.coreweave.cloud/interruptable"
|
syncer.config.syncer.slurmNodeCleanUp
Enable cleanup of lingering Slurm nodes. | bool | |
syncer.controllerAddress
The address of the Slurm controller to connect to. This should be the service address of the controller in host:port format. | string | |
syncer.enabled
Enable the syncer. This is required for most functionality and should only be disabled for troubleshooting. | bool | |
syncer.hooksAPI
Configuration for the webhooks. | object | {
"nodeRebootCondition": "PhaseState",
"nodeRebootReason": "production-powerreset",
"safeNodeRebootCondition": "PendingPhaseState",
"safeNodeRebootReason": "production-powerreset",
"waitForNodeLockedInterval": "1s",
"waitForNodeLockedTimeout": "120s"
}
|
syncer.hooksAPI.nodeRebootCondition
Condition to indicate node should be rebooted. | string | |
syncer.hooksAPI.nodeRebootReason
The target NLCC lifecycle state associated with the nodeRebootCondition. | string | |
syncer.hooksAPI.safeNodeRebootCondition
Condition to indicate node should be rebooted safely. | string | |
syncer.hooksAPI.safeNodeRebootReason
The target NLCC lifecycle state associated with the safeNodeRebootCondition. | string | |
syncer.hooksAPI.waitForNodeLockedInterval
The polling interval when checking for node locked state. | string | |
syncer.hooksAPI.waitForNodeLockedTimeout
The timeout for checking node locked state. | string | |
syncer.image
The image to use for the syncer. | object | repository: registry.gitlab.com/coreweave/sunk/operator
tag:
|
syncer.labels
Additional labels for syncer resources. | object | |
syncer.livenessProbe
The liveness probe for the syncer container. | object | httpGet:
path: /healthz
port: 8081
initialDelaySeconds: 15
periodSeconds: 20
|
syncer.logLevel
The log level. Uses integers or zap log level strings: debug, info, warn, error, dpanic, panic, fatal. | string | |
syncer.maxConcurrentReconciles
The maximum number of concurrent reconciles. This should be adjusted based on the number of nodes and the size of jobs launched in the Slurm cluster, so that burst operations are handled quickly. A value 1/10th the number of nodes in the cluster is a good starting point for small clusters. As cluster size increases, this value can be a smaller fraction of the total number of nodes in most cases; for instance, a value of 50 seems to handle a 2000-node cluster well. Being too aggressive here will bottleneck other components, such as the Kubernetes API server and the Slurm controller, which in some cases may cause errors. | int | |
syncer.nodePermissions.enabled
Enable node operations on the syncer; currently this allows restarting nodes when enabled. | bool | |
syncer.priorityClassName
The priority class name for the syncer pod. | string | |
syncer.readinessProbe
The readiness probe for the syncer container. | object | httpGet:
path: /readyz
port: 8081
initialDelaySeconds: 5
periodSeconds: 10
|
syncer.resources
Resources for the syncer container. | object | limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 200m
memory: 1Gi
|
syncer.startupProbe
The startup probe for the syncer container. | object | |
syncer.watchAllNodeSets
Watch all NodeSets in the namespace. This overrides the default behavior of only watching the NodeSets deployed with this chart release. | bool | |
syncer.watchNodeSets
The list of NodeSets to watch. This overrides the default behavior of watching the NodeSets deployed with this chart release to instead watch this specific list. This is not used if watchAllNodeSets is set to true. | list | |
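A sketch of narrowing the syncer to an explicit NodeSet list (the NodeSet names are hypothetical):
syncer:
  watchAllNodeSets: false
  watchNodeSets:
    - gpu-nodes
    - cpu-nodes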
userLookupContainer.livenessProbe
The liveness probe for the user-lookup container. | object | |
userLookupContainer.readinessProbe
The readiness probe for the user-lookup container. | object | |
userLookupContainer.startupProbe
The startup probe for the user-lookup container. | object | |