Requirements
| Repository | Name | Version |
|---|---|---|
| file://../library | library | 0.1.0 |
| file://../slurm-login | slurm-login | 0.1.0 |
| oci://registry-1.docker.io/bitnamicharts | mysql | 9.19.1 |
Parameters
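Each key in the table below is a dotted path into the chart's values. For example, `accounting.externalDB.existingSecret` is the `existingSecret` field under `externalDB` under `accounting`. Bracketed segments such as `directoryService.directories[0].name` are list indices. Helm's `--set` flag uses a similar dotted notation with bracketed indices.

The following is a minimal Python sketch that expands such keys into the nested structure a values file expects. It assumes that a bare numeric segment, such as the `0` in `login.service.metadata.0.labels`, is an ordinary map key and not a list index. The helpers `set_path` and `build_values` are hypothetical and are not part of the chart or of Helm. The Secret name and hostname in the example are placeholders. The script prints JSON, which Helm can read as a values file because its YAML parser also accepts JSON.

```python
import json
import re

# Matches a segment such as "directories[0]" and captures ("directories", "0").
_INDEXED = re.compile(r"^([^\[\]]+)\[(\d+)\]$")


def _split(dotted_key):
    """Split a dotted key into steps: str for map keys, int for list indices."""
    steps = []
    for segment in dotted_key.split("."):
        match = _INDEXED.match(segment)
        if match:
            steps.extend([match.group(1), int(match.group(2))])
        else:
            # A bare segment, even a numeric one like "0", stays a map key.
            steps.append(segment)
    return steps


def _ensure_index(items, index):
    """Pad a list with None so that items[index] exists."""
    while len(items) <= index:
        items.append(None)


def set_path(values, dotted_key, value):
    """Set value at a dotted key, creating nested dicts and lists as needed."""
    steps = _split(dotted_key)
    node = values
    for step, next_step in zip(steps, steps[1:]):
        # The container to create depends on whether the next step is a list index.
        empty = [] if isinstance(next_step, int) else {}
        if isinstance(step, int):
            _ensure_index(node, step)
            if node[step] is None:
                node[step] = empty
            node = node[step]
        else:
            node = node.setdefault(step, empty)
    last = steps[-1]
    if isinstance(last, int):
        _ensure_index(node, last)
    node[last] = value


def build_values(flat):
    """Expand {dotted_key: value} pairs into one nested values structure."""
    values = {}
    for key, value in flat.items():
        set_path(values, key, value)
    return values


if __name__ == "__main__":
    # Hypothetical overrides; the Secret name and hostname are placeholders.
    overrides = {
        "accounting.enabled": True,
        "accounting.externalDB.enabled": True,
        "accounting.externalDB.existingSecret": "slurmdbd-db-password",
        "accounting.externalDB.storageHost": "mysql.example.internal",
        "directoryService.directories[0].name": "default",
        "directoryService.directories[0].enabled": True,
        "login.service.metadata.0.labels": {"team": "hpc"},
    }
    # Save this output to a file and pass it to Helm with -f.
    print(json.dumps(build_values(overrides), indent=2))
```

The same settings can also be written by hand as nested YAML. The sketch only makes the mapping from the dotted names in the table explicit.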
| Key & Description | Type | Default |
|---|---|---|
| accounting.annotations Additional annotations for accounting resources. | object | |
| accounting.config.ArchiveEvents | string | |
| accounting.config.ArchiveJobs | string | |
| accounting.config.ArchiveResvs | string | |
| accounting.config.ArchiveSteps | string | |
| accounting.config.ArchiveSuspend | string | |
| accounting.config.ArchiveTXN | string | |
| accounting.config.ArchiveUsage | string | |
| accounting.config.AuthAltParameters[0] | string | |
| accounting.config.AuthAltTypes | string | |
| accounting.config.AuthType | string | |
| accounting.config.DbdPort | int | |
| accounting.config.DebugLevel | string | |
| accounting.config.LogFile | string | |
| accounting.config.PidFile | string | |
| accounting.config.PurgeEventAfter | string | |
| accounting.config.PurgeJobAfter | string | |
| accounting.config.PurgeResvAfter | string | |
| accounting.config.PurgeStepAfter | string | |
| accounting.config.PurgeSuspendAfter | string | |
| accounting.config.PurgeTXNAfter | string | |
| accounting.config.PurgeUsageAfter | string | |
| accounting.config.SlurmUser | string | |
| accounting.config.StoragePort | int | |
| accounting.config.StorageType | string | |
| accounting.enabled Enable Slurm accounting. | bool | |
| accounting.external.enabled Use an external accounting instance instead of deploying an internal one. This configuration also requires the underlying slurmdbd database to be managed externally. | bool | |
| accounting.external.host The host of the external accounting instance: IP or hostname. | string | |
| accounting.external.port The port of the external accounting instance. | string | |
| accounting.external.user The user to use to authenticate to the external accounting instance. | string | |
| accounting.externalDB.enabled Configure Slurm accounting (slurmdbd) with an external database. | bool | |
| accounting.externalDB.existingSecret Specify the name of the Kubernetes Secret that contains the password used by slurmdbd to access the Slurm accounting database. This Secret must contain a data key named db-password whose value is the actual database password. Important: The password value stored in the Secret cannot contain the hash (#) character. | string | |
| accounting.externalDB.storageHost The hostname of the server where the database resides. | string | |
| accounting.externalDB.storageLoc The name of the database used to store Slurm accounting records. Defaults to “slurm_acct_db”. | string | |
| accounting.externalDB.storageUser The username slurmdbd uses for authentication and storing job accounting data. | string | |
| accounting.image The image to use for the slurmdbd deployment. | object | |
| accounting.labels Additional labels for accounting resources. | object | |
| accounting.livenessProbe The liveness probe for the slurmdbd container. | object | |
| accounting.priorityClassName The priority class name for the accounting pod. | string | |
| accounting.readinessProbe The readiness probe for the slurmdbd container. | object | |
| accounting.replicas The number of replicas of the accounting instance to run. | int | |
| accounting.resources Resources for the accounting container. | object | |
| accounting.securityContext.runAsGroup The group to run as. Must match the slurm GID from the container image. | int | |
| accounting.securityContext.runAsUser The user to run as. Must match the slurm UID from the container image. | int | |
| accounting.startupProbe The startup probe for the slurmdbd container. | object | |
| accounting.terminationGracePeriodSeconds The termination grace period for the accounting pod. | int | |
| accounting.useExistingSecret Use an existing secret for the accounting instance instead of creating one. The secret name is the same as mysql.auth.existingSecret. | bool | |
| accounting.volumeMounts Additional volume mounts to apply to the accounting pod. | list | |
| accounting.volumes Additional volumes to mount to the accounting pod. | list | |
| cleanupCompleting.annotations Additional annotations for cleanup-completing Job resources. | object | |
| cleanupCompleting.cronJobSchedule The schedule for the cleanup-completing CronJob, in standard cron syntax. The default runs every minute. | string | |
| cleanupCompleting.deleteInvalidNodes Enable deletion of nodes that are in INVALID_REG state after downing them. This allows nodes to cleanly re-register with Slurm. | bool | |
| cleanupCompleting.dryRun Enable dry-run mode, which shows what would be done without actually downing nodes. | bool | |
| cleanupCompleting.enabled Enable cleanup of nodes with jobs stuck in COMPLETING state. | bool | |
| cleanupCompleting.labels Additional labels for cleanup-completing Job resources. | object | |
| cleanupCompleting.nodeSelector.affinity The affinity for the cleanup-completing Job. This overrides the value of global.nodeSelector.affinity. | object | |
| cleanupCompleting.priorityClassName The priority class name for the cleanup-completing Job pod. | string | |
| cleanupCompleting.resources Resources for the cleanup-completing Job container. | object | |
| cleanupCompleting.timeoutSeconds Timeout, in seconds, for jobs in COMPLETING state before nodes are downed. Jobs that have been completing for longer than this threshold trigger node downing, provided no other jobs are present on the node. If not specified, defaults to twice the KillWait value from slurm.conf. | int | |
| cleanupCompleting.tolerations The tolerations for the cleanup-completing Job. | list | |
| cleanupCompleting.verbose Enable verbose logging for debugging. | bool | |
| compute.annotations Additional annotations for compute services only. Use compute.nodes.custom-definition.annotations to add annotations to specific node definitions instead. | object | |
| compute.autoPartition.config Partition settings intended for you to customize. These values are applied to each auto-generated partition, and each partition is named after its node definition. | object | |
| compute.autoPartition.enabled Enable the auto partition. | bool | |
| compute.cacheDropper.enabled An option to enable or disable the cache-dropper sidecar container across all slurmd pods. | bool | |
| compute.cacheDropper.resources Resources for the cache-dropper sidecar container. | object | |
| compute.epilogConfigMap The name, or list of names, of ConfigMaps containing epilog scripts. | string or list | |
| compute.externalClusterName The name of an external cluster to join. This is used when control plane is deployed separately. | string | |
| compute.generateTopology Enable topology generation for the compute nodes in the cluster. | bool | |
| compute.gpusd Configuration for GPUSD (GPU Straggler Detection) metrics collection. | object | See individual settings below. |
| compute.gpusd.enabled Enable GPUSD package installation, metrics collection, and VMPodScrape resource. | bool | |
| compute.gpusd.version GPUSD version to install. | string | |
| compute.initialState The initial state for the nodes when they join the slurm cluster. This is generally drain or idle. May also be set per node definition. | string | |
| compute.initialStateReason The reason for setting the initial state of the nodes to down, drained, or fail. May also be set per node definition. | string | |
| compute.labels Additional labels for compute services only. Use compute.nodes.custom-definition.labels to add labels to specific node definitions instead. | object | |
| compute.livenessProbe The liveness probe for the compute slurmd container. | object | |
| compute.maxUnavailable The maximum number of compute nodes that can be unavailable during a rolling update. Can be a percentage or an absolute number. | string | |
| compute.nodes Multiple node definitions can be declared, but only one may be enabled: true. Node definitions can reference other definitions to include or overlay values. See the Compute Node Definitions documentation for more details. | object | See Compute Node Definitions. |
| compute.partitionBaseConfig Default configuration for partitions in the cluster. These values can be overridden per partition in the autoPartition section. | object | |
| compute.partitions Partitions to add to the cluster. The key is the partition name and the value is the partition configuration. | object | |
| compute.plugstackConfig Additional plug-in stack configuration items for plugstack.conf file config. Config Options: https://slurm.schedmd.com/spank.html#SECTION_CONFIGURATION | list | |
| compute.ports Additional ports to expose on the compute nodes, for example the ports used by the NCCL plugin. | list | |
| compute.prologConfigMap The name, or list of names, of ConfigMaps containing prolog scripts. | string or list | |
| compute.pyxis.appArmorProfile The AppArmor profile to use for the pyxis container. | string | |
| compute.pyxis.enabled Enable the pyxis container. | bool | |
| compute.pyxis.enrootConfig Configuration options for enroot. | object | |
| compute.pyxis.plugstackOptions Additional arguments for the pyxis plugin in plugstack.conf file config. Config Options: https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration | list | |
| compute.pyxis.podSecurityContext Security context for the pyxis container. | object | |
| compute.pyxis.podSecurityContext.seccompProfile The seccomp profile to use for the pyxis container. | object | |
| compute.readinessProbe The readiness probe for the compute slurmd container. | object | |
| compute.reservedMemory Memory to reserve when calculating the DefMemPerCPU value for slurm.conf. | string | |
| compute.s6 oneshot and longrun jobs are supported. See Running Scripts with S6 for more information. | object | |
| compute.securityContext.capabilities.add Capabilities to add to the slurmd container. “SYS_ADMIN” is required if using pyxis. | list | |
| compute.ssh.enabled Enable ssh to the compute nodes. | bool | |
| compute.startupProbe The startup probe for the compute slurmd container. | object | |
| compute.volumeMounts Additional volume mounts to add to all the compute pods, also added to login pods. | list | |
| compute.volumes Additional volumes to mount to all the compute pods, also added to login pods. | list | |
| controlPlane.enabled Enable the Slurm control plane. Leave this enabled unless you are splitting the deployment. | bool | |
| controller.annotations Additional annotations for controller resources. | object | |
| controller.enabled Enable the controller deployment. Leave this enabled unless a more complex deployment is required, such as splitting the deployment. | bool | |
| controller.etcConfigMap The ConfigMap(s) with keys mapping to files in /etc/slurm on the controller only. This ConfigMap must not contain: | string or list | |
| controller.image The image to use for the controller. | object | |
| controller.labels Additional labels for controller resources. | object | |
| controller.livenessProbe The liveness probe for the controller. | object | |
| controller.priorityClassName The priority class name for the controller. | string | |
| controller.readinessProbe The readiness probe for the controller. | object | |
| controller.replicas The number of controller replicas to run. This should currently be left at 1. | int | |
| controller.resources Resources for the controller container. | object | |
| controller.securityContext.runAsGroup The group to run as. Must match the slurm GID from the container image. | int | |
| controller.securityContext.runAsUser The user to run as. Must match the slurm UID from the container image. | int | |
| controller.startupProbe The startup probe for the controller. | object | |
| controller.stateVolume.size The size of the persistent volume claim. | string | |
| controller.stateVolume.storageClassName The storage class name to use for the volume. | string | |
| controller.terminationGracePeriodSeconds The termination grace period for the controller. | int | |
| controller.volumeMounts Additional volume mounts to apply to the controller pod. | list | |
| controller.volumes Additional volumes to mount to the controller pod. | list | |
| controller.watch.enabled Enable watching the Slurm configuration and triggering a reconfigure when there are changes. | bool | |
| controller.watch.interval The interval in seconds to check for changes in the Slurm configuration. | int | |
| controller.watch.livenessProbe The liveness probe for the watch container. | object | |
| controller.watch.readinessProbe The readiness probe for the watch container. | object | |
| controller.watch.startupProbe The startup probe for the watch container. | object | |
| directoryService.debugLevel A bit mask of the SSSD debug levels to enable. | int | |
| directoryService.directories The directory services to configure, for example Google Secure LDAP, CoreWeave LDAP, Authentik, or Active Directory. | list | |
| directoryService.directories[0].additionalConfig A multi-line string of additional, arbitrary per-domain SSSD configuration. | string | |
| directoryService.directories[0].defaultShell The default user shell. | string | |
| directoryService.directories[0].enabled Enable the directory service. | bool | |
| directoryService.directories[0].fallbackHomeDir The fallback user home directory. | string | |
| directoryService.directories[0].ignoreGroupMembers Overrides the SSSD option of the same name. If set to true, SSSD retrieves only the group objects themselves and not their members, which significantly improves performance. If omitted, defaults to true. | bool | |
| directoryService.directories[0].ldapUri The LDAP URI to use for the directory service. Example: ldap://YOUR_LDAP_IP (for Google Secure LDAP, use ldaps://ldap.google.com:636) | string | |
| directoryService.directories[0].ldapsCert Name of an existing TLS certificate for LDAPS. | string | |
| directoryService.directories[0].name Name of the directory service. The primary domain should always be named default. | string | |
| directoryService.directories[0].overrideGidAttr Override the default schema LDAP attribute that corresponds to the user’s primary group ID. Example: posixGid | string | |
| directoryService.directories[0].overrideHomeDir Override the homeDirectory attribute from LDAP with a provided path. Example: /mnt/nvme/home/%u | string | |
| directoryService.directories[0].overrideUidAttr Override the default schema LDAP attribute that corresponds to the user’s ID. Example: posixUid | string | |
| directoryService.directories[0].overrideUserNameAttr Override the default schema LDAP attribute that corresponds to the user’s login name. Example: employeeNumber | string | |
| directoryService.directories[0].schema The desired LDAP schema for the directory service. Valid values: rfc2307bis. | string | |
| directoryService.directories[0].user.bindDn The LDAP bind DN to use for the directory service. Where bindDn is not required (e.g. Google Secure LDAP), only supply user.canary. Example: cn=Admin,ou=Users,ou=CORP,dc=corp,dc=example,dc=com | string | |
| directoryService.directories[0].user.canary The username to lookup to confirm LDAP is working. | string | |
| directoryService.directories[0].user.existingSecret Name of an existing secret containing an SSSD configuration snippet with the ldap_default_authtok set for this domain. | string | |
| directoryService.directories[0].user.existingSecretFileName The name of the file in the existing secret that contains the LDAP passwords. | string | |
| directoryService.directories[0].user.groupSearchBase The LDAP group search base to use for the directory service. Example: ou=groups,dc=example,dc=com | string | |
| directoryService.directories[0].user.password The password to use for the directory service lookups. | string | |
| directoryService.directories[0].user.searchBase The LDAP search base to use for the directory service. Example: dc=corp,dc=example,dc=com | string | |
| directoryService.negativeCacheTimeout Negative caching value (in seconds). Determines how long an invalid entry will be cached before asking LDAP again. This improves the directory listing time when a primary gid cannot be found. | string | |
| directoryService.sudoGroups List of Unix groups, from all directories, with sudo privileges. Use fully qualified names for groups from additional directories and unqualified names for groups from the default directory (for example, “group1” rather than “group1@domain.com”). | list | |
| directoryService.watchInterval The interval in seconds to check for changes in sssd configuration. | int | |
| global.annotations Additional annotations to apply to all resources. | object | |
| global.cks Enable CoreWeave Kubernetes Services (CKS) integration. | bool | |
| global.dnsConfig.additionalSearches A list of namespaces to add to the DNS search list. These additional searches extend hostname lookup in the control-plane, compute, and login pods. The default DNS searches are name-compute.namespace.svc.cluster.local and slurm_cluster_name-controller.namespace.svc.cluster.local. | list | |
| global.imagePullPolicy The image pull policy for all containers. | string | |
| global.labels Additional labels to apply to all resources. | object | |
| global.nodeSelector.affinity The affinity for the Slurm control-plane components. | object | |
| global.volumeMounts The list of volume mounts to apply to all compute, controller, accounting, and login pods. | list | |
| global.volumes The list of volumes to mount to all compute, controller, accounting, and login pods. | list | |
| imagePullSecrets The list of secrets used to access images in a private registry. | list | |
| jwt.existingSecret The name of an existing secret containing the JWT private key, otherwise the chart will generate one. | string | |
| login.annotations Additional annotations. | object | |
| login.automountServiceAccountToken Automatically mount the service account token into the login pod. | bool | |
| login.containers Additional sidecar containers to add to the login pod. | list | |
| login.enabled Enable the login nodes. | bool | |
| login.env Additional environment variables to pass to the sshd container. | list | |
| login.hostAliases Provides Pod-level override of hostname resolution when DNS and other options are not applicable in login pods. See Adding entries to Pod /etc/hosts with HostAliases for more information. | list | |
| login.image The image to use for the login node. | object | |
| login.individualResources Resources for the slurm-login pod sshd container. | object | |
| login.labels Additional labels. | object | |
| login.nodeSelector.affinity The affinity for the login nodes. This overrides the value of global.nodeSelector.affinity. | object | |
| login.priorityClassName The priority class name for the login pod. | string | |
| login.replicas The number of login node replicas. When running more than one, a pod-specific service is created for each replica in addition to the main service. | int | |
| login.resources.limits.memory | string | |
| login.resources.requests.cpu | int | |
| login.resources.requests.memory | string | |
| login.s6 oneshot and longrun jobs are supported. See Running Scripts with S6 for more information. | object | |
| login.service.additionalPorts Additional port definitions to expose. | list | |
| login.service.enabled Enable the creation of service(s) for login pods. | bool | |
| login.service.externalTrafficPolicy The external traffic policy. | string | |
| login.service.loadBalancerClass The load balancer class to use for the login services. | string | |
| login.service.metadata.0.annotations Additional annotations to apply to the first login service (0). | object | |
| login.service.metadata.0.labels Additional labels to apply to the first login service (0). | object | |
| login.service.metadata.common.annotations Additional annotations to apply to the common login service. | object | |
| login.service.metadata.common.labels Additional labels to apply to the common login service. | object | |
| login.service.metadata.global.annotations Additional annotations to apply to all login services. | object | |
| login.service.metadata.global.labels Additional labels to apply to all login services. | object | |
| login.service.type The type of service to create. This defaults to LoadBalancer for cloud deployments. For development and test systems without an external load balancer to handle the service routing, such as when deploying on kind (Kubernetes IN Docker), this may be set to ClusterIP. | string | |
| login.serviceAccountName The service account name to use for the login pod. | string | |
| login.sshKeyVolume.accessModes The access mode for the storage. If scaling login beyond 1 replica, this must be ReadWriteMany. In a development setting with a volume provider that doesn’t support ReadWriteMany, such as kind (Kubernetes IN Docker), this may be set to ReadWriteOnce. | string | |
| login.sshKeyVolume.enabled Enable the SSH key volume so that keys are mounted and persisted in the login pod. If disabled, the host keys for the login pod are regenerated on each container restart. | bool | |
| login.sshKeyVolume.size The size of the persistent volume claim. | string | |
| login.sshKeyVolume.storageClassName The storage class name to use for the volume. | string | |
| login.sshdConfig Additional sshd configuration to add to the login pod. | string | |
| login.sshdLivenessProbe.config The liveness probe for the login sshd container. | object | |
| login.sshdLivenessProbe.enabled Enable the liveness probe for the login sshd container. | bool | |
| login.sshdReadinessProbe.config The readiness probe for the login sshd container. | object | |
| login.sshdReadinessProbe.enabled Enable the readiness probe for the login sshd container. | bool | |
| login.sshdStartupProbe.config The startup probe for the login sshd container. | object | |
| login.sshdStartupProbe.enabled Enable the startup probe for the login sshd container. | bool | |
| login.terminationGracePeriodSeconds The termination grace period for the login pod. | int | |
| login.updateStrategy The update strategy for the login pods. The default type is RollingUpdate. | object | |
| login.volumeMounts Additional volume mounts to apply to the login pod. | list | |
| login.volumes Additional volumes to add to the login pod. | list | |
| moco Options for MOCO MySQL used for Slurm job accounting. | object | See individual settings below. |
| moco.enabled Enable moco. | bool | |
| moco.migration.enabled When set to true, a Kubernetes Job is created to migrate the Slurm accounting database to MOCO MySQL. The Job runs once and then completes, migrating any existing Slurm accounting database in the Bitnami MySQL instance to the MOCO MySQL database. Set this to true only for the initial migration, then set it back to false to return the cluster to normal operation. The Slurm cluster is not functional while the automated migration runs. | bool | |
| moco.mysqlCluster.affinity Optional pod affinity configuration. | object | |
| moco.mysqlCluster.auth.existingSecret Optional. The name of the Kubernetes Secret that contains the password slurmdbd uses to access the MOCO MySQL database. This Secret must contain a data key named WRITABLE_PASSWORD whose value is the database password. If not set, a randomly generated MOCO WRITABLE_PASSWORD is used. | string | |
| moco.mysqlCluster.auth.storageHost The hostname of the server where the database resides. | string | |
| moco.mysqlCluster.auth.storageLoc The name of the database used to store Slurm accounting records. Defaults to “slurm_acct_db”. | string | |
| moco.mysqlCluster.auth.storageUser The username slurmdbd uses for authentication and storing job accounting data. | string | |
| moco.mysqlCluster.config Additional MySQL configuration for the mysqlCluster. It is placed in a ConfigMap that the mysqlCluster references. The ConfigMap contents are rendered as a template, so Helm expressions can be used, and the result must be valid YAML. See the MySQL option file documentation. | string | |
| moco.mysqlCluster.image The image to use for mysql. | object | |
| moco.mysqlCluster.inodeLockFixer.enabled Configure init container to fix inode locking issues. This init container copies, moves, and replaces the MySQL data directory to prevent known inode locking issues that can occur in certain storage environments. | bool | |
| moco.mysqlCluster.inodeLockFixer.image The image to use for the inode lock fixer init container. | object | |
| moco.mysqlCluster.persistence The volume settings to use for mysql. | object | |
| moco.mysqlCluster.resources Resources for the moco mysqld container. | object | |
| moco.priorityClassName The priority class name for moco. | string | |
| munge.args The additional arguments to pass to the munge container. The defaults run Munge with 10 threads instead of 2. | list | |
| munge.livenessProbe Liveness probe for the munge container. When munged hangs (e.g. thread deadlock), munge -n blocks on the Unix socket and the probe times out, causing Kubernetes to restart only the munged container. | object | |
| munge.readinessProbe The readiness probe for the munge container. | object | |
| munge.resources Resources for the munge container. | object | |
| munge.securityContext.runAsGroup The group to run as. Must match the munge GID from the container image. | int | |
| munge.securityContext.runAsUser The user to run as. Must match the munge UID from the container image. | int | |
| munge.startupProbe The startup probe for the munge container. | object | |
| mysql Options for the Bitnami MySQL chart; the Bitnami default values apply. This chart adds one option, vmPodScrape.enabled, which can be used as an alternative to serviceMonitor.enabled. | object | See Bitnami default values. |
| nsscache.annotations Additional annotations for nsscache Job resources. | object | |
| nsscache.cronJobSchedule The schedule for the nsscache update CronJob, in standard cron syntax. | string | |
| nsscache.enabled Enable nsscache. | bool | |
| nsscache.existingSecret Name of an existing secret containing the LDAP password for this domain. This secret should contain a key named nsscache-ldap-password which contains the password to use for the LDAP bind DN. For SCIM, this secret should contain a key named nsscache-scim-auth-token which contains the token to use for the SCIM server. | string | |
| nsscache.labels Additional labels for nsscache Job resources. | object | |
| nsscache.nodeSelector.affinity The affinity for the nsscache Job. This overrides the value of global.nodeSelector.affinity. | object | |
| nsscache.nsscacheConfig Options for defining nsscache.conf, for either an LDAP or a SCIM source. | object | See the nsscache.conf documentation. |
| nsscache.nsscacheConfig.default.cache How the cache data is stored. | string | |
| nsscache.nsscacheConfig.default.files_cache_filename_suffix A suffix appended to the cache filename to differentiate it from, say, system NSS databases. | string | |
| nsscache.nsscacheConfig.default.files_dir Directory location to store the plain text files in. | string | |
| nsscache.nsscacheConfig.default.ldap_base The base to perform LDAP searches under. Example: dc=coreweave,dc=cloud | string | |
| nsscache.nsscacheConfig.default.ldap_bind_dn The bind DN to use when connecting to LDAP. Empty string is an anonymous bind. Example: cn=ldapsvc,dc=coreweave,dc=cloud | string | |
| nsscache.nsscacheConfig.default.ldap_bind_password The password to use for the LDAP bind DN. We strongly recommend using a Kubernetes secret to store this password and reference it using the nsscache.existingSecret value. | string | |
| nsscache.nsscacheConfig.default.ldap_default_shell This will be the default shell for all users. You can specify a different shell by setting the loginShell value in the user attributes in the source directory configuration. Example: /bin/bash | string | |
| nsscache.nsscacheConfig.default.ldap_rfc2307bis Example: 1 | int | |
| nsscache.nsscacheConfig.default.ldap_scope The search scope to use for LDAP. Example: sub | string | |
| nsscache.nsscacheConfig.default.ldap_uidattr The uid-like attribute in your LDAP directory. Example: cn | string | |
| nsscache.nsscacheConfig.default.ldap_uri The LDAP URI to connect to. Example: ldap://authentik-outpost-ldap-outpost | string | |
| nsscache.nsscacheConfig.default.maps The recommended defaults are suitable for standard nsscache operation in many environments. | list | |
| nsscache.nsscacheConfig.default.scim_base_url The base URL for the SCIM server. Example: https://api.coreweave.com/scim/<org> | string | |
| nsscache.nsscacheConfig.default.scim_groups_endpoint The endpoint for the SCIM groups API. | string | |
| nsscache.nsscacheConfig.default.scim_groups_parameters Optional URL parameters for the groups endpoint. Special characters (spaces, quotes, etc.) are automatically URL-encoded. A custom parameter, given as a comma-separated list, creates virtual user groups: for the members of the selected group(s), it adds an entry to the groups map for each user's GID. This parameter should typically match any group filtering in scim_users_parameters. By default, a filter that excludes inactive users is included. Example: excludeInactiveUsers=true&includeVirtualUserGroups=slurm-users,slurm-admins | string | |
| nsscache.nsscacheConfig.default.scim_users_endpoint The endpoint for the SCIM users API. | string | |
| nsscache.nsscacheConfig.default.scim_users_parameters Optional URL parameters for the users endpoint. Special characters (spaces, quotes, etc.) are automatically URL-encoded. A custom parameter, given as a comma-separated list, filters users by group. By default, a filter that excludes inactive users is included. Example: filter=active eq "true"&groups=slurm-users,slurm-admins | string | filter=active eq "true" |
| nsscache.nsscacheConfig.default.source Specify the data source to use. Supported options are scim and ldap. | string | |
| nsscache.nsscacheConfig.default.timestamp_dir Specifying the location of the timestamps used for incremental updates. | string | |
| nsscache.nsscacheConfig.group.scim_path_gid The SCIM path for the GID attribute. | string | |
| nsscache.nsscacheConfig.group.scim_path_groupname The SCIM path for the group name attribute. Used when the SCIM server provides a custom field for group names. If not specified or the path returns no value, nsscache will fall back to using displayName, name, or id from the SCIM group resource. | string | |
| nsscache.nsscacheConfig.group.scim_path_username The SCIM path for the username attribute. | string | |
| nsscache.nsscacheConfig.passwd.ldap_filter The search filter to use when querying. Example: (objectClass=user) | string | |
| nsscache.nsscacheConfig.passwd.ldap_override_home_dir This will override the home directory all users. %%u will be replaced with the username. this should match the mount found in compute.VolumeMounts Example: /mtn/home/%%u | string | |
| nsscache.nsscacheConfig.passwd.scim_default_shell This will be the default shell for all users. | string | |
| nsscache.nsscacheConfig.passwd.scim_override_home_directory Overrides the home directory for all users. %%u is replaced with the username. This should match the mount found in compute.VolumeMounts. Example: /mnt/home/%%u | string | |
| nsscache.nsscacheConfig.passwd.scim_path_gid The SCIM path for the GID attribute. | string | |
| nsscache.nsscacheConfig.passwd.scim_path_home_directory The SCIM path for the home directory attribute. | string | |
| nsscache.nsscacheConfig.passwd.scim_path_login_shell The SCIM path for the login shell attribute. | string | |
| nsscache.nsscacheConfig.passwd.scim_path_uid The SCIM path for the UID attribute. | string | |
| nsscache.nsscacheConfig.passwd.scim_path_username The SCIM path for the username attribute. | string | |
| nsscache.nsscacheConfig.shadow.ldap_filter The search filter to use when querying. Example: (objectClass=user) | string | |
| nsscache.nsscacheConfig.shadow.scim_path_username The SCIM path for the username attribute. | string | |
| nsscache.nsscacheConfig.sshkey.ldap_filter The search filter to use when querying. Example: (objectClass=user) | string | |
| nsscache.nsscacheConfig.sshkey.scim_path_ssh_keys The SCIM path for the SSH keys attribute. | string | |
| nsscache.nsscacheConfig.sshkey.scim_path_username The SCIM path for the username attribute. | string | |
| nsscache.nsswitchConfig Options for defining nsswitch.conf. | object | See the nsswitch.conf documentation. |
| nsscache.nsswitchConfig.aliases Mail aliases, used by getaliasent(3) and related functions. | list | |
| nsscache.nsswitchConfig.ethers Ethernet numbers. | list | |
| nsscache.nsswitchConfig.group Groups of users, used by getgrent(3) and related functions. | list | |
| nsscache.nsswitchConfig.hosts Host names and numbers, used by gethostbyname(3) and related functions. | list | |
| nsscache.nsswitchConfig.initgroups Supplementary group access list, used by getgrouplist(3) function. | list | |
| nsscache.nsswitchConfig.netgroup Network-wide list of hosts and users, used for access rules. C libraries before glibc 2.1 supported netgroups only over NIS. | list | |
| nsscache.nsswitchConfig.networks Network names and numbers, used by getnetent(3) and related functions. | list | |
| nsscache.nsswitchConfig.passwd User passwords, used by getpwent(3) and related functions. | list | |
| nsscache.nsswitchConfig.protocols Network protocols, used by getprotoent(3) and related functions. | list | |
| nsscache.nsswitchConfig.publickey Public and secret keys for Secure_RPC used by NFS and NIS+. | list | |
| nsscache.nsswitchConfig.rpc Remote procedure call names and numbers, used by getrpcbyname(3) and related functions. | list | |
| nsscache.nsswitchConfig.services Network services, used by getservent(3) and related functions. | list | |
| nsscache.nsswitchConfig.shadow Shadow user passwords, used by getspnam(3) and related functions. | list | |
| nsscache.priorityClassName The priority class name for the nsscache Job pod. | string | |
| nsscache.resources Resources for the nsscache Job container. | object | |
| nsscache.slurmUserProvisioning.defaultSlurmAccount The default Slurm account for automated provisioning of users. | string | |
| nsscache.slurmUserProvisioning.dryRun Enable dry run mode - shows what would be done without making changes. | bool | |
| nsscache.slurmUserProvisioning.enabled Enable slurmUserProvisioning. | bool | |
| nsscache.slurmUserProvisioning.interval The interval in seconds between user sync runs. | int | |
| nsscache.sudoGroups List of Unix groups with sudo privileges. | list | |
| nsscache.tolerations The tolerations for the nsscache Job. | list | |
| rest.annotations Additional annotations for REST API resources. | object | |
| rest.args Additional arguments to pass to the rest container. The defaults enable debug logging and load only the most recent OpenAPI plugins. | list | |
| rest.containers Additional sidecar containers to add to the restd pod. | list | |
| rest.enabled Enable the REST API deployment. This is optional and should be disabled for most use cases. | bool | |
| rest.env The additional environment variables to pass to the rest container. | list | |
| rest.image The image to use for the REST API deployment. | object | |
| rest.labels Additional labels for REST API resources. | object | |
| rest.livenessProbe The liveness probe for the rest container. | object | |
| rest.priorityClassName The priority class name for the rest pod. | string | |
| rest.readinessProbe The readiness probe for the rest container. | object | |
| rest.replicas The number of replicas of the rest pod to run. In most production environments this should be set to a minimum of 2 to provide HA. | int | |
| rest.resources Resources for the slurmrestd container. These defaults are appropriate for small and medium-sized clusters. | object | |
| rest.securityContext.runAsGroup The group to run as. The GID must exist in the container image. | int | |
| rest.securityContext.runAsUser The user to run as. The UID must exist in the container image. | int | |
| rest.service.additionalPorts Additional port definitions to expose. | list | |
| rest.service.annotations Additional annotations to apply to rest service. | object | |
| rest.service.clusterIP | string | |
| rest.service.enabled Enable the creation of service for rest pods. | bool | |
| rest.service.externalTrafficPolicy The external traffic policy. | string | |
| rest.service.labels Additional labels to apply to the rest service. | object | |
| rest.service.loadBalancerClass The load balancer class to use for the rest services. | string | |
| rest.service.type The type of service to create. This defaults to ClusterIP. | string | |
| rest.startupProbe The startup probe for the rest container. | object | |
| rest.terminationGracePeriodSeconds The termination grace period for the rest pod. | int | |
| rest.volumeMounts Additional volume mounts to apply to the rest pod. | list | |
| rest.volumes Additional volumes to add to the restd pod. | list | |
| scheduler.annotations Additional annotations for scheduler resources. | object | |
| scheduler.config.scheduler.gpuTypes Mapping of Kubernetes GPU types to Slurm GPU types. Each key is a GPU type that a pod requires during scheduling through node affinity on the gpu.nvidia.com/class key; each value is the corresponding GRES GPU type in Slurm. The mapped type is added to the job’s description. | map | |
| scheduler.config.scheduler.pollInterval The polling interval for the Slurm API. | string | |
| scheduler.config.scheduler.terminationOffset The offset applied to the termination grace period to account for communication delays and similar latency. | string | |
| scheduler.config.slurm.poolSize The number of connections to be maintained in the connection pool. | int | |
| scheduler.config.slurm.protocolVersion The protocol version to use for communication with the Slurm controller. | string | |
| scheduler.config.slurm.usePersistentConnection Use Slurm’s persistent connections for connection reuse. | bool | |
| scheduler.controllerAddress The address of the Slurm controller to connect to. This should be the service address of the controller in host:port format. | string | |
| scheduler.enabled Enable the scheduler. To schedule k8s pods on the Slurm cluster nodes, this must be enabled. | bool | |
| scheduler.hooksAPI Configuration for the webhooks. | object | |
| scheduler.hooksAPI.waitForPodDeletionInterval The polling interval when checking for pod deletion. | string | |
| scheduler.image The image to use for the scheduler. | object | |
| scheduler.labels Additional labels for scheduler resources. | object | |
| scheduler.livenessProbe The liveness probe for the scheduler container. | object | |
| scheduler.logLevel The log level. Uses integers or zap log level strings. | string | |
| scheduler.maxConcurrentReconciles The maximum number of concurrent reconciles. Adjust this based on the volume of pods using the scheduler so that bursts of operations are handled quickly. The sizes of the Slurm and Kubernetes clusters also affect this value, though less than they affect the syncer. The main driving factor tends to be the pod volume and the associated Slurm jobs. Using the same value as syncer.maxConcurrentReconciles is a conservative starting point in many use cases. | int | |
| scheduler.name The name of the scheduler, used to select the scheduler during pod creation. When not set, the name defaults to <namespace>-<release>-scheduler, based on the namespace and release name. | string | |
| scheduler.priorityClassName The priority class name for the scheduler pod. | string | |
| scheduler.readinessProbe The readiness probe for the scheduler container. | object | |
| scheduler.resources Resources for the scheduler container. | object | |
| scheduler.scope.namespaces The list of namespaces to scope the scheduler to. Only used when scope.type is set to namespace. Namespaces other than the release namespace require role bindings to be created. | list | |
| scheduler.scope.type The type can be cluster or namespace. | string | |
| scheduler.startupProbe The startup probe for the scheduler container. | object | |
| secretJob.annotations Additional annotations for secret Job resources. | object | |
| secretJob.labels Additional labels for secret Job resources. | object | |
| secretJob.nodeSelector.affinity The affinity for the secret job. This overrides the value of global.nodeSelector.affinity. | object | |
| secretJob.priorityClassName The priority class name for the secret job pod. | string | |
| secretJob.resources Resources for the secret job container. | object | |
| secretJob.tolerations The tolerations for the secret job. | list | |
| slurm-login Configure individual login nodes through the slurm-login subchart. See the subchart documentation for all parameters. | object | See default values in slurm-login subchart. |
| slurmConfig.AccountingStorageEnforce Controls what level of association-based enforcement to impose on job submissions. Multiple values are allowed. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all to impose everything except nojobs and nosteps, which must be requested separately. See the Slurm documentation for more details. | list | |
| slurmConfig.AccountingStorageTRES | string | |
| slurmConfig.AccountingStorageType | string | |
| slurmConfig.AuthAltParameters[0] | string | |
| slurmConfig.AuthAltTypes[0] | string | |
| slurmConfig.BatchStartTimeout | int | |
| slurmConfig.CommunicationParameters The list of communication parameters to pass to slurmctld. See the Slurm documentation for possible values. | list | |
| slurmConfig.DebugFlags Comma-separated debug flags for logging. | string | |
| slurmConfig.DefMemPerCPU The default memory per CPU, in megabytes. Sets the slurm.conf default for the real memory available per usable allocated CPU. This value is used when the --mem-per-cpu option is not specified on the srun command line. | int | |
| slurmConfig.Epilog | string | |
| slurmConfig.GresTypes[0] | string | |
| slurmConfig.InactiveLimit Terminate job allocation commands, such as srun or salloc, that are unresponsive longer than this interval in seconds. See the slurm.conf reference for more details. | int | |
| slurmConfig.JobAcctGatherFrequency | int | |
| slurmConfig.JobAcctGatherType | string | |
| slurmConfig.JobCompType | string | |
| slurmConfig.JobSubmitPlugins The job submit plugins to use. | list | |
| slurmConfig.KillWait The interval in seconds between the SIGTERM and SIGKILL signals given to a job’s processes upon reaching its time limit. See the slurm.conf reference for more details. | int | |
| slurmConfig.MaxNodeCount | int | |
| slurmConfig.MessageTimeout | int | |
| slurmConfig.MinJobAge | int | |
| slurmConfig.MpiDefault | string | |
| slurmConfig.ProctrackType The plugin to be used for process tracking on a job step basis. See the Slurm documentation for more details and the list of valid values. | string | |
| slurmConfig.Prolog | string | |
| slurmConfig.PrologFlags[0] | string | |
| slurmConfig.PrologFlags[1] | string | |
| slurmConfig.RebootProgram | string | |
| slurmConfig.ReturnToService | int | |
| slurmConfig.SUNKJobDashboardURL | string | |
| slurmConfig.SUNKNodeDashboardURL | string | |
| slurmConfig.SchedulerParameters[0] | string | |
| slurmConfig.SchedulerParameters[1] | string | |
| slurmConfig.SchedulerType | string | |
| slurmConfig.SelectType | string | |
| slurmConfig.SelectTypeParameters The values to use for the parameters of the select/cons_tres plugin. Allowed values depend on the configured value of SelectType. See the slurm.conf reference for more details. | string | |
| slurmConfig.SlurmSchedLogFile | string | |
| slurmConfig.SlurmSchedLogLevel | int | |
| slurmConfig.SlurmUser | string | |
| slurmConfig.SlurmctldDebug | string | |
| slurmConfig.SlurmctldLogFile | string | |
| slurmConfig.SlurmctldParameters The list of additional parameters to pass to slurmctld. See the Slurm documentation for possible values. | list | |
| slurmConfig.SlurmctldPidFile | string | |
| slurmConfig.SlurmctldPort | int | |
| slurmConfig.SlurmctldTimeout The interval, in seconds, that the backup controller waits for the primary controller to respond before assuming control. The default value is 120 seconds. May not exceed 65533. | int | |
| slurmConfig.SlurmdDebug | string | |
| slurmConfig.SlurmdLogFile | string | |
| slurmConfig.SlurmdPidFile | string | |
| slurmConfig.SlurmdPort | int | |
| slurmConfig.SlurmdSpoolDir | string | |
| slurmConfig.SlurmdTimeout The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node’s state to DOWN. | int | |
| slurmConfig.StateSaveLocation | string | |
| slurmConfig.SuspendTime Nodes which remain idle or down for this number of seconds will be placed into power save mode by SuspendProgram. | string | |
| slurmConfig.SwitchType | string | |
| slurmConfig.TCPTimeout | int | |
| slurmConfig.TaskPlugin The task plugin to use. Multiple comma-separated values are allowed. See the Slurm documentation for more details and the list of valid values. | string | |
| slurmConfig.TaskPluginParam Optional parameters for the task plugin. See the Slurm documentation for more details. | string | |
| slurmConfig.TopologyParam[0] | string | |
| slurmConfig.TopologyPlugin | string | |
| slurmConfig.TreeWidth | int | |
| slurmConfig.UnkillableStepProgram | string | |
| slurmConfig.UnkillableStepTimeout | int | |
| slurmConfig.WaitTime Specifies how many seconds the srun command should wait after the first task terminates before terminating all remaining tasks. The --wait option on the srun command line overrides this value. The default value is 0, which disables this feature. See the slurm.conf reference for more details. | int | |
| slurmConfig.cgroupConfig The cgroup.conf configuration. Only used when ProctrackType is set to proctrack/cgroup. Note: on systems using cgroup v2, use cgroup/v2 rather than autodetect. | object | |
| sssdContainer.enabled Enable the sssd sidecar container. | bool | |
| sssdContainer.livenessProbe The liveness probe for the sssd container. | object | |
| sssdContainer.readinessProbe The readiness probe for the sssd container. | object | |
| sssdContainer.startupProbe The startup probe for the sssd container. | object | |
| syncer.annotations Additional annotations for syncer resources. | object | |
| syncer.config.slurm.poolSize The number of connections to be maintained in the connection pool. | int | |
| syncer.config.slurm.protocolVersion The protocol version to use for communication with the Slurm controller. | string | |
| syncer.config.slurm.usePersistentConnection Use Slurm’s persistent connections for connection reuse. | bool | |
| syncer.config.syncer.nodesetUpdateJobPreemption Configuration for job preemption support. More details can be found in the changelog. | object | |
| syncer.config.syncer.nodesetUpdateJobPreemption.enabled Enable job preemption support. | bool | |
| syncer.config.syncer.nodesetUpdateJobPreemption.method The job preemption strategy to use during rolling upgrades. | string | |
| syncer.config.syncer.orphanedPodDelay The delay to wait before deleting a pod that is no longer associated with a Slurm node. | string | |
| syncer.config.syncer.pollInterval The polling interval for the Slurm API. | string | |
| syncer.config.syncer.qosInterruptable The externally defined label that indicates whether a pod is interruptible. | string | |
| syncer.config.syncer.reconfigureRateLimit The rate limit, in seconds, for Slurm reconfigure requests based on additions to NodeSlices. The value must be above 0 seconds to enable this feature. Warning: if this value is too low, scontrol reconfigure may be executed too often, especially during periods when several nodes are newly added. | string | |
| syncer.config.syncer.slurmNodeCleanUp Removes lingering Slurm nodes from the cluster after they have been removed from their associated SUNK NodeSets. | bool | |
| syncer.controllerAddress The address of the Slurm controller to connect to. This should be the service address of the controller in host:port format. | string | |
| syncer.enabled Enable the syncer. This is required for most functionality and should only be disabled for troubleshooting. | bool | |
| syncer.hooksAPI Configuration for the webhooks. | object | |
| syncer.hooksAPI.nodeRebootCondition Condition indicating that a node should be rebooted. | string | |
| syncer.hooksAPI.nodeRebootReason The target NLCC lifecycle state associated with the nodeRebootCondition. | string | |
| syncer.hooksAPI.safeNodeRebootCondition Condition indicating that a node should be rebooted safely. | string | |
| syncer.hooksAPI.safeNodeRebootReason The target NLCC lifecycle state associated with the safeNodeRebootCondition. | string | |
| syncer.hooksAPI.waitForNodeLockedInterval The polling interval when checking for node locked state. | string | |
| syncer.hooksAPI.waitForNodeLockedTimeout The timeout for checking node locked state. | string | |
| syncer.image The image to use for the syncer. | object | |
| syncer.labels Additional labels for syncer resources. | object | |
| syncer.livenessProbe The liveness probe for the syncer container. | object | |
| syncer.logLevel The log level. Uses integers or zap log level strings. | string | |
| syncer.maxConcurrentReconciles The maximum number of concurrent reconciles. Adjust this based on the number of nodes and the size of jobs launched in the Slurm cluster so that bursts of operations are handled quickly. A value of one-tenth the number of nodes in the cluster is a good starting point for small clusters. As cluster size increases, this value can usually be a smaller fraction of the total number of nodes; for instance, a value of 50 appears to handle a 2,000-node cluster well. Setting this too high creates bottlenecks in other components, such as the Kubernetes API server and the Slurm controller, which in some cases may cause errors. | int | |
| syncer.nodePermissions.enabled Enable node operations on the syncer. Currently, this allows the syncer to restart nodes. | bool | |
| syncer.priorityClassName The priority class name for the syncer pod. | string | |
| syncer.readinessProbe The readiness probe for the syncer container. | object | |
| syncer.resources Resources for the syncer container. | object | |
| syncer.startupProbe The startup probe for the syncer container. | object | |
| syncer.watchAllNodeSets Watch all NodeSets in the namespace. This overrides default behavior of only watching the NodeSets deployed with this chart release. | bool | |
| syncer.watchNodeSets The list of NodeSets to watch. This overrides the default behavior of watching the NodeSets deployed with this chart release to instead watch this specific list. This is not used if watchAllNodeSets is set to true. | list | |
| userLookupContainer.livenessProbe The liveness probe for the user-lookup container. | object | |
| userLookupContainer.readinessProbe The readiness probe for the user-lookup container. | object | |
| userLookupContainer.startupProbe The startup probe for the user-lookup container. | object | |
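
The scim_users_parameters and scim_groups_parameters entries note that special characters are URL-encoded automatically. As an illustration only, the Python sketch below shows one way a raw parameter string from those examples could be split and percent-encoded with the standard library. The helper name encode_scim_params is hypothetical, and nsscache's actual encoding may differ (for example, in whether commas are escaped).

```python
from urllib.parse import parse_qsl, quote, urlencode


def encode_scim_params(raw: str) -> str:
    """Percent-encode a raw SCIM query string (hypothetical helper, not nsscache's code).

    Splits the string on '&' and '=', then re-encodes every key and value.
    """
    pairs = parse_qsl(raw, keep_blank_values=True)
    # quote (rather than the default quote_plus) encodes spaces as %20.
    return urlencode(pairs, quote_via=quote)


if __name__ == "__main__":
    users = 'filter=active eq "true"&groups=slurm-users,slurm-admins'
    groups = "excludeInactiveUsers=true&includeVirtualUserGroups=slurm-users,slurm-admins"
    print(encode_scim_params(users))
    # filter=active%20eq%20%22true%22&groups=slurm-users%2Cslurm-admins
    print(encode_scim_params(groups))
    # excludeInactiveUsers=true&includeVirtualUserGroups=slurm-users%2Cslurm-admins
```

Passing quote_via=quote encodes spaces as %20; urlencode's default, quote_plus, would encode them as + instead.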
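
The ldap_override_home_dir and scim_override_home_directory entries write %%u rather than %u. A likely reason, assumed here rather than confirmed by this reference, is that nsscache reads its configuration as an INI file through Python's configparser, whose default interpolation treats %% as an escaped literal %. The sketch below demonstrates that standard-library behavior; the [passwd] section layout and the resolve_home_dir and single_percent_fails helpers are hypothetical.

```python
import configparser

# Hypothetical INI fragment mirroring the chart example /mnt/home/%%u.
CONFIG_TEXT = """
[passwd]
scim_override_home_directory = /mnt/home/%%u
"""


def resolve_home_dir(config_text: str, username: str) -> str:
    """Return the home directory for username (illustrative sketch only)."""
    # ConfigParser uses BasicInterpolation by default, so "%%" is read back as "%".
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    template = parser.get("passwd", "scim_override_home_directory")  # "/mnt/home/%u"
    # Substitute the username for the %u placeholder.
    return template.replace("%u", username)


def single_percent_fails() -> bool:
    """Show that a bare %u is rejected by the default interpolation on read."""
    parser = configparser.ConfigParser()
    parser.read_string("[passwd]\nscim_override_home_directory = /mnt/home/%u\n")
    try:
        parser.get("passwd", "scim_override_home_directory")
    except configparser.InterpolationSyntaxError:
        return True
    return False


if __name__ == "__main__":
    print(resolve_home_dir(CONFIG_TEXT, "alice"))  # /mnt/home/alice
    print(single_percent_fails())  # True
```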
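
The scheduler.name entry describes the default name, <namespace>-<release>-scheduler, that pods use to select this scheduler. In Kubernetes, a pod chooses its scheduler through the spec.schedulerName field. The Python sketch below derives the default name and builds a minimal Pod manifest as a plain dictionary. It assumes scheduler.name is left unset; the namespace, release, pod, container, and image names are hypothetical placeholders.

```python
import json


def default_scheduler_name(namespace: str, release: str) -> str:
    """Default used when scheduler.name is unset: <namespace>-<release>-scheduler."""
    return f"{namespace}-{release}-scheduler"


def pod_manifest(namespace: str, release: str) -> dict:
    """Minimal Pod manifest that opts into the chart's scheduler (hypothetical names)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "example-pod", "namespace": namespace},
        "spec": {
            # Route this pod to the chart's scheduler instead of the default scheduler.
            "schedulerName": default_scheduler_name(namespace, release),
            "containers": [
                {"name": "main", "image": "busybox", "command": ["sleep", "3600"]}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(pod_manifest("tenant-slurm", "slurm"), indent=2))
```

When scheduler.scope.type is namespace, the pod's namespace presumably also needs to appear in scheduler.scope.namespaces for the scheduler to act on it.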