Define Compute Nodes
Define the resources used to run Slurm jobs
Slurm Compute nodes are the workhorses of the cluster: they define and manage the hardware resources used to run jobs submitted to Slurm.
Slurm Login nodes allow you to access your Slurm cluster, while Slurm Compute nodes perform the actual work of running a Slurm job.
With SUNK, you can create flexible Compute node definitions to meet the specific resource requirements of your workloads. This guide walks you through the various methods of defining Compute nodes that are optimized for the performance and efficiency of the desired jobs.
In SUNK, Slurm nodes run in Kubernetes Pods. These are not the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between the two, Kubernetes Nodes are capitalized in this documentation, while Slurm nodes are not.
Access Slurm Compute nodes
After accessing your Slurm cluster through the Slurm Login node, you can interact with the Slurm Compute nodes using standard Slurm commands. It is not necessary to directly access a Slurm Compute node.
Use Slurm commands, such as `srun`, `sbatch`, or `salloc`, to run and manage jobs on Slurm Compute nodes.
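For example, from the Login node you might run a quick command across Compute nodes or queue a batch script; the script name below is a placeholder:

```bash
# Run a one-off command on two Compute nodes through Slurm.
$ srun -N 2 hostname

# Queue a batch script for Slurm to schedule on Compute nodes.
$ sbatch my-job.sh
```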
We do not recommend directly accessing Slurm Compute nodes through SSH to run tasks. Bypassing Slurm can interfere with currently running jobs and may cause nodes to drain unintentionally, leading to temporary loss of resources. SSH to Slurm Compute nodes should only be used for debugging existing jobs on the nodes.
The manifest
The foundation for defining Compute nodes is a YAML manifest, which outlines the resources and configurations for each node type. An example of the `compute:` section might look like this:
```yaml
compute:
  # See "Global Compute node options" below to learn more.
  volumeMounts: []
  volumes: []
  s6: {}
  pyxis:
  partitions:
  # Node definitions. Multiple node definitions are allowed, but
  # only those with `enabled: true` will be deployed.
  nodes:
    # The node definition name.
    reservation-id:
      # Defines a node with affinity for a specific reservation ID.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - <my-reservation-id>
    # Another node definition.
    my-node-def:
      enabled: true
      replicas: 1
      definitions:
        # Use the `reservation-id` defined above.
        - reservation-id
      staticFeatures:
        - foo
        - bar
      dynamicFeatures:
        node.coreweave.cloud/class: {}
        gpu.nvidia.com/class: {}
      image:
        repository: registry.gitlab.com/example
      env:
        - name: example
          value: "1"
      gresGpu: h100:8
      config:
        weight: 1
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
          rdma/ib: "1"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
```
Global options
These are the global options shown in the YAML example above.
compute.volumeMounts
Declares a list of additional volumes to mount within the primary container of the node, in addition to the chart's `global.volumeMounts`.
For example:
```yaml
compute:
  volumeMounts:
    - name: my-pvc
      path: /mnt/my-pvc
```
- Entries which share the same `path` as a globally defined mount will override that mount.
- These volumeMounts are also added to the Login node primary container.
compute.volumes
Declares a list of additional volumes to attach to the Pod for the Compute node. If using PersistentVolumeClaims, the ReadWriteMany access mode should be used in most cases.
For example:
```yaml
compute:
  volumes:
    - name: my-pvc
      persistentVolumeClaim:
        claimName: my-pvc
```
- Entries which share the same `name` as a globally defined volume will override that volume.
- These volumes are also added to the Login node Pods.
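For instance, a PersistentVolumeClaim can be declared under `compute.volumes` and mounted into the primary container with a matching `compute.volumeMounts` entry. A minimal sketch; the claim name and mount path are placeholders:

```yaml
compute:
  volumes:
    # The PVC should normally use the ReadWriteMany access mode.
    - name: shared-data
      persistentVolumeClaim:
        claimName: shared-data
  volumeMounts:
    # The name matches the volume above; the path is a placeholder.
    - name: shared-data
      path: /mnt/shared-data
```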
compute.s6
SUNK can run custom s6 scripts on Compute nodes, either as `oneshot` or `longrun` jobs.
For example:
```yaml
compute:
  s6:
    packages:
      type: oneshot
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        apt -y install nginx
    nginx:
      type: longrun
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        nginx -g "daemon off;"
```
compute.pyxis
This section has multiple options:
Parameter | Purpose |
---|---|
compute.pyxis.enabled | Enables the pyxis container. |
compute.pyxis.mountHome | Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory. |
compute.pyxis.remapRoot | Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user. |
compute.pyxis.securityContext.capabilities.add | Adds capabilities to the pyxis container. "SYS_ADMIN" is required if using Pyxis. |
For example:
```yaml
compute:
  pyxis:
    enabled: true
    mountHome: true
    remapRoot: true
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
```
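With Pyxis enabled, jobs can run inside container images through the standard Pyxis flags to `srun`; the image in this sketch is chosen arbitrarily:

```bash
# Run a command inside a container image via Pyxis/enroot.
$ srun --container-image=ubuntu:22.04 cat /etc/os-release
```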
compute.partitions
A Slurm partition is a logical grouping of Compute nodes (servers) within the Slurm cluster. It's a way to organize nodes based on their characteristics, such as memory size, CPU type, or GPU availability.
When a user submits a job to a Slurm-managed HPC cluster, they specify the partition where the job should run. The Slurm scheduler then assigns the job to an available node within that partition. Partitions can have different configurations and policies, such as time limits for jobs, user access restrictions, or priority levels.
A related option is `compute.autoPartition.enabled`, which, if `true` (the default), creates a partition within Slurm for each NodeSet defined in `compute.nodes`. The partition name matches the name of the corresponding `nodes` entry.
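Because each auto-created partition is named after its NodeSet, jobs can target the `my-node-def` nodes from the earlier example with the standard partition flag; the batch script name below is a placeholder:

```bash
# Submit a batch job to the partition created for the my-node-def NodeSet.
$ sbatch --partition=my-node-def my-job.sh

# Or start an interactive shell on a node in that partition.
$ srun --partition=my-node-def --pty bash
```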
Other global options
In addition to the options shown in the `compute` example above, several others apply globally to all Compute nodes.
Parameter | Purpose |
---|---|
compute.generateTopology | If true, generate the network topology. |
compute.initialState and compute.initialStateReason | The initial state for the nodes when they join the Slurm cluster, generally drain or idle, and the reason for setting that state. These can also be applied as node-specific options. |
compute.maxUnavailable | Sets the maximum number of Compute nodes that can be unavailable during a rolling update. Can be a percentage or a number. |
compute.ssh.enabled | When enabled, the Slurm Compute nodes have SSH available. |
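These options sit directly under `compute:` in the values file. A minimal sketch, with illustrative values only:

```yaml
compute:
  generateTopology: true
  # Join new nodes drained until they are verified (illustrative choice).
  initialState: drain
  initialStateReason: "pending verification"
  # Allow up to a quarter of the Compute nodes to be unavailable during a rolling update.
  maxUnavailable: "25%"
  ssh:
    enabled: true
```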
Node-specific options
Many options are available for each named node definition. For reference, see `my-node-def` in the YAML above, which shows many of the available options.
- `node.enabled`: If `true`, Compute nodes are deployed with this definition. Multiple definitions can be declared, but only those with `enabled: true` will be deployed.
- `node.replicas`: Specifies the desired number of Slurm nodes (Kubernetes Pods) of this type that the NodeSet will attempt to create. This is a maximum value, because the number of desired Pods can be greater than the number of available Pods. To change the number of replicas for a running Slurm cluster, use:

  ```bash
  $ kubectl scale nodeset <nodeset-name> --replicas=N
  ```

- `node.definitions`: A list of other node definitions to include in this definition. See Custom node definitions below to learn how to create custom definitions.
- `node.staticFeatures`: Static Slurm node feature flags. Feature flags are strings that Slurm adds to the Slurm nodes, where they are available for use when scheduling Slurm jobs. For example, to schedule a job only on nodes with the feature `really-fast`:

  ```bash
  $ srun -C really-fast hostname
  ```

  Here's an example of how the features look within Slurm:

  ```
  NodeName=h100-092-02 Arch=x86_64 CoresPerSocket=32
  CPUAlloc=110 CPUEfctv=128 CPUTot=128 CPULoad=0.56
  AvailableFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
  ActiveFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
  ```

- `node.dynamicFeatures`: Dynamic Slurm node features derived from Kubernetes Node labels. This specifies a map of labels that should be used as additional feature flags within Slurm. Note: the value for each map key is `{}`, as there is no further configuration at this time.
- `node.image`: Specifies the Docker image repository used to pull this node's image.
- `node.env`: Sets extra environment variables to be exposed in the Compute nodes.
- `node.gresGpu`: Sets the Slurm Generic Resource Scheduling (GRES) value for the `gpu` GresType. This describes the type and number of GPU Generic Resources for this Slurm node type.
- `node.config`: Adds additional configuration options to the slurmd startup used during dynamic node registration. The features and gres options are already set. See Node Parameters and Node Configuration for more details on the options and values.
- `node.resources`: Sets the Kubernetes compute resource limits and requests.
- `node.affinity`: Sets the Kubernetes Node affinities, for example to ensure that the node is scheduled on a machine with a specific GPU model.
- `node.initialState` and `node.initialStateReason`: The initial state for the nodes when they join the Slurm cluster, generally `drain` or `idle`, and the reason for setting that state. These can also be applied as global options for the cluster.
- `node.volumeMounts`: Additional per-node-definition volumeMounts to add to the primary container, in the same format as `compute.volumeMounts`. Mounts that match on `path` override those set at the higher level.
- `node.volumes`: Additional per-node-definition volumes to add to the Pod, in the same format as `compute.volumes`. Volumes that match on `name` override those set at the higher level.
- `node.containers`: Additional per-node containers (such as sidecars) to add to the Pod. Additional configuration for these containers (such as Secrets and ConfigMaps) must be in the Slurm namespace. See the sketch after this list for an example.
- `node.dnsPolicy`: Adjusts the dnsPolicy for each node.
- `node.dnsConfig`: Adjusts the dnsConfig for each node.
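As a rough illustration of how several of these per-node options fit together, the sketch below adds a sidecar container and an initial drain state to a node definition. The sidecar name, its image, and the state reason are hypothetical placeholders, not defaults of the chart:

```yaml
compute:
  nodes:
    my-node-def:
      enabled: true
      replicas: 2
      # Join the cluster drained until the nodes are verified (illustrative).
      initialState: drain
      initialStateReason: "pending verification"
      dnsPolicy: ClusterFirst
      containers:
        # Hypothetical sidecar; the name and image are placeholders.
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          ports:
            - containerPort: 9100
```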
Custom node definitions
Node definitions can reference other node definitions to include or overlay values. These "layers" can be defined in the same `values.yaml` file, or in separate files.
As shown in the prior example, there is a node definition named `reservation-id`:
```yaml
compute:
  nodes:
    reservation-id:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - <my-reservation-id>
```
That layer is included in the `my-node-def` definition:
```yaml
compute:
  nodes:
    my-node-def:
      definitions:
        - reservation-id
```
You can store custom layers as separate values files. Any key defined under `compute.nodes` can be used, even if that key is defined in another file, by specifying multiple values files in a defined order on the command line.
For example, consider a `custom-compute-defs-values.yaml` file that only has a `compute.nodes` section with custom layers defined. The `values.yaml` file can use those definitions as long as both values files are used when deploying, like so:
```bash
$ helm install slurm coreweave/slurm -f custom-compute-defs-values.yaml -f values.yaml
```
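As an illustration of this layering, the custom file might hold only reusable layers while the main `values.yaml` references them. Both sketches below reuse the `reservation-id` layer from the earlier example:

```yaml
# custom-compute-defs-values.yaml (layer definitions only)
compute:
  nodes:
    reservation-id:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - <my-reservation-id>
```

```yaml
# values.yaml (references the layer defined above)
compute:
  nodes:
    my-node-def:
      enabled: true
      replicas: 1
      definitions:
        - reservation-id
```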
Mixing CPU and GPU node types
You can mix multiple Slurm node types by defining multiple NodeSets in different blocks under `compute.nodes`. Each NodeSet can have its own resources and affinities that specify a single type of node.
For example, it's possible to create a NodeSet that selects a particular type of GPU, while another selects CPU-only nodes, and then deploy any desired number of each node type.
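As a rough sketch of such a layout (the NodeSet names, replica counts, and CPU resource sizes below are illustrative, and the GPU values mirror the earlier `my-node-def` example):

```yaml
compute:
  nodes:
    # GPU NodeSet, based on the h100 example above.
    gpu-h100:
      enabled: true
      replicas: 4
      gresGpu: h100:8
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
    # CPU-only NodeSet: no gresGpu or accelerator resources are set.
    cpu-general:
      enabled: true
      replicas: 8
      resources:
        limits:
          memory: 128Gi
        requests:
          cpu: "30"
          memory: 128Gi
```

With `compute.autoPartition.enabled` left at its default, this produces separate partitions named after each NodeSet, which jobs can target with the standard `--partition` flag.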