
Define Compute Nodes

Define the resources used to run Slurm jobs

Slurm Compute nodes are the workhorses of the cluster: they define and manage the hardware resources used to run jobs submitted to Slurm.

Slurm Login nodes allow you to access your Slurm cluster, while Slurm Compute nodes perform the actual work of running a Slurm job.

With SUNK, you can create flexible Compute node definitions to meet the specific resource requirements of your workloads. This guide walks you through the various ways to define Compute nodes optimized for the performance and efficiency of your jobs.

Note

In SUNK, Slurm nodes run in Kubernetes Pods. These are not the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between the two, Kubernetes Nodes are capitalized in this documentation, while Slurm nodes are not.

Access Slurm Compute nodes

After accessing your Slurm cluster through the Slurm Login node, you can interact with the Slurm Compute nodes using standard Slurm commands. It is not necessary to directly access a Slurm Compute node.

Use Slurm commands, such as srun, sbatch, or salloc, to run and manage jobs on Slurm Compute nodes.
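For example, a minimal batch script submitted with sbatch might look like the following. The job name, resource values, and commands are illustrative only:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=hello        # name shown in squeue output
#SBATCH --nodes=1               # run on a single Compute node
#SBATCH --time=00:05:00         # wall-clock time limit

# Print the hostname of the Compute node that ran the job.
srun hostname
```

Submit it with sbatch and monitor it with squeue; the job runs on a Compute node without you ever logging into one.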

Warning

We do not recommend directly accessing Slurm Compute nodes through SSH to run tasks. Bypassing Slurm can interfere with currently running jobs and may cause nodes to drain unintentionally, leading to temporary loss of resources. SSH to Slurm Compute nodes should only be used for debugging existing jobs on the nodes.

The manifest

The foundation for defining Compute nodes is a YAML manifest, which outlines the resources and configurations for each node type. An example of the compute: section might look like this:

Example
compute:
  # See "Global Compute node options" below to learn more.
  volumeMounts: []
  volumes: []
  s6: {}
  pyxis:
  partitions:
  # Node definitions. Multiple node definitions are allowed, but
  # only those `enabled: true` will be deployed.
  nodes:
    # The node definition name.
    reservation-id:
      # Defines a node with affinity for a specific reservation ID.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - <my-reservation-id>
    # Another node definition.
    my-node-def:
      enabled: true
      replicas: 1
      definitions:
        # Use the `reservation-id` defined above.
        - reservation-id
      staticFeatures:
        - foo
        - bar
      dynamicFeatures:
        node.coreweave.cloud/class: {}
        gpu.nvidia.com/class: {}
      image:
        repository: registry.gitlab.com/example
      env:
        - name: example
          value: "1"
      gresGpu: h100:8
      config:
        weight: 1
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
          rdma/ib: "1"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"

Global options

These are the global options shown in the YAML example above.

compute.volumeMounts

Declares a list of additional volume mounts for the primary container of the node, in addition to the chart-wide global.volumeMounts.

For example:

Example
compute:
  volumeMounts:
    - name: my-pvc
      path: /mnt/my-pvc
Note
  • Entries which share the same path as a globally defined mount will override the mount.
  • These volumeMounts are also added to the login node primary container.

compute.volumes

Declares a list of additional volumes to attach to the Pod for the Compute node. If using persistent volume claims, the ReadWriteMany access mode should be used in most cases.

For example:

Example
compute:
  volumes:
    - name: my-pvc
      persistentVolumeClaim:
        claimName: my-pvc
Note
  • Entries which share the same name as a globally defined mount will override the volume.
  • These volumes are also added to the login node Pods.

compute.s6

SUNK can run custom s6 scripts on Compute nodes, either as oneshot or longrun jobs.

For example:

Example
compute:
  s6:
    packages:
      type: oneshot
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        apt -y install nginx
    nginx:
      type: longrun
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        nginx -g "daemon off;"

compute.pyxis

This section has multiple options:

  • compute.pyxis.enabled: Enables the pyxis container.
  • compute.pyxis.mountHome: Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory.
  • compute.pyxis.remapRoot: Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user.
  • compute.pyxis.securityContext.capabilities.add: Adds capabilities to the pyxis container. "SYS_ADMIN" is required if using Pyxis.

For example:

Example
compute:
  pyxis:
    enabled: true
    mountHome: true
    remapRoot: true
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]

compute.partitions

A Slurm partition is a logical grouping of Compute nodes (servers) within the Slurm cluster. It's a way to organize nodes based on their characteristics, such as memory size, CPU type, or GPU availability.

When a user submits a job to a Slurm-managed HPC cluster, they specify the partition where the job should run. The Slurm scheduler then assigns the job to an available node within that partition. Partitions can have different configurations and policies, such as time limits for jobs, user access restrictions, or priority levels.

A related option is compute.autoPartition.enabled, which, if true (the default), creates a partition within Slurm for each NodeSet defined in compute.nodes. Each partition's name matches the name of its node definition.
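If you prefer to manage partitions yourself, you can turn off automatic partition creation. A minimal sketch:

```yaml
compute:
  # Disable the per-NodeSet partitions that SUNK creates by default.
  autoPartition:
    enabled: false
```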

Other global options

In addition to the options shown in the compute example above, several others apply globally to all Compute nodes.

  • compute.generateTopology: If true, generates the network topology.
  • compute.initialState and compute.initialStateReason: The initial state for the nodes when they join the Slurm cluster, generally drain or idle, and the reason for setting that state. These can also be applied as node-specific options.
  • compute.maxUnavailable: Sets the maximum number of Compute nodes that can be unavailable during a rolling update. Can be a percentage or a number.
  • compute.ssh.enabled: When enabled, the Slurm Compute nodes have SSH available.

Node-specific options

Many options are available for each named node definition. For reference, see my-node-def in the YAML above which shows many of the available options.

  • node.enabled: If true, Compute nodes are deployed with this definition. Multiple definitions can be declared, but only those with enabled: true are deployed.
  • node.replicas: Specifies the desired number of Slurm nodes (Kubernetes Pods) of this type that the NodeSet will attempt to create. This is a maximum value, because the number of desired Pods can be greater than the number of available Pods. To change the number of replicas for a running Slurm cluster, use:
Example
$ kubectl scale nodeset <nodeset-name> --replicas=N
  • node.definitions: A list of other node definitions to include in this definition. See Custom Node Definitions below to learn how to create custom definitions.
  • node.staticFeatures: Static Slurm node feature flags. Feature flags are strings that Slurm will add to the Slurm nodes, where they are available for use when scheduling Slurm jobs. For example, to schedule a job only on nodes with the feature really-fast:
Example
$ srun -C really-fast hostname

Here's an example of how it looks within Slurm.

Example
NodeName=h100-092-02 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=110 CPUEfctv=128 CPUTot=128 CPULoad=0.56
   AvailableFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
   ActiveFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
  • node.dynamicFeatures: Dynamic Slurm node features from Kubernetes' Node labels. This specifies a map of labels that should be used as additional feature flags within Slurm. Note: the value for each map key is {} as there is no further configuration at this time.
  • node.image: Specifies the Docker image repository used to pull this node's image.
  • node.env: Sets extra environment variables to be exposed in the Compute nodes.
  • node.gresGpu: Sets the Slurm Generic Resource Scheduling value for the gpu GresType. This describes the type and number of GPU Generic Resources for this Slurm node type.
  • node.config: Adds additional config options to the slurmd startup command used during dynamic node registration. The features and gres options are already set. See Node Parameters and Node Configuration for more details on the options and values.
  • node.resources: Sets the Kubernetes Compute resource limits and requests.
  • node.affinity: Sets Kubernetes Node affinities, for example to ensure the node is scheduled on a Kubernetes Node with a specific GPU model.
  • node.initialState and node.initialStateReason: The initial State for the nodes when they join the Slurm cluster, generally drain or idle, and the reason for setting that state. These can also be applied as a general option for the cluster.
  • node.volumeMounts: Additional per node definition volumeMounts to add to the primary container, same format as compute.volumeMounts. Mounts that match on path will override those set at the higher level.
  • node.volumes: Additional per node definition volumes to add to the Pod, same format as compute.volumes. Volumes that match on name will override those set at the higher level.
  • node.containers: Additional per node containers (e.g. sidecars) to add to the Pod. Additional configuration for these containers (such as Secrets and ConfigMaps) must be in the Slurm namespace.
  • node.dnsPolicy: Adjust the dnsPolicy for each node.
  • node.dnsConfig: Adjust the dnsConfig for each node.
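As a worked example of how node.gresGpu surfaces in scheduling: a node defined with gresGpu: h100:8 advertises eight h100 GPUs as gpu Generic Resources, which a job can then request with Slurm's standard --gres flag. The command below is illustrative:

```shell
# Request all eight H100 GPUs on a single node and list them.
srun --gres=gpu:h100:8 --nodes=1 nvidia-smi -L
```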

Custom node definitions

Node definitions can reference other node definitions to include or overlay values. These "layers" can be defined in the same values.yaml file, or in separate files.

As shown in the prior example, there is a node definition named reservation-id:

Example
compute:
  nodes:
    reservation-id:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - <my-reservation-id>

That layer is included in the my-node-def definition.

Example
compute:
  nodes:
    my-node-def:
      definitions:
        - reservation-id

You can store custom layers in separate values files. Any key defined under compute.nodes can be referenced, even if it is defined in a different file, by specifying multiple values files in a defined order on the command line.

For example, consider a custom-compute-defs-values.yaml file that only has a compute.nodes section with custom layers defined. The values.yaml file can use those definitions as long as both value files are used when deploying, like so:

Example
$ helm install slurm coreweave/slurm -f custom-compute-defs-values.yaml -f values.yaml

Mixing CPU and GPU node types

You can mix multiple Slurm node types by defining multiple NodeSets in different blocks under compute.nodes. Each NodeSet can have its own resources and affinities that specify a single type of node.

For example, it's possible to create a NodeSet that selects a particular type of GPU, while another selects CPU-only nodes, and then deploy any desired number of each node type.
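A sketch of such a layout is shown below: one NodeSet selects GPU nodes and declares gresGpu, while the other selects CPU-only nodes. The label values (H100, cpu) and resource figures are hypothetical; substitute the labels and sizes that apply to your cluster:

```yaml
compute:
  nodes:
    gpu-workers:
      enabled: true
      replicas: 4
      gresGpu: h100:8
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class   # hypothetical label value
                    operator: In
                    values:
                      - H100
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
    cpu-workers:
      enabled: true
      replicas: 2
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/class   # hypothetical label value
                    operator: In
                    values:
                      - cpu
      resources:
        limits:
          memory: 128Gi
        requests:
          cpu: "32"
          memory: 128Gi
```

With compute.autoPartition.enabled left at its default, this produces gpu-workers and cpu-workers partitions, so users can target either node type with srun -p or sbatch -p.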