Slurm login nodes let you access your Slurm cluster. Slurm compute nodes are the nodes within the cluster where jobs run, and they handle the resources used to run jobs submitted to Slurm. With SUNK, you can create flexible compute node definitions to meet the resource requirements of your workloads. This guide describes the methods for defining compute nodes, so you can tailor each NodeSet to the hardware and scheduling needs of your jobs.
In SUNK, Slurm nodes run in Kubernetes Pods. These aren’t the same as Kubernetes Nodes, which are the worker machines that run the Pods. To distinguish between the two, this documentation capitalizes Kubernetes Nodes, while Slurm nodes aren’t capitalized.

Access Slurm compute nodes

After you access your Slurm cluster through the Slurm login node, you can interact with the Slurm compute nodes with standard Slurm commands. You don’t need to directly access a Slurm compute node. Use Slurm commands, such as srun, sbatch, or salloc, to run and manage jobs on Slurm compute nodes.
Avoid directly accessing Slurm compute nodes through SSH to run tasks. Bypassing Slurm can interfere with currently running jobs and may cause nodes to drain unintentionally, leading to temporary loss of resources. Use SSH to Slurm compute nodes only to debug existing jobs on the nodes.

The manifest

The foundation for defining compute nodes is a YAML manifest, which outlines the resources and configurations for each node type. The sections that follow reference the fields shown in this example, so use it as a map for the rest of this guide. The compute: section looks like this.
compute:
  # See "Global options" below to learn more.
  volumeMounts: []
  volumes: []
  s6: {}
  pyxis:
  partitions:

  # Node definitions. Multiple node definitions are allowed, but
  # only those `enabled: true` will be deployed.
  nodes:
    # An example node definition.
    my-node-def:
      enabled: true
      replicas: 1
      staticFeatures:
        - foo
        - bar
      dynamicFeatures:
        node.coreweave.cloud/class: {}
        gpu.nvidia.com/class: {}
      image:
        repository: registry.gitlab.com/example

      env:
        - name: example
          value: "1"

      gresGpu: h100:8
      config:
        weight: 1
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
          rdma/ib: "1"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"

Global options

Global options apply to every compute node deployed from this manifest. The following sections describe each global option shown in the preceding YAML example.

compute.volumeMounts

The compute.volumeMounts parameter declares a list of volumes to mount in the primary container of each compute node, in addition to those in the chart’s global.volumeMounts. For example:
compute:
  volumeMounts:
    - name: my-pvc
      mountPath: /mnt/my-pvc
  • Entries that share the same mountPath as a globally defined mount override that global mount.
  • SUNK also adds these volumeMounts to the login node primary container.

compute.volumes

The compute.volumes parameter declares a list of additional volumes to attach to the Pod for the compute node. If you use persistent volume claims, they should usually use the ReadWriteMany access mode so that multiple Pods can mount the same volume. See Share storage across Slurm nodes for more information. For example:
compute:
  volumes:
    - name: my-pvc
      persistentVolumeClaim:
        claimName: my-pvc
  • Entries that share the same name as a globally defined volume override that global volume.
  • SUNK also adds these volumes to the login node Pods.

compute.s6

The compute.s6 parameter lets SUNK run custom s6 scripts on compute nodes, either as oneshot or longrun jobs. For example:
compute:
  s6:
    packages:
      type: oneshot
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        apt -y update
        apt -y install nginx
    nginx:
      type: longrun
      timeoutUp: 0
      timeoutDown: 0
      script: |
        #!/usr/bin/env bash
        nginx -g "daemon off;"
See Run custom scripts with s6 for more information.

compute.pyxis

The compute.pyxis parameter has multiple options:
Parameter | Purpose
compute.pyxis.enabled | Enables the pyxis container.
compute.pyxis.mountHome | Enables ENROOT_MOUNT_HOME for the pyxis container to mount the home directory.
compute.pyxis.remapRoot | Enables ENROOT_REMAP_ROOT for the pyxis container to remap the root user.
compute.pyxis.securityContext.capabilities.add | Adds capabilities to the pyxis container. "SYS_ADMIN" is required if you use Pyxis.
For example:
compute:
  pyxis:
    enabled: true
    mountHome: true
    remapRoot: true
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]

compute.partitions

The compute.partitions parameter defines Slurm partitions. A Slurm partition is a logical grouping of compute nodes within the Slurm cluster that organizes nodes by characteristics such as memory size, CPU type, or GPU availability. When users submit a job to a Slurm-managed HPC cluster, they specify the partition where the job should run. The Slurm scheduler then assigns the job to an available node within that partition. Partitions can have different configurations and policies, such as time limits for jobs, user access restrictions, or priority levels.
A related option is compute.autoPartition.enabled. If true (the default), SUNK creates a Slurm partition for each NodeSet defined in compute.nodes. Each partition name matches the name of the corresponding entry under compute.nodes.
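As a minimal sketch, the automatic partition behavior described above could be turned off like this. The nesting is an assumption inferred from the dotted name compute.autoPartition.enabled, and the schema of compute.partitions itself isn’t shown in this guide, so it’s left out here.

```yaml
compute:
  autoPartition:
    # Defaults to true, which creates one partition per NodeSet in compute.nodes.
    # Setting it to false (assumed usage) skips those automatic partitions.
    enabled: false
```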

Other global options

Besides the options shown in the preceding compute example, several others apply globally to all compute nodes.
Parameter | Purpose
compute.generateTopology | If true, generate the network topology.
compute.initialState and compute.initialStateReason | The initial State for the nodes when they join the Slurm cluster, generally drain or idle, and the reason for setting that state. These can also be applied as node-specific options.
compute.maxUnavailable | Sets the maximum unavailability of the compute nodes during a rolling update. Can be a percentage or a number.
compute.ssh.enabled | When enabled, the Slurm compute nodes have SSH available.
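The following sketch shows how these global options might look together in a values file. The key nesting follows the dotted parameter names, and every value is an illustrative assumption rather than a recommended setting. In particular, the exact spelling that initialState accepts isn’t confirmed by this guide.

```yaml
compute:
  # Generate the network topology.
  generateTopology: true
  # New nodes join the cluster drained until someone resumes them (assumed values).
  initialState: drain
  initialStateReason: "Awaiting validation"
  # Allow at most 10% of compute nodes to be unavailable during a rolling update.
  # A plain number, such as 2, is also described as valid.
  maxUnavailable: "10%"
  ssh:
    # Make SSH available on the Slurm compute nodes.
    enabled: true
```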

Node-specific options

Besides the preceding global options, you can set options on each named node definition to customize a single NodeSet without affecting others. Many options are available for each named node definition. For reference, see my-node-def in the preceding YAML example, which shows many of the available options.
  • node.enabled: If true, SUNK deploys compute nodes with this definition. You can declare multiple definitions, but SUNK deploys only those with enabled: true.
  • node.replicas: Specifies the desired number of Slurm nodes (Kubernetes Pods) of this type that the NodeSet attempts to create. This is a maximum value, because the number of desired Pods can be greater than the number of available Pods. To change the number of replicas for a running Slurm cluster, replace [NODESET-NAME] with the name of your NodeSet and [N] with the desired number of replicas:
kubectl scale nodeset [NODESET-NAME] --replicas=[N]
  • node.definitions: A list of other node definitions to include in this definition. See Custom node definitions to learn how to create custom definitions.
  • node.staticFeatures: Static Slurm node feature flags. Feature flags are strings that Slurm adds to the Slurm nodes, where they’re available for use when scheduling Slurm jobs. For example, to schedule a job only on nodes with the feature really-fast:
srun -C really-fast hostname
Here’s an example of how it looks within Slurm.
NodeName=h100-092-02 Arch=x86_64 CoresPerSocket=32
           CPUAlloc=110 CPUEfctv=128 CPUTot=128 CPULoad=0.56
           AvailableFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
           ActiveFeatures=h100-pci4,pci-4,cu120,gpu,infiniband,sharp
  • node.dynamicFeatures: Dynamic Slurm node features from Kubernetes Node labels. This specifies a map of labels to use as additional feature flags within Slurm. The value for each map key is {} because there’s no further configuration at this time.
  • node.image: Specifies which Docker image repository to use to pull this node’s image. See Custom Images to learn more about how to build custom SUNK images.
  • node.env: Sets extra environment variables to expose in the compute nodes.
  • node.gresGpu: Sets the Slurm Generic Resource Scheduling value for the gpu GresType. This describes the type and number of GPU Generic Resources for this Slurm node type.
  • node.config: Adds additional config options to the slurmd startup used during dynamic node registration. The features and gres options are already set. See Node Parameters and Node Configuration for more details on the options and values.
  • node.resources: Sets the Kubernetes Compute resource limits and requests.
  • node.realMemory: Sets per-node RealMemory limits in Slurm config. By default, SUNK uses the node.resources.limits.memory value divided by 1Mi to set the Slurm RealMemory value. This option overrides that default behavior.
  • node.affinity: Sets the Kubernetes Node affinities, which control where the Pods for this definition are scheduled, for example to ensure that the node runs on a specific GPU model.
  • node.initialState and node.initialStateReason: The initial State for the nodes when they join the Slurm cluster, generally drain or idle, and the reason for setting that state. You can also apply these as a general option for the cluster.
  • node.volumeMounts: Additional volumeMounts, set per node definition, to add to the primary container. Uses the same format as compute.volumeMounts. Mounts that match on mountPath override those set at the higher level.
  • node.volumes: Additional volumes, set per node definition, to add to the Pod. Uses the same format as compute.volumes. Volumes that match on name override those set at the higher level.
  • node.containers: Additional containers (for example, sidecars), set per node definition, to add to the Pod. Additional configuration for these containers (such as Secrets and ConfigMaps) must be in the Slurm namespace.
  • node.dnsPolicy: Adjusts the dnsPolicy for each node.
  • node.dnsConfig: Adjusts the dnsConfig for each node.
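As a hedged illustration of several node-specific options from the list above, the sketch below extends my-node-def. It assumes each node.* option sits directly under the named definition in compute.nodes, as the fields in the earlier manifest do. The volume name scratch-pvc, the state reason, and the realMemory value are hypothetical.

```yaml
compute:
  nodes:
    my-node-def:
      enabled: true
      replicas: 2
      resources:
        limits:
          memory: 960Gi
      # Without this key, RealMemory would default to 960Gi / 1Mi = 983040.
      # The lower value here is an arbitrary example of an override.
      realMemory: 950000
      # Per-definition initial state (hypothetical values).
      initialState: drain
      initialStateReason: "New hardware burn-in"
      # Per-definition volume and mount, same format as compute.volumes
      # and compute.volumeMounts. scratch-pvc is a hypothetical claim name.
      volumes:
        - name: scratch-pvc
          persistentVolumeClaim:
            claimName: scratch-pvc
      volumeMounts:
        - name: scratch-pvc
          mountPath: /mnt/scratch
      # Standard Kubernetes dnsPolicy value.
      dnsPolicy: ClusterFirst
```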

Custom node definitions

When several node definitions share configuration, you can factor the shared parts into reusable layers rather than repeat them. Node definitions can reference other node definitions to include or overlay values. You can define these “layers” in the same values.yaml file, or in separate files. In the following example, a node definition named reservation-id acts as a layer:
compute:
  nodes:
    reservation-id:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - [RESERVATION-ID]
The my-node-def definition includes that layer.
compute:
  nodes:
    my-node-def:
      definitions:
      - reservation-id
You can store custom layers as separate values files. A definition can reference any key defined under compute.nodes, even if that key is defined in another file, as long as you specify all of the values files in a defined order on the command line. For example, consider a custom-compute-defs-values.yaml file that only has a compute.nodes section with custom layers defined. The values.yaml file can use those definitions as long as you use both values files when you deploy, like so:
helm install slurm coreweave/slurm -f custom-compute-defs-values.yaml -f values.yaml
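To make the split concrete, here is a sketch of what the two files might contain, written as two YAML documents separated by ---. The shared-storage layer and its my-pvc volume are hypothetical additions for illustration, and the sketch assumes a layer can carry any node-specific option, such as volumes.

```yaml
# File: custom-compute-defs-values.yaml (reusable layers only)
compute:
  nodes:
    reservation-id:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node.coreweave.cloud/reserved
                    operator: In
                    values:
                      - [RESERVATION-ID]
    # Hypothetical second layer that adds shared storage.
    shared-storage:
      volumes:
        - name: my-pvc
          persistentVolumeClaim:
            claimName: my-pvc
      volumeMounts:
        - name: my-pvc
          mountPath: /mnt/my-pvc
---
# File: values.yaml (deployed definition that includes both layers)
compute:
  nodes:
    my-node-def:
      enabled: true
      replicas: 1
      definitions:
        - reservation-id
        - shared-storage
```

Replace [RESERVATION-ID] with your own reservation ID before deploying, as in the earlier example.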

Mixing CPU and GPU node types

A single Slurm cluster often needs to serve workloads with different hardware requirements. You can mix multiple Slurm node types by defining multiple NodeSets in different blocks under compute.nodes. Each NodeSet can have its own resources and affinities that specify a single type of node. For example, you can create a NodeSet that selects a particular type of GPU, while another selects CPU-only nodes, and then deploy any desired number of each node type.
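A minimal sketch of such a mixed configuration follows. The definition names gpu-h100 and cpu-only, the replica counts, the CPU-only resource sizes, and the label value H100 are all assumptions for illustration. The sketch also assumes CPU-only Kubernetes Nodes don’t carry the gpu.nvidia.com/class label, so check the labels on your own Nodes before using an affinity like this.

```yaml
compute:
  nodes:
    # GPU NodeSet: reuses the resource values from the earlier manifest.
    gpu-h100:
      enabled: true
      replicas: 4
      gresGpu: h100:8
      resources:
        limits:
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
        requests:
          cpu: "110"
          memory: 960Gi
          sunk.coreweave.com/accelerator: "8"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu.nvidia.com/class
                    operator: In
                    values:
                      - H100  # assumed label value
    # CPU-only NodeSet: no GPU resources requested (sizes are hypothetical).
    cpu-only:
      enabled: true
      replicas: 2
      resources:
        limits:
          memory: 128Gi
        requests:
          cpu: "32"
          memory: 128Gi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # Select Nodes without a GPU class label (assumption).
                  - key: gpu.nvidia.com/class
                    operator: DoesNotExist
```

With compute.autoPartition.enabled left at its default, each of these NodeSets would get its own Slurm partition, so jobs could target GPU or CPU-only nodes by partition name.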
Last modified on May 27, 2026