NodeSet - CoreWeave Docs

The NodeSet CRD represents a set of Slurm nodes. It manages the Kubernetes Pods that contain the Slurm nodes and their associated Kubernetes Nodes. The NodeSet deploys like a DaemonSet, with a 1:1 mapping between Pods and Kubernetes Nodes. The NodeSet controller is responsible for creating, updating, and deleting the Pods that belong to the NodeSet, and handling their Node assignment.

Pod creation and scaling

A NodeSet will scale up to the desired number of replicas, if possible. It may be limited by the number of available Kubernetes Nodes in the cluster that match the NodeSet’s affinity, tolerations, and resource requests. The status indicates the total number of feasible Nodes, which may be less than the requested amount.

Node selection

When selecting a Kubernetes Node for a new Pod, the NodeSet controller evaluates the Kubernetes Nodes in the cluster against the NodeSet’s affinity, tolerations, and resource requests. The controller also ensures that a Node is used by only one NodeSet at a time. It continues to evaluate Nodes until the desired number of new Pods is reached or the list of Nodes is exhausted. The selected Nodes are then added to the NodeSet.

The NodeSet only evaluates the required portion of the Node affinity. No other sections of affinity are used.
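
The selection loop described above can be sketched roughly as follows. This is an illustrative model only, not the controller's code: the `Node` class, the `owner` field, and the `is_feasible` callback (standing in for the affinity, toleration, and resource checks) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    owner: Optional[str] = None  # hypothetical: NodeSet currently using this Node


def select_nodes(nodes: List[Node], nodeset: str, needed: int,
                 is_feasible: Callable[[Node], bool]) -> List[Node]:
    """Pick up to `needed` Nodes for new Pods of `nodeset`."""
    selected: List[Node] = []
    for node in nodes:
        if len(selected) >= needed:
            break  # desired number of new Pods reached
        if node.owner is not None:
            continue  # a Node may be used by only one NodeSet at a time
        if not is_feasible(node):
            continue  # fails affinity, tolerations, or resource requests
        node.owner = nodeset  # the selected Node is added to the NodeSet
        selected.append(node)
    return selected  # may be shorter than `needed` if the list is exhausted
```

If the list of Nodes runs out first, the function simply returns fewer Nodes than requested, which mirrors the case where the feasible count is lower than the requested replicas.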

For more information, see Configure compute nodes.

Pod naming

The Pods created by the NodeSet are named deterministically by their Node assignment. This makes the Slurm node list consistent, and the association between Kubernetes Nodes and Slurm nodes easy to determine. When a suitable Kubernetes Node is found, the NodeSet controller creates the Pod with the naming convention <prefix>-<name>.

<prefix> is set to .Spec.PodNamePrefix if present, or the NodeSet name if not.
<name> is generated from the last two octets of the Node’s internal IPv4 address, each zero-padded to 3 digits. For example, a Node with IP 10.174.12.2 results in a <name> of 012-002.

Using zero-padded numbers facilitates use of the Slurm range syntax. Because CoreWeave’s environment doesn’t require additional octets to create unique names, only the last two octets are used.
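
As a minimal sketch of the naming rule, the following function derives a Pod name from a prefix and an internal IPv4 address. The function name and the `gpu-nodes` prefix are made up for illustration; only the `<prefix>-<name>` format and the zero-padding come from the description above.

```python
import ipaddress


def nodeset_pod_name(prefix: str, internal_ip: str) -> str:
    """Build <prefix>-<name> from the last two octets of an IPv4 address."""
    # IPv4Address validates the input and rejects malformed or IPv6 addresses.
    octets = ipaddress.IPv4Address(internal_ip).packed
    # Zero-pad each of the last two octets to three digits.
    return f"{prefix}-{octets[2]:03d}-{octets[3]:03d}"


print(nodeset_pod_name("gpu-nodes", "10.174.12.2"))  # gpu-nodes-012-002
```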

Pod creation

When a Pod is created for a NodeSet, the Pod’s Node affinity is overridden to match the Node that the Pod is named after, forcing the Pod to schedule onto that specific Node. The Pod also receives a toleration for the sunk.coreweave.com/lock key. When a NodeSet Pod is added to a Node, the Node is added to one of the NodeSlices associated with the NodeSet. Because a NodeSet Pod is intended to occupy a full Node when running, only one NodeSet ever schedules a Pod to a given Node at a time.
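
The following sketch shows one way such a Pod spec could be pinned to a Node, using plain dictionaries shaped like Kubernetes Pod spec fields. The exact affinity form (a `matchFields` term on `metadata.name`, as the Kubernetes DaemonSet controller uses) and the `Exists` operator on the toleration are assumptions, not confirmed details of the NodeSet controller. `pin_pod_to_node` is a hypothetical helper.

```python
LOCK_TOLERATION_KEY = "sunk.coreweave.com/lock"


def pin_pod_to_node(pod_spec: dict, node_name: str) -> dict:
    """Return a copy of pod_spec pinned to node_name with the lock toleration."""
    spec = dict(pod_spec)  # shallow copy so the input is not mutated

    # Override only the Node affinity; other affinity sections are left alone.
    affinity = dict(spec.get("affinity", {}))
    affinity["nodeAffinity"] = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                # Assumption: match the Node by name through a field selector.
                "matchFields": [{
                    "key": "metadata.name",
                    "operator": "In",
                    "values": [node_name],
                }],
            }],
        },
    }
    spec["affinity"] = affinity

    # Add the lock toleration. The operator is an assumption.
    tolerations = list(spec.get("tolerations", []))
    tolerations.append({"key": LOCK_TOLERATION_KEY, "operator": "Exists"})
    spec["tolerations"] = tolerations
    return spec
```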

Safe deletion

When a NodeSet is scaled down, or when a rolling upgrade is performed, a safe deletion mechanism removes the Pods without interrupting any running Slurm jobs. The deletion process exchanges information between the cluster-wide NodeSet controller and the individual Slurm cluster Syncers through the SlurmDelete condition on the Pod. When the NodeSet seeks to delete a Pod for scaling or rolling update purposes, the NodeSet sets the deletion condition to true with the reason ScalingDeleteScheduled or RollingDeleteScheduled. The Syncer responsible for that NodeSet then drains the Node in the associated Slurm cluster and updates the Pod’s state. After the NodeSet controller sees the Pod is both drained and idle, it deletes the Pod. See Syncer for more details about the Pod conditions visible to the Syncer.

If a NodeSet is scaled down and then scaled back up before a Pod scheduled for scaling deletion has actually been deleted, the NodeSet controller removes the deletion condition and halts the deletion process. This also removes the drain condition from the Slurm node.
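
The handshake between the NodeSet controller and the Syncer can be modeled as a small state check. This is a simplified sketch: the `PodState` class and its fields are hypothetical stand-ins for the Pod conditions, and only the reason strings come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

SCALING = "ScalingDeleteScheduled"
ROLLING = "RollingDeleteScheduled"


@dataclass
class PodState:
    slurm_delete_reason: Optional[str] = None  # None means no SlurmDelete condition
    drained: bool = False  # reported by the Syncer after draining in Slurm
    idle: bool = False     # no Slurm job running on the node


def request_delete(pod: PodState, rolling: bool) -> None:
    """NodeSet controller marks the Pod for safe deletion."""
    pod.slurm_delete_reason = ROLLING if rolling else SCALING


def cancel_scaling_delete(pod: PodState) -> None:
    """Scale-up arrived before deletion: halt a pending scaling delete."""
    if pod.slurm_delete_reason == SCALING:
        pod.slurm_delete_reason = None
        pod.drained = False  # the drain is lifted as well


def ready_to_delete(pod: PodState) -> bool:
    """The Pod is deleted only once it is marked, drained, and idle."""
    return pod.slurm_delete_reason is not None and pod.drained and pod.idle
```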

Forced deletion

When the NodeSet options operator.config.operator.nodeSet.forceScalingDeleteKnownConditionTimeout and operator.config.operator.nodeSet.forceScalingDeleteUnknownConditionTimeout are set, the NodeSet forcibly deletes Pods once the specified timeout is reached. By default, both options are set to 0, which disables the feature. When enabled, these options take effect during NodeSet scale-down events. Pods are deleted after the configured timeout, even if a Slurm job is still running on the Pod.

forceScalingDeleteKnownConditionTimeout applies when the Slurm state is known for the Pod, which is typical in a healthy cluster. For example, if this option is set to 10 seconds and a scale-down event occurs, the Pod is forcibly removed from Slurm 10 seconds later, even if a job is still running.
forceScalingDeleteUnknownConditionTimeout functions the same way but applies when the Slurm state is unknown. This can occur if the cluster is experiencing issues.
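
The timeout choice can be sketched as a single decision function. This is a hedged illustration that assumes both timeouts are plain numbers of seconds; the real option format may differ, and `force_delete_due` is a hypothetical name.

```python
def force_delete_due(elapsed_seconds: float, slurm_state_known: bool,
                     known_timeout: float = 0, unknown_timeout: float = 0) -> bool:
    """Return True when a Pod pending scale-down should be forcibly deleted."""
    # Pick the timeout that matches what is known about the Slurm state.
    timeout = known_timeout if slurm_state_known else unknown_timeout
    if timeout <= 0:
        return False  # 0 is the default and disables forced deletion
    return elapsed_seconds >= timeout
```

With `known_timeout=10`, a Pod whose Slurm state is known becomes eligible for forced deletion 10 seconds after the scale-down event, whether or not a job is still running.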

Scaling down

NodeSet scale-down follows a priority-based ordering, which is enabled by default through the operator.config.operator.nodeSet.scaleDownPriorityOrdering option. Each Pod is ranked and assigned a deletion priority, which determines the order of deletion. A Pod with a lower priority value is deleted first. The priority is determined by the Pod’s readiness, Slurm running state, drained state, and pending delete status. In summary, Pods that are not ready are deleted first, followed by idle and drained Pods, and finally Pods that are running workloads. The following ranking, from first deleted to last deleted, determines the priority value:

Not-ready Pods are the lowest priority and are the first to be deleted.
Idle Pods that are drained and have the delete condition
Idle Pods that are drained but do NOT have the delete condition
Idle Pods that are not drained
Pods running workloads that are drained and have the delete condition
Pods running workloads that are drained and do NOT have the delete condition
Pods running workloads that are not drained
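
The ranking above can be expressed as a sort key. In this sketch the `PodInfo` class and the numeric values 0 through 6 are illustrative assumptions; only the relative order is taken from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodInfo:
    name: str
    ready: bool
    running: bool           # running a Slurm workload
    drained: bool
    delete_condition: bool  # already has the delete condition


def deletion_priority(pod: PodInfo) -> int:
    """Lower values are deleted first."""
    if not pod.ready:
        return 0
    if not pod.running:  # idle Pods
        if pod.drained:
            return 1 if pod.delete_condition else 2
        return 3
    if pod.drained:      # Pods running workloads
        return 4 if pod.delete_condition else 5
    return 6


def scale_down_order(pods: List[PodInfo]) -> List[PodInfo]:
    # sorted() is stable, so Pods with equal priority keep their input order.
    return sorted(pods, key=deletion_priority)
```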

When the operator.config.operator.nodeSet.scaleDownPriorityOrdering option is disabled, the NodeSet scales down by deleting newer Pods first, following the Safe Deletion process outlined previously. Regardless of the option chosen, when a Pod is removed, the associated NodeSlices are updated to no longer contain that Node.

Rolling upgrade

A NodeSet can be upgraded either by deletion or through a rolling update. A rolling update is the preferred option because it automates the upgrade using Safe Deletion. To limit the number of Pods that can be unavailable during a rolling update, set maxUnavailable to a percentage or an absolute number. Pods with a pending delete status count toward this limit. The rolling update uses a hash of the .spec.template.spec value to determine whether a Pod is up to date. The Pod’s controller-revision-hash label stores the hash that was current when the Pod was created. If that value doesn’t match the current NodeSet hash, the Pod is considered outdated. The rolling update prioritizes Pods in this order:

Non-ready Pods
Idle Pods that are not running Slurm workloads
Remaining Pods

Following this order allows the cluster to update as many Pods as possible before waiting for the remaining Pods to become drained and idle.
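
A rough sketch of the outdated check and the update ordering follows. The `Pod` class and helper names are hypothetical; the `controller-revision-hash` label name and the three-tier order come from the description above. How a percentage `maxUnavailable` is resolved to a count is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

REVISION_LABEL = "controller-revision-hash"


@dataclass
class Pod:
    name: str
    ready: bool
    running_job: bool             # running a Slurm workload
    pending_delete: bool = False
    labels: Dict[str, str] = field(default_factory=dict)


def is_outdated(pod: Pod, current_hash: str) -> bool:
    """A Pod is outdated when its stored hash differs from the current one."""
    return pod.labels.get(REVISION_LABEL) != current_hash


def update_rank(pod: Pod) -> int:
    if not pod.ready:
        return 0  # non-ready Pods first
    if not pod.running_job:
        return 1  # then idle Pods
    return 2      # remaining Pods last


def outdated_in_update_order(pods: List[Pod], current_hash: str) -> List[Pod]:
    return sorted((p for p in pods if is_outdated(p, current_hash)), key=update_rank)


def unavailable_count(pods: List[Pod]) -> int:
    """Pods counted against maxUnavailable, including pending deletes."""
    return sum(1 for p in pods if not p.ready or p.pending_delete)
```

A controller following this model would compare `unavailable_count` with the resolved `maxUnavailable` value before marking the next Pod in the ordered list for deletion.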

Status

The NodeSet status includes the following fields, each with a corresponding metric:

Field	Metric	Definition
`currentNumberScheduled`	`kube_sunk_nodeset_current`	Number of Nodes that the NodeSet is scheduled on.
`numberMisscheduled`	`kube_sunk_nodeset_misscheduled`	Number of Nodes that have the NodeSet’s Pod but should no longer be scheduled on.
`desiredNumberScheduled`	`kube_sunk_nodeset_desired`	Desired number of Nodes to schedule on. If `.spec.replicas` is set, this matches that value. Otherwise, it matches `numberFeasible`.
`updatedNumberScheduled`	`kube_sunk_nodeset_up_to_date`	Number of Nodes having scheduled the latest version of the NodeSet’s Pod.
`numberFeasible`	`kube_sunk_nodeset_feasible`	Number of Nodes the NodeSet could schedule on (inclusive of those already scheduled).
`numberReady`	`kube_sunk_nodeset_ready`	Number of Nodes having the NodeSet’s Pod in a Ready state.
`numberUnavailable`	`kube_sunk_nodeset_unavailable`	Number of Nodes that are scheduled but do not have the NodeSet’s Pod in a Ready state.
`numberRunning`	`kube_sunk_nodeset_running`	Number of Nodes actively running Slurm jobs.
`numberDrain`	`kube_sunk_nodeset_drained`	Number of Nodes that are drained in Slurm.
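
The relationship between `.spec.replicas`, `numberFeasible`, and `desiredNumberScheduled` described in the table can be written out as a short sketch. The function names are hypothetical, and `shortfall` is an illustrative derived value rather than a status field.

```python
from typing import Optional


def desired_number_scheduled(replicas: Optional[int], number_feasible: int) -> int:
    """Matches .spec.replicas when set, otherwise numberFeasible."""
    return number_feasible if replicas is None else replicas


def shortfall(replicas: Optional[int], number_feasible: int) -> int:
    """How many requested replicas cannot be placed on feasible Nodes."""
    return max(0, desired_number_scheduled(replicas, number_feasible) - number_feasible)
```

For example, a NodeSet requesting 10 replicas in a cluster with 8 feasible Nodes has a desired count of 10 and a shortfall of 2.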
