The NodeSet CRD represents a set of Slurm nodes. It manages the Kubernetes Pods that contain the Slurm nodes and their associated Kubernetes Nodes. The NodeSet deploys like a DaemonSet, with a 1:1 mapping between Pods and Kubernetes Nodes. The NodeSet controller is responsible for creating, updating, and deleting the Pods that belong to the NodeSet, and handling their Node assignment.
Pod creation and scaling
A NodeSet scales up to the desired number of replicas when possible. It may be limited by the number of available Kubernetes Nodes in the cluster that match the NodeSet’s affinity, tolerations, and resource requests. The status indicates the total number of feasible Nodes, which may be less than the requested amount.
Node selection
When selecting a Node to assign a Pod to, the NodeSet evaluates the Kubernetes Nodes in the cluster against the affinity, tolerations, and resource requests. Additionally, the NodeSet controller makes sure that a Node is used by only one NodeSet at a time. The NodeSet controller continues to evaluate Nodes until the desired number of new Pods is reached, or the list of Nodes is exhausted. The selected Nodes are then added to the NodeSet.
The NodeSet only evaluates the required portion of the Node affinity. No other sections of affinity are used.
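The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the controller's code: the `Node` class, its `claimed_by` field, and the `is_feasible` callback (standing in for the required affinity, toleration, and resource-request checks) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    # Hypothetical field: the NodeSet currently using this Node, if any.
    claimed_by: Optional[str] = None


def select_nodes(
    nodes: List[Node],
    wanted: int,
    is_feasible: Callable[[Node], bool],
) -> List[Node]:
    """Pick up to `wanted` Nodes for new Pods, in list order."""
    selected: List[Node] = []
    for node in nodes:
        if len(selected) >= wanted:
            break  # desired number of new Pods reached
        if node.claimed_by is not None:
            continue  # a Node is used by only one NodeSet at a time
        if not is_feasible(node):
            continue  # fails affinity, tolerations, or resource requests
        selected.append(node)
    # May hold fewer than `wanted` Nodes if the list was exhausted.
    return selected
```

If fewer Nodes qualify than were requested, the function simply returns the shorter list, which mirrors the note that the number of feasible Nodes may be less than the requested amount.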
Pod naming
The Pods created by the NodeSet are named deterministically by their Node assignment. This makes the Slurm node list consistent, and the association between Kubernetes Nodes and Slurm nodes easy to determine. When a suitable Kubernetes Node is found, the NodeSet controller creates the Pod with the naming convention `<prefix>-<name>`.
- `<prefix>` is set to `.Spec.PodNamePrefix` if present, or the NodeSet name if not.
- `<name>` is generated from the last two octets of the Node’s internal IPv4 address, each zero-padded to 3 digits. For example, a Node with IP `10.174.12.2` results in a `<name>` of `012-002`.
Using zero-padded numbers facilitates use of the Slurm range syntax. Because CoreWeave’s environment doesn’t require additional octets to create unique names, only the last two octets are used.
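As a small illustration of the naming rule, the following sketch derives a Pod name from a prefix and an internal IPv4 address. The function name and the `gpu` prefix are made up for the example; only the `<prefix>-<name>` rule comes from the text above.

```python
import ipaddress


def nodeset_pod_name(prefix: str, internal_ip: str) -> str:
    """Build `<prefix>-<name>` from the last two octets of an IPv4 address."""
    octets = ipaddress.IPv4Address(internal_ip).packed  # four bytes
    # Zero-pad each of the last two octets to 3 digits.
    return f"{prefix}-{octets[2]:03d}-{octets[3]:03d}"


print(nodeset_pod_name("gpu", "10.174.12.2"))  # gpu-012-002
```

Because every name has the same width, consecutive names sort lexically in the same order as their addresses.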
Pod creation
When a Pod is created for a NodeSet, the Pod’s Node affinity is overridden to match the Node that the Pod is named after, forcing the Pod to schedule onto that specific Node. Additionally, a toleration for the `sunk.coreweave.com/lock` key is added to the Pod.
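The two overrides can be pictured as a transformation of the Pod spec. The sketch below is an assumption-heavy illustration: the text only says that the affinity is overridden to match the Node and that a toleration for the lock key is added. Selecting the Node with `matchFields` on `metadata.name` and using the `Exists` operator on the toleration are guesses at one plausible shape, and `pin_pod_to_node` is a hypothetical helper.

```python
import copy

LOCK_TOLERATION_KEY = "sunk.coreweave.com/lock"


def pin_pod_to_node(pod_spec: dict, node_name: str) -> dict:
    """Return a copy of `pod_spec` pinned to `node_name` (illustrative only)."""
    spec = copy.deepcopy(pod_spec)

    # Override the Node affinity so the Pod can only land on one Node.
    # Assumption: the match is expressed as a field selector on the Node name.
    affinity = dict(spec.get("affinity") or {})
    affinity["nodeAffinity"] = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchFields": [
                        {
                            "key": "metadata.name",
                            "operator": "In",
                            "values": [node_name],
                        }
                    ]
                }
            ]
        }
    }
    spec["affinity"] = affinity

    # Add the lock toleration once. Assumption: operator "Exists", no effect set.
    tolerations = list(spec.get("tolerations") or [])
    if not any(t.get("key") == LOCK_TOLERATION_KEY for t in tolerations):
        tolerations.append({"key": LOCK_TOLERATION_KEY, "operator": "Exists"})
    spec["tolerations"] = tolerations
    return spec
```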
When a NodeSet Pod is added to a Node, the Node is added to one of the NodeSlices associated with the NodeSet. Because a NodeSet Pod is intended to occupy a full Node when running, only one NodeSet will ever schedule a Pod to a given Node at a time.
Safe deletion
When a NodeSet is scaled down, or when a rolling upgrade is performed, a safe deletion mechanism removes the Pods without interrupting any running Slurm jobs. The deletion process exchanges information between the cluster-wide NodeSet controller and the individual Slurm cluster Syncers through the `SlurmDelete` condition on the Pod.
When the NodeSet seeks to delete a Pod for scaling or rolling update purposes, the NodeSet sets the deletion condition to true with the reason ScalingDeleteScheduled or RollingDeleteScheduled. The Syncer responsible for that NodeSet then drains the Node in the associated Slurm cluster and updates the Pod’s state. After the NodeSet controller sees the Pod is both drained and idle, it deletes the Pod. See Syncer for more details about the Pod conditions visible to the Syncer.
If a Pod is pending deletion due to scaling and the NodeSet is scaled back up before the Pod is actually deleted, the NodeSet controller removes the label and halts the deletion process. This also removes the drain condition from the Slurm node.
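The controller side of this handshake can be summarized as a small decision function. This is a sketch under assumptions: `PodView`, its fields, and the returned action strings are hypothetical, and the Syncer's part (draining the node and reporting state) is reduced to the `drained` and `idle` flags.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PodView:
    # Reason on the deletion condition, or None if no deletion is scheduled.
    delete_reason: Optional[str] = None
    drained: bool = False  # reported after the Syncer drains the Slurm node
    idle: bool = False     # no Slurm job is running on the node


def next_action(pod: PodView, scaled_back_up: bool = False) -> str:
    """Decide what the controller does with one Pod (illustrative only)."""
    if pod.delete_reason is None:
        return "none"
    # A scaling deletion is halted if the NodeSet was scaled back up in time.
    if pod.delete_reason == "ScalingDeleteScheduled" and scaled_back_up:
        return "cancel"
    # Delete only once the Pod is both drained and idle.
    if pod.drained and pod.idle:
        return "delete"
    return "wait"
```

For example, a Pod marked `RollingDeleteScheduled` that is drained but still running a job yields `"wait"`, so the job is not interrupted.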
Forced deletion
When using the NodeSet options `operator.config.operator.nodeSet.forceScalingDeleteKnownConditionTimeout` and `operator.config.operator.nodeSet.forceScalingDeleteUnknownConditionTimeout`, the NodeSet will forcibly delete Pods after the specified timeout is reached. By default, both options are set to 0, disabling the feature. When enabled, these options take effect during NodeSet scale-down events. Pods will be deleted after the configured timeout, even if a Slurm job is still running on the Pod.
- `forceScalingDeleteKnownConditionTimeout` applies when the Slurm state is known for the Pod, which is typical in a healthy cluster. For example, if this option is set to 10 seconds and a scale-down event occurs, the Pod will be forcibly removed from Slurm 10 seconds later, even if a job is still running.
- `forceScalingDeleteUnknownConditionTimeout` functions the same way but applies when the Slurm state is unknown. This can occur if the cluster is experiencing issues.
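The choice between the two timeouts can be sketched as below. The function and parameter names are hypothetical, and the timeouts are modeled as plain seconds for simplicity; the actual configuration value format is not stated here.

```python
def force_delete_due(
    elapsed_seconds: float,
    slurm_state_known: bool,
    known_timeout: float = 0,
    unknown_timeout: float = 0,
) -> bool:
    """Return True if a scale-down deletion should be forced (illustrative)."""
    # Pick the timeout that matches what is known about the Slurm state.
    timeout = known_timeout if slurm_state_known else unknown_timeout
    if timeout <= 0:
        return False  # 0 is the default and disables forced deletion
    return elapsed_seconds >= timeout
```

With `known_timeout=10`, a Pod whose Slurm state is known is due for forced deletion once 10 seconds have passed since the scale-down event, regardless of running jobs.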
Scaling down
NodeSet scaling down follows a priority-based ordering, which is enabled by default via the `operator.config.operator.nodeSet.scaleDownPriorityOrdering` option. Each Pod is ranked and assigned a deletion priority, which determines the order of deletion. A Pod with a lower priority value is deleted first. The priority is determined by the Pod’s readiness, Slurm running state, drained state, and pending delete status. In summary, Pods that are not ready are deleted first, followed by idle and drained Pods, and finally Pods that are running workloads.
Below is the ranking used to determine the priority value:
- Not-ready Pods are the lowest priority and are the first to be deleted.
- Idle Pods that are `drained` and have the `delete` condition
- Idle Pods that are `drained` but do NOT have the `delete` condition
- Idle Pods that are not `drained`
- Pods running workloads that are `drained` and have the `delete` condition
- Pods running workloads that are `drained` and do NOT have the `delete` condition
- Pods running workloads that are not `drained`
When the `operator.config.operator.nodeSet.scaleDownPriorityOrdering` option is disabled, the NodeSet scales down by deleting newer Pods first, following the Safe Deletion process outlined previously.
Regardless of the option chosen, when a Pod is removed, the associated NodeSlices are updated to no longer contain that Node.
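The priority ranking above can be expressed as a sort key. This is a sketch, not the operator's implementation: the function name, its boolean parameters, and the concrete numbers 0 through 6 are illustrative, and only their relative order follows the list.

```python
def deletion_priority(
    ready: bool,
    running: bool,
    drained: bool,
    has_delete_condition: bool,
) -> int:
    """Lower value means the Pod is deleted earlier (illustrative ranks)."""
    if not ready:
        return 0  # not-ready Pods go first
    base = 4 if running else 1  # idle Pods rank before Pods running workloads
    if drained and has_delete_condition:
        return base
    if drained:
        return base + 1
    return base + 2  # not drained


# Hypothetical Pods as (name, ready, running, drained, has_delete_condition).
pods = [
    ("busy", True, True, False, False),
    ("idle-drained", True, False, True, False),
    ("broken", False, False, False, False),
]
ordered = sorted(pods, key=lambda p: deletion_priority(*p[1:]))
print([p[0] for p in ordered])  # ['broken', 'idle-drained', 'busy']
```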
Rolling upgrade
A NodeSet can be upgraded either by deletion or via a rolling update. A rolling update is the preferred option because it automates the upgrade using Safe Deletion. To limit the number of Pods that can be unavailable during a rolling update, set `maxUnavailable` to a percentage or a number. Pods with a pending delete status count toward this limit.
The rolling update uses a hash of the .spec.template.spec value to determine if a Pod is up to date. The Pod’s controller-revision-hash label stores the associated value used when the Pod was created. If the value doesn’t match the current NodeSet value, then the Pod is considered to be outdated.
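The up-to-date check can be illustrated with a stand-in hash. The hashing scheme here (SHA-256 over canonical JSON, truncated to 10 characters) is purely an assumption for the example, since the text does not say which hash the controller uses; `template_hash` and `is_outdated` are hypothetical names. Only the comparison against the `controller-revision-hash` label follows the description above.

```python
import hashlib
import json


def template_hash(pod_spec: dict) -> str:
    """Stand-in hash of a Pod template spec (algorithm is an assumption)."""
    # Sort keys so that equal specs always serialize to the same bytes.
    canonical = json.dumps(pod_spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]


def is_outdated(pod_labels: dict, current_spec: dict) -> bool:
    """A Pod is outdated when its stored hash differs from the current one."""
    return pod_labels.get("controller-revision-hash") != template_hash(current_spec)
```

A Pod created from the current template carries a matching label and is left alone; any change to the template changes the hash and marks existing Pods as outdated.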
The rolling update prioritizes Pods by:
- Non-ready Pods
- Idle Pods that are not running Slurm workloads
- Remaining Pods
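The `maxUnavailable` limit and the update order above can be sketched with two small helpers. Both function names are hypothetical, and rounding a percentage up is an assumption, since the text does not state how fractional results are handled.

```python
from typing import Union


def resolve_max_unavailable(value: Union[int, str], total_pods: int) -> int:
    """Turn a number or a percentage string such as "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        # Assumption: fractional results are rounded up.
        return (total_pods * percent + 99) // 100
    return int(value)


def update_rank(ready: bool, running_slurm_workload: bool) -> int:
    """Lower value means the Pod is updated earlier."""
    if not ready:
        return 0  # non-ready Pods
    if not running_slurm_workload:
        return 1  # idle Pods
    return 2      # remaining Pods
```

Sorting outdated Pods by `update_rank` and taking only as many as the resolved limit allows gives a rough picture of one rolling-update step.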
Status
The NodeSet reports status information, including:
| Field | Metric | Definition |
|---|---|---|
| `currentNumberScheduled` | `kube_sunk_nodeset_current` | Number of Nodes that the NodeSet is scheduled on. |
| `numberMisscheduled` | `kube_sunk_nodeset_misscheduled` | Number of Nodes that run the NodeSet’s Pods but should no longer have them scheduled. |
| `desiredNumberScheduled` | `kube_sunk_nodeset_desired` | Desired number of Nodes to schedule on. If `.spec.replicas` is set, this matches that value. Otherwise, it matches `numberFeasible`. |
| `updatedNumberScheduled` | `kube_sunk_nodeset_up_to_date` | Number of Nodes that have scheduled the latest version of the NodeSet’s Pod. |
| `numberFeasible` | `kube_sunk_nodeset_feasible` | Number of Nodes the NodeSet could schedule on (inclusive of those already scheduled). |
| `numberReady` | `kube_sunk_nodeset_ready` | Number of Nodes that have the NodeSet’s Pod in a Ready state. |
| `numberUnavailable` | `kube_sunk_nodeset_unavailable` | Number of Nodes that are scheduled but do not have the NodeSet’s Pod in a Ready state. |
| `numberRunning` | `kube_sunk_nodeset_running` | Number of Nodes actively running Slurm jobs. |
| `numberDrain` | `kube_sunk_nodeset_drained` | Number of Nodes that are drained in Slurm. |