Node Pool
Node Pool is the schema for the Node Pools API.

| Field | Type | Description | Default |
|---|---|---|---|
apiVersion | string | The version of the Kubernetes API that the Node Pool uses. | compute.coreweave.com/v1alpha1 |
kind | string | The type of resource being defined, such as a Node Pool | NodePoolList |
metadata | ListMeta | See the Kubernetes API documentation for metadata | |
spec | NodePoolSpec | The desired state of the Node Pool. | |
status | Node Pool status | The observed state of the Node Pool. See Node Pool status. | |
Node Pool spec
NodePoolSpec defines the desired state of a Node Pool.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
computeClass | enum (default) | The type of Node Pool. The default value, default, is used for Reserved and On-Demand instances. | default | Optional |
instanceType | string | Instance type for the Nodes in this Node Pool (Note: instanceType is immutable) | N/A | Required |
targetNodes | integer | The quantity of desired Nodes in the Node Pool (Note: Exactly one of targetNodes and targetRacks must be provided) | N/A | Optional |
targetRacks | integer | The quantity of desired racks in the Node Pool (Note: Exactly one of targetNodes and targetRacks must be provided. targetRacks is only available for NVL72-powered instance types such as GB200 and GB300) | N/A | Optional |
minNodes | integer | The minimum value of targetNodes allowed by the autoscaler | N/A | Optional |
maxNodes | integer | The maximum value of targetNodes allowed by the autoscaler | N/A | Optional |
nodeLabels | object (keys:string, values:string) | List of labels associated with the Nodes in the Node Pool | N/A | Optional |
nodeAnnotations | object (keys:string, values:string) | List of annotations associated with the Nodes in the Node Pool | N/A | Optional |
nodeTaints | Taint Array | List of taints associated with the Nodes in the Node Pool | N/A | Optional |
image | Image | The image to use for the Nodes in the Node Pool. (Note: image and gpu are mutually exclusive. If both are set, gpu is ignored.) For updates to take effect, perform a reconfigure reboot. | N/A | Optional |
gpu | Gpu | GPU driver configuration for the Nodes in the Node Pool. (Note: gpu and image are mutually exclusive. If both are set, gpu is ignored.) For updates to take effect, perform a reconfigure reboot. | N/A | Optional |
autoscaling | boolean | Enable or disable cluster autoscaling | false | Optional |
lifecycle.scaleDownStrategy | enum (IdleOnly, PreferIdle) | Options for removing Nodes when scaling down to reach targetNodes. IdleOnly removes only idle Nodes. PreferIdle removes idle Nodes first and then active Nodes if needed. | IdleOnly | Optional |
nodeConfigurationUpdateStrategy.type | enum (Manual, OnSpecUpdate, Always) | Defines how configuration updates get staged onto the Node Pool. See Node configuration update strategies for more information. | OnSpecUpdate | Optional |
prefill | PrefillSpec | Enables proactive replacement of Nodes marked for triage before they are drained. See Prefill. | N/A | Optional |
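
The target rule in the table above can be checked before a spec is submitted. The following is a minimal Python sketch, not CoreWeave tooling: the helper name `validate_targets` and the placeholder instance type are hypothetical, and only two rules from the table are encoded (instanceType is required; exactly one of targetNodes and targetRacks must be provided).

```python
from typing import Any


def validate_targets(spec: dict[str, Any]) -> list[str]:
    """Return the problems found in a Node Pool spec dict (empty list if none)."""
    problems = []
    # instanceType is the only field marked Required in the table.
    if not spec.get("instanceType"):
        problems.append("instanceType is required")
    # Exactly one of targetNodes and targetRacks must be provided.
    has_nodes = spec.get("targetNodes") is not None
    has_racks = spec.get("targetRacks") is not None
    if has_nodes == has_racks:
        problems.append("exactly one of targetNodes and targetRacks must be provided")
    return problems


# Hypothetical spec; "example-instance-type" is a placeholder, not a real instance type.
example_spec = {"instanceType": "example-instance-type", "targetNodes": 2}
print(validate_targets(example_spec))
```

A spec that sets both targetNodes and targetRacks, or neither, yields one problem string from this sketch.
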
Node configuration update strategies
| Strategy | Behavior |
|---|---|
| Manual | Updates require user intervention to stage onto the NodePool. You can inspect updates in pendingNodeConfiguration and apply them with the CoreWeave Intelligent CLI. |
| OnSpecUpdate | Updates are staged automatically only if triggered by a direct change to the NodePool spec (for example, modifying the GPU driver version). |
| Always | Updates are staged automatically for both user-initiated spec changes and any available upstream updates. |
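
The three strategies can be summarized as a small decision function. This is an illustrative Python sketch of the table above, not the controller's implementation; the function name and the trigger labels `"spec"` (a direct change to the NodePool spec) and `"upstream"` (an available upstream update) are hypothetical.

```python
def is_staged_automatically(strategy: str, trigger: str) -> bool:
    """Return True if an update from the given trigger is staged without user action.

    strategy: one of "Manual", "OnSpecUpdate", "Always".
    trigger:  "spec" or "upstream" (hypothetical labels used only in this sketch).
    """
    if trigger not in ("spec", "upstream"):
        raise ValueError(f"unknown trigger: {trigger}")
    if strategy == "Manual":
        # Every update waits for user intervention.
        return False
    if strategy == "OnSpecUpdate":
        # Only direct spec changes are staged automatically.
        return trigger == "spec"
    if strategy == "Always":
        # Both spec changes and upstream updates are staged automatically.
        return True
    raise ValueError(f"unknown strategy: {strategy}")
```
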
Image
Image defines the boot image that the NodePoolSpec uses.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
kernel | string | Kernel version for this Node Pool | | Optional |
name | string | Name of the image for this Node Pool. If name is set, kernel must be empty and releaseTrain must be stable. | | Optional |
releaseTrain | string | The release channel or track for the image | stable | Optional. Enum: [stable, latest] |
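
The constraints between the Image fields can be expressed as a short check. The sketch below is a hypothetical Python helper (the name `validate_image` is not part of any CoreWeave tool) that encodes only what the table states: releaseTrain defaults to stable and accepts stable or latest, and when name is set, kernel must be empty and releaseTrain must be stable.

```python
def validate_image(image: dict) -> list[str]:
    """Return the problems found in an Image dict (empty list if none)."""
    problems = []
    # releaseTrain defaults to "stable" when omitted.
    release_train = image.get("releaseTrain", "stable")
    if release_train not in ("stable", "latest"):
        problems.append("releaseTrain must be stable or latest")
    if image.get("name"):
        # A named image cannot be combined with a kernel selection.
        if image.get("kernel"):
            problems.append("kernel must be empty when name is set")
        # A named image requires the stable release train.
        if release_train != "stable":
            problems.append("releaseTrain must be stable when name is set")
    return problems
```
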
NVSHMEM + GDRCopy support

If you need an image with NVSHMEM and GDRCopy support, you can request access to ncore-image v2.10.1. First apply a patch to ibgda in your container and enable GDRCopy, then contact support to get access to this image.

GPU
The gpu field defines the GPU driver configuration for Nodes in the NodePoolSpec.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
version | string | The specific GPU driver version to use. Only the major version is specified. | | Optional |
Prefill
The prefill spec defines Node Pool Prefill behavior: proactive provisioning of replacement Nodes before Nodes marked for triage are drained and removed. See Node Pool prefill for an overview.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
enabled | boolean | Enable or disable Node Pool Prefill. When enabled, CKS provisions replacement Nodes before draining Nodes marked for triage. Prefill requires autoscaling: false and computeClass: default. It cannot be enabled alongside autoscaling or on Spot Node Pools. | false | Optional |
timeout | Duration | Maximum time to wait for the Node to become idle after the replacement Node is ready (for example, 24h). The Node is Unschedulable from the time it enters triage. After the timeout duration, CKS sets the Node to drain but does not forcibly remove it; CKS removes the Node once it becomes idle. | 24h | Optional |
maxNodes | integer | Maximum number of extra Nodes that can be provisioned for prefill at once (cap on concurrent prefill replacements). Can be between 1 and 4. | 3 | Optional |
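
The Prefill requirements in the table above can be checked against a Node Pool spec before it is applied. This Python sketch is illustrative only: `validate_prefill` is a hypothetical helper, and it assumes the checks matter only when prefill is enabled, which the table does not state explicitly.

```python
def validate_prefill(spec: dict) -> list[str]:
    """Return the problems found in a spec's prefill settings (empty list if none)."""
    prefill = spec.get("prefill") or {}
    # Assumption of this sketch: nothing is checked while prefill is disabled (the default).
    if not prefill.get("enabled", False):
        return []
    problems = []
    # Prefill cannot be enabled alongside autoscaling.
    if spec.get("autoscaling", False):
        problems.append("prefill requires autoscaling: false")
    # Prefill requires the default compute class.
    if spec.get("computeClass", "default") != "default":
        problems.append("prefill requires computeClass: default")
    # maxNodes defaults to 3 and can be between 1 and 4.
    max_nodes = prefill.get("maxNodes", 3)
    if not 1 <= max_nodes <= 4:
        problems.append("prefill.maxNodes must be between 1 and 4")
    return problems
```
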
Node Pool status
Node Pool status is the observed state of a Node Pool.
| Field | Type | Description | Validation |
|---|---|---|---|
queuedNodes | integer | Number of queued Nodes waiting to be assigned. | Optional Minimum |
inProgress | integer | Number of Nodes that have been assigned, but have not yet fully booted into the cluster | Optional Minimum |
currentNodes | integer | Number of Nodes for this Node Pool present in the cluster | Optional Minimum |
prefillNodes | integer | Number of Nodes in this Node Pool that are present in the cluster and marked for prefill. See Prefill. | Optional Minimum |
rackStatus | RackStatus | Current rack counts for the Node Pool. Only populated for rack-based instance types. | Optional |
nodeProfile | string | NodeProfile is the unique identifier for the active Node configuration | Optional |
pendingNodeConfiguration | object (createdAt: timestamp, nodeProfile: string, summary: string array) | Contains information about the staged Node configuration. This field is only present if NodePool spec changes were made that require a new Config, or if there are updates available such as a new ncore image. | Optional |
nodeConfigurationRevisions | nodeConfigurationRevision Array | Maintains a list of Node configurations that were applied to the NodePool, along with data about each one. | Optional |
conditions | Condition Array | All conditions associated with the Node Pool | Optional |
RackStatus
RackStatus is the observed rack state for a Node Pool. This field is only populated for rack-based instance types.
| Field | Type | Description | Validation |
|---|---|---|---|
target | integer | Number of racks requested for the Node Pool | Optional |
current | integer | Number of racks for this Node Pool present in the cluster | Optional |
queued | integer | Number of racks queued awaiting capacity to join the Node Pool | Optional |
Pending Node configuration
The pendingNodeConfiguration field contains information about the configuration that is pending adoption on the NodePool. This configuration is not set as the active configuration without an explicit upgrade.
This field is only present if there are configuration updates available for your NodePool and your nodeConfigurationUpdateStrategy.type is set to Manual or OnSpecUpdate.
For example, if a new GPU driver or Kubernetes version is available, CKS creates a new configuration and sets it as pending on the Node Pool. The summary field provides more information about the available updates. See Manage Node Pool Configuration to promote a pending configuration.
| Field | Type | Description |
|---|---|---|
createdAt | timestamp | When the Node configuration was created |
nodeProfile | string | The unique identifier for the configuration |
summary | string array | A summary of the features for the configuration, such as ncore version, GPU driver version, and Kubernetes version. |
Node configuration revisions
The revisions list holds the history of the Node configurations that were applied to the NodePool. That is, configurations that were at one point the status.nodeProfile.
The list is sorted by creation timestamp, and can be used as a reference to roll back to a previous Node configuration if desired. See Manage Node Pool Configuration to roll back to a previous configuration.
| Field | Type | Description |
|---|---|---|
activeNodes | integer | The count of Nodes in the NodePool that are booted into this configuration |
createdAt | timestamp | The creation timestamp for the configuration |
nodeProfile | string | The unique identifier for the configuration |
disabled | boolean | Indicates if the configuration is blocked from being applied to more Nodes. A configuration gets disabled if it experiences successive boot failures. Contact Support if this field is set to true. |
summary | string array | A summary of the features for the configuration, such as ncore version, GPU driver version, and Kubernetes version. |
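
Because each revision carries createdAt, nodeProfile, and disabled, a rollback target can be picked programmatically. The following Python sketch is a hypothetical helper, not part of the CoreWeave CLI. It assumes createdAt is an RFC 3339 string and that a disabled revision should not be chosen, since it is blocked from being applied to more Nodes.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def rollback_candidate(revisions: list[dict], active_profile: str) -> Optional[dict]:
    """Return the most recently created revision that is neither active nor disabled."""
    eligible = [
        rev for rev in revisions
        if rev["nodeProfile"] != active_profile and not rev.get("disabled", False)
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda rev: parse_timestamp(rev["createdAt"]))
```

The revision returned here is only a suggestion; the rollback itself is performed as described in Manage Node Pool Configuration.
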
Conditions
Node Pools and the Nodes within them report status through conditions.

Node conditions
CoreWeave sets the following conditions on Nodes in a Node Pool.

| Condition Name | Type | Description |
|---|---|---|
CWActive | bool | true if the Node is active. If false, and the Node Pool is being scaled down, the Node may be selected for removal. |
CWRegistered | bool | true if the Node is registered. If false, the Node is not registered. Nodes are registered when they enter a customer cluster as part of the Node lifecycle. |
CWNodeRemoval | bool | true if the Node is pending removal. |
Prefill | string (reason) | Set on Nodes in the prefill flow. Use the condition’s reason and message to interpret state. LastTransitionTime indicates when the reason last changed; LastHeartbeatTime indicates when the condition was last processed. See Prefill condition reasons. |
Prefill condition reasons
When Node Pool Prefill is enabled, the Prefill condition on a Node uses the following reasons:
| Reason | Description |
|---|---|
AwaitingReplacement | Node is marked for prefill; a replacement Node is being provisioned. |
AwaitingIdleTimeout | Replacement Node is in the cluster. CKS waits up to the idle timeout (spec.prefill.timeout) for this Node to become idle, then removes it. |
TimeoutExceeded | The Node did not become idle within the idle timeout (spec.prefill.timeout), so CKS sets the Node to drain. CKS does not forcibly remove the Node; CKS removes it once it becomes idle. |
Node Pool conditions
CoreWeave sets the following conditions on a Node Pool after the Node Pool resource has been created.

Condition: Validated
This condition answers the question: “Is the Node Pool configuration valid?” It has three possible statuses:
| Status | Description |
|---|---|
Valid | The Node Pool configuration is correct. |
Invalid | The Node Pool configuration has errors, such as an unsupported instance type or incorrect Node affinity. |
InternalError | A system error occurred during validation, so the Node Pool couldn’t be fully checked. |
Condition: AtTarget
This condition shows whether the Node Pool has the expected number of active Nodes. The AtTarget condition has five possible values:
| Status | Description |
|---|---|
TargetMet | The Node Pool has exactly the number of Nodes specified in the target. |
PendingDelete | The Node Pool is being deleted, and its Nodes are being removed. |
OverTarget | More Nodes exist than the target. Extra Nodes are removed using the ScaleDownStrategy. |
UnderTarget | Fewer Nodes exist than the target. New Nodes are created if resources are available. |
InternalError | A system error occurred while retrieving Node information for the Node Pool. |
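
When a Node Pool is OverTarget, the lifecycle.scaleDownStrategy setting determines which Nodes may be removed. The Python sketch below illustrates the documented difference between IdleOnly and PreferIdle. It is not the controller's actual selection algorithm: the function name, the `(name, active)` tuple representation, and the ordering among Nodes of the same kind are assumptions made for the example.

```python
def select_for_removal(nodes: list[tuple[str, bool]], target: int, strategy: str) -> list[str]:
    """Pick Node names to remove so the pool moves toward `target` Nodes.

    nodes: (name, active) pairs; active=False marks an idle Node (hypothetical representation).
    """
    excess = len(nodes) - target
    if excess <= 0:
        return []  # At or under target: nothing to remove.
    idle = [name for name, active in nodes if not active]
    busy = [name for name, active in nodes if active]
    if strategy == "IdleOnly":
        # Only idle Nodes are eligible, so the pool may stay over target.
        return idle[:excess]
    if strategy == "PreferIdle":
        # Idle Nodes go first, then active Nodes if more removals are needed.
        return (idle + busy)[:excess]
    raise ValueError(f"unknown strategy: {strategy}")
```

For a pool of three Nodes with one idle and a target of one, IdleOnly selects only the idle Node, while PreferIdle selects the idle Node and then one active Node.
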
Condition: NodesRemoved
The NodesRemoved condition is applied to a Node Pool when it is pending deletion and its Nodes are being removed. Once all Nodes are removed, the Node Pool is deleted. It has one of the following statuses:
| Status | Meaning |
|---|---|
Complete | All Nodes have been removed from the Node Pool, and the Node Pool’s deletion is imminent. |
Pending | Nodes are in the process of being removed from the Node Pool. |
InternalError | An internal error has occurred while trying to remove Nodes from the Node Pool. |
Condition: Capacity
The Capacity condition indicates whether enough capacity is available for the requested number of Nodes of the requested instance type. It has one of the following statuses:
| Status | Meaning |
|---|---|
Sufficient | Sufficient capacity is available to fulfill the requested number of Nodes. |
Partial | Partial capacity exists in the designated Region to fulfill the request, but not to completely fulfill it. |
NoneAvailable | No Nodes are available of the requested type in the given Region. |
NoneAvailableNodeAffinity | No Nodes are available for the requested instance type due to Affinity constraints. |
PartialNodeAffinity | Partial capacity is available to fulfill the requested targetNodes. Full capacity is not available due to Affinity constraints. |
QueuedAwaitingCapacity | Request for additional Nodes has been queued and is waiting for additional capacity to become available. |
InternalError | An internal error has occurred while attempting to check capacity. |
Condition: Quota
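The Quota condition indicates whether the organization is within its quota for the Node Pool’s instance type. It has four possible statuses: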
| Status | Meaning |
|---|---|
Under | The organization is under quota for the Node Pool’s instance type. |
Over | The organization is over quota for the Node Pool’s instance type. |
NotSet | The organization does not have a quota set for the Node Pool’s instance type. |
InternalError | An internal error has occurred attempting to check the quota. |
Condition: NodeReconfigurationRequired
This condition answers the question: “Do any of my Nodes need to be reconfigured to boot into the active config (status.nodeProfile)?” It has three possible statuses:
| Status | Description |
|---|---|
StagedNodeConfig | The Node Pool has Nodes that need to be reconfigured to boot into the active config. |
AllNodesUpToDate | All of the Nodes in the Node Pool are booted into the active config. |
InternalError | A system error occurred fetching the active configuration for Nodes. It’s also possible there was a failure staging the Node Pool’s active Node configuration onto outdated Nodes. See the condition message for more information. |
Healthy Node Pools
A new Node Pool in a healthy state looks like this when described:

[Image: a new Node Pool, showing new conditions with 'describe']
Events
The Node Pool controller emits the following Kubernetes Events.

| Event Name | Resource | Description |
|---|---|---|
CWDrainNode | NodePool | Fired when a Node is being drained. |
CWInstanceTypeNotInZone | NodePool | Fired when a Node Pool has an instance type not in its Zone. |
CWInsufficientCapacity | NodePool | Fired when there is not sufficient capacity for a Node Pool. |
CWInvalidInstanceType | NodePool | Fired when a Node Pool contains an invalid instance type. |
CWNodeAssigned | NodePool | Fired when a Node is assigned to a Node Pool. |
CWNodeDeleted | NodePool | Fired to indicate a Node has been deleted. |
CWNodeDeletionRequestSuccess | NodePool | Fired when the Node Pool Operator receives a request to delete a Node. |
CWNodeDeliverFail | NodePool | Fired when Node allocation to a Node Pool fails due to misconfigured Node Pool settings or internal issues. |
CWNodePoolCreated | NodePool | Fired when a Node Pool is created. |
CWNodePoolDeleted | NodePool | Fired when a Node Pool is deleted. |
CWNodePoolDisabled | NodePool | Fired when the Node Profile assigned to the Node Pool is marked as disabled, usually after repeated Node delivery failures. The Node Pool can’t schedule additional Nodes until Support re-enables the Node Profile or a new one is generated. |
CWNodePoolMetadata | NodePool | Fired when metadata is updated for a Node Pool. |
CWNodePoolNodesRemoved | NodePool | Fired when Nodes are removed from the Node Pool. |
CWNodePoolNodesRemoveError | NodePool | Fired when an error occurs during Node removal. |
CWNodePoolNodesRequestFailed | NodePool | Fired when an error is returned when updating the Node Pool. |
CWNodePoolQuotaCheckFailed | NodePool | Fired when there is an internal error checking the quota. |
CWNodePoolRemoveNodes | NodePool | Fired when attempting to scale down a Node Pool. |
CWNodePoolScaledDown | NodePool | Fired when a Node Pool is scaled down. |
CWNodePoolScaledUp | NodePool | Fired when a Node Pool is scaled up. |
CWNodeRegistered | Node, NodePool | Fired when Node registration succeeds. |
CWNodeRegistrationFailed | Node | Fired when Node registration fails. See the message for additional details. |
CWNodeRequestQueued | NodePool | Fired when a request is submitted to add Nodes to a Node Pool. |
CWOverQuota | NodePool | Fired when the quota is insufficient for the Node Pool’s targetNodes. |
CWNodeCordoned | Node, NodePool | Fired when a Node is being cordoned as part of internal automation. |
CWNodeUncordoned | Node, NodePool | Fired when a Node is being uncordoned as part of internal automation. |
CWNodeMarkedPrepareForTerminate | Node, NodePool | Fired when a Node has been sent the signal to prepare for termination. See event message for details. |
CWNodeDraining | Node, NodePool | Fired when a Node is being drained as part of internal automation. |
CWNodeDrainingPDBViolation | Node, NodePool | Fired when a Node is being drained and certain pods cannot be evicted due to a Pod Disruption Budget. |
CWNodeDrainingForceDelete | Node, NodePool | Fired when a Node is being drained as part of internal automation and pods were deleted ungracefully. |
CWNodePreparedForTerminate | Node, NodePool | Fired when a Node is prepared for termination. |
CWNodeTainted | Node, NodePool | Fired when a Node is tainted as part of internal automation. |
CWNodeUntainted | Node, NodePool | Fired when a Node is untainted as part of internal automation. |
CWNodeConfigStaged | NodePool | Fired when the active Node Config (status.nodeProfile) for the NodePool has been updated. This might be a result of staging the pendingNodeConfiguration, new user-initiated changes getting staged, or rolling back to a previous configuration. |
CWPendingNodeConfig | NodePool | Fired when the status.pendingNodeConfiguration is updated on the NodePool. |
CWConfigStaged | Node | Fired when a new Node configuration is staged on the Node by setting it as the desired NodeProfile during Node boot. |
CWReconfigureRebooted | Node | Fired after the Node has been reconfigure rebooted and is determined to be up to date with the staged NodeProfile of the NodePool. |
CWNodePoolNodeConfigUpdatePending | NodePool | Fired when a Node Config update is pending due to Nodes being mid-delivery. The update completes once Nodes have booted into the cluster. |
CWNodePoolNodeConfigUpdated | NodePool | Fired when the active Node Config (status.nodeProfile) for the NodePool has been updated. This might be a result of promoting the pendingNodeConfiguration or rolling back to a previous configuration. |
CWNodePoolNoPendingConfiguration | NodePool | Fired when a user attempts to promote the status.pendingNodeConfiguration to become the active configuration (status.nodeProfile), but there is no staged configuration. |
CWNodePoolNodeConfigRollbackFailed | NodePool | Fired when there was an error rolling back the active Node configuration for the NodePool to a previous config. |
CWNodePoolNodeConfigUpdateConflict | NodePool | Fired when a user simultaneously attempts to upgrade to the pendingNodeConfiguration and roll back to a previous configuration. |
CWNodeConfigStagingFailed | Node | Fired when CKS attempted to stage the NodePool’s active configuration on an existing Node in the cluster to prepare it for a reconfigure reboot, but was not able to do so. It keeps retrying until successful. |
CWManualUpgradeTrigger | NodePool | Fired when a user upgrades a NodePool to the pending configuration. |
CWManualRollbackTrigger | NodePool | Fired when a user rolls back a NodePool to a previous configuration. |