Node Pool Reference
Node Pools use the following API schema definitions:
Node Pool
Node Pool is the schema for the Node Pools API.
| Field | Type | Description | Default |
|---|---|---|---|
| apiVersion | string | The version of the Kubernetes API that the Node Pool uses. | compute.coreweave.com/v1alpha1 |
| kind | string | The type of resource being defined, such as NodePool. | NodePool |
| metadata | ObjectMeta | See the Kubernetes API documentation for metadata. | |
| spec | NodePoolSpec | The desired state of the Node Pool. | |
| status | NodePoolStatus | The observed state of the Node Pool. See NodePoolStatus. | |
Node Pool Spec
NodePoolSpec defines the desired state of a Node Pool.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
| computeClass | enum (default) | The type of Node Pool. default is used for Reserved and On-Demand instances and is the default value. | default | Optional |
| instanceType | string | Instance type for the Nodes in this Node Pool. This field is immutable. | N/A | Required |
| targetNodes | integer | The desired number of Nodes in the Node Pool. | N/A | Required |
| minNodes | integer | The minimum value of targetNodes allowed by the autoscaler. | N/A | Optional |
| maxNodes | integer | The maximum value of targetNodes allowed by the autoscaler. | N/A | Optional |
| nodeLabels | object (keys: string, values: string) | Labels applied to the Nodes in the Node Pool. | N/A | Optional |
| nodeAnnotations | object (keys: string, values: string) | Annotations applied to the Nodes in the Node Pool. | N/A | Optional |
| nodeTaints | Taint array | Taints applied to the Nodes in the Node Pool. | N/A | Optional |
| image | Image | The image to use for the Nodes in the Node Pool. | N/A | Optional |
| autoscaling | boolean | Enable or disable cluster autoscaling. | false | Optional |
| lifecycle.scaleDownStrategy | enum (IdleOnly, PreferIdle) | How Nodes are removed when scaling down to reach targetNodes. IdleOnly removes only idle Nodes. PreferIdle removes idle Nodes first, then active Nodes if needed. | IdleOnly | Optional |
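Putting the fields above together, a minimal NodePool manifest might look like the following sketch. The metadata name, label, and node counts are illustrative; the instance type is the one used in the healthy Node Pool example later on this page:

```yaml
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool        # hypothetical name
spec:
  instanceType: gd-8xa100-i128  # required and immutable
  targetNodes: 2                # required
  autoscaling: false            # optional; defaults to false
  nodeLabels:
    team: training              # hypothetical label
  lifecycle:
    scaleDownStrategy: IdleOnly # optional; defaults to IdleOnly
```

The image block is intentionally omitted, since it should only be set when directed by CoreWeave support.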
Image
Image defines the boot image that the Node Pool uses.
Image should be omitted from NodePoolSpec unless directed by CoreWeave support.
| Field | Type | Description | Default | Validation |
|---|---|---|---|---|
| kernel | string | Kernel version for this Node Pool. | N/A | Optional |
| name | string | Name of the image for this Node Pool. | N/A | Optional |
| releaseTrain | string | The release channel or track for the image. | stable | Optional. Enum: [stable, latest] |
If you need an image with NVSHMEM and GDRCopy support, you can request ncore-image v2.10.1. You'll need to apply a patch to ibgda in your container, enable GDRCopy, and then contact support for access to this image.
See NVSHMEM and GDRCopy for more detailed information.
Node Pool Status
NodePoolStatus is the observed state of a Node Pool.
| Field | Type | Description | Validation |
|---|---|---|---|
| inProgress | integer | Number of Nodes that have been assigned but have not yet fully booted into the cluster. | Optional. Minimum: 0 |
| currentNodes | integer | Number of Nodes for this Node Pool present in the cluster. | Optional. Minimum: 0 |
| nodeProfile | string | The NodeProfile string for this Node Pool. | Optional |
| conditions | Condition array | All conditions associated with the Node Pool. | Optional |
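As a quick check, the status fields above can be read directly with kubectl. This is a sketch assuming a Node Pool with the hypothetical name example-nodepool:

```
# Read the observed node counts from the NodePool status
kubectl get nodepool example-nodepool -o jsonpath='{.status.currentNodes}'
kubectl get nodepool example-nodepool -o jsonpath='{.status.inProgress}'
```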
Conditions
Node Conditions
CoreWeave sets the following conditions on Nodes in a Node Pool.
| Condition Name | Type | Description |
|---|---|---|
| CWActive | bool | true if the Node is active. If false, and the Node Pool is being scaled down, the Node may be selected for removal. |
| CWRegistered | bool | true if the Node is registered. Nodes are registered when they enter a customer cluster as part of the Node lifecycle. |
| CWNodeRemoval | bool | true if the Node is pending removal. |
Node Pool Conditions
CoreWeave sets the following conditions on a Node Pool after the Node Pool resource has been created.
Condition: Validated
This condition answers the question: "Is the Node Pool configuration valid?" It has three possible statuses:
| Status | Description |
|---|---|
| Valid | The Node Pool configuration is correct. |
| Invalid | The Node Pool configuration has errors, such as an unsupported instance type or incorrect Node affinity. |
| InternalError | A system error occurred during validation, so the Node Pool couldn't be fully checked. |
Condition: AtTarget
This condition shows whether the Node Pool has the expected number of active Nodes. The AtTarget condition has five possible values:
| Status | Description |
|---|---|
| TargetMet | The Node Pool has exactly the number of Nodes specified in the target. |
| PendingDelete | The Node Pool is being deleted, and its Nodes will be removed. |
| OverTarget | More Nodes exist than the target. Extra Nodes will be removed using the ScaleDownStrategy. |
| UnderTarget | Fewer Nodes exist than the target. New Nodes will be created if resources are available. |
| InternalError | A system error occurred while retrieving Node information for the Node Pool. |
Condition: NodesRemoved
The condition NodesRemoved is applied to a Node Pool when it is pending deletion and Nodes are in the process of being removed. Once all Nodes are removed, the Node Pool will be deleted. This response indicates one of the following conditions:
| Status | Meaning |
|---|---|
| Complete | All Nodes have been removed from the Node Pool, and the Node Pool's deletion is imminent. |
| Pending | Nodes are in the process of being removed from the Node Pool. |
| InternalError | An internal error has occurred while trying to remove Nodes from the Node Pool. |
Condition: Capacity
The Capacity condition indicates whether there is enough capacity available for the requested number of Nodes in the requested instance type. This response indicates one of the following conditions:
| Status | Meaning |
|---|---|
| Sufficient | Sufficient capacity is available for the requested instance type. |
| Partial | Partial capacity exists in the designated Region, but not enough to completely fulfill the request. |
| NoneAvailable | No Nodes of the requested type are available in the given Region. |
| NoneAvailableNodeAffinity | No Nodes are available for the requested instance type due to Affinity constraints. |
| PartialNodeAffinity | Partial capacity is available to fulfill the requested targetNodes, but full capacity is not available due to Affinity constraints. |
| InternalError | An internal error has occurred while attempting to check capacity. |
Condition: Quota
The Quota condition indicates whether the organization is within quota for the Node Pool's instance type. It has four statuses:
| Status | Meaning |
|---|---|
| Under | The organization is under quota for the Node Pool's instance type. |
| Over | The organization is over quota for the Node Pool's instance type. |
| NotSet | The organization does not have a quota set for the Node Pool's instance type. |
| InternalError | An internal error has occurred attempting to check the quota. |
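The four condition types above can be combined into a simple health check. Here is a minimal sketch in Python, assuming the conditions are available as a list of dicts (for example, parsed from `kubectl get nodepool -o json`); the function name and data shapes are illustrative:

```python
# Expected "healthy" reason for each condition type, per the tables above.
HEALTHY_REASONS = {
    "Validated": "Valid",
    "Capacity": "Sufficient",
    "Quota": "Under",
    "AtTarget": "TargetMet",
}

def unhealthy_conditions(conditions):
    """Return the condition types whose reason is not the healthy one.

    `conditions` is a list of dicts shaped like entries in a NodePool's
    status.conditions (fields: type, reason, status, message).
    Condition types without a known healthy reason are ignored.
    """
    problems = []
    for cond in conditions:
        expected = HEALTHY_REASONS.get(cond.get("type"))
        if expected is not None and cond.get("reason") != expected:
            problems.append(cond["type"])
    return problems

conditions = [
    {"type": "Validated", "reason": "Valid"},
    {"type": "Capacity", "reason": "Partial"},
    {"type": "Quota", "reason": "Under"},
    {"type": "AtTarget", "reason": "UnderTarget"},
]
print(unhealthy_conditions(conditions))  # ['Capacity', 'AtTarget']
```

A fully healthy Node Pool, like the one described in the next section, returns an empty list.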
Healthy Node Pools
A new Node Pool in a healthy state looks like this when described:

```
$ kubectl describe nodepools example-nodepool
Name:         example-nodepool
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  compute.coreweave.com/v1alpha1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2025-06-09T14:48:54Z
  Generation:          1
  Resource Version:    857370
  UID:                 9311678d-4064-45b8-8439-b943250e5852
Spec:
  Autoscaling:    false
  Instance Type:  gd-8xa100-i128
  Lifecycle:
    Disable Unhealthy Node Eviction:  true
    Scale Down Strategy:              IdleOnly
  Max Nodes:     0
  Min Nodes:     0
  Target Nodes:  1
Status:
  Conditions:
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool configuration is valid.
    Reason:                Valid
    Status:                True
    Type:                  Validated
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               Sufficient capacity available for the requested instance type.
    Reason:                Sufficient
    Status:                True
    Type:                  Capacity
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool is within instance type quota.
    Reason:                Under
    Status:                True
    Type:                  Quota
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool has reached its target node count.
    Reason:                TargetMet
    Status:                True
    Type:                  AtTarget
  Current Nodes:  1
  In Progress:    0
Events:           <none>
```
Events
| Event Name | Resource | Description |
|---|---|---|
CWDrainNode | NodePool | Fired when a Node is being drained. |
CWInstanceTypeNotInZone | NodePool | Fired when a Node Pool has an instance type not in its Zone. |
CWInsufficientCapacity | NodePool | Fired when there is not sufficient capacity for a Node Pool. |
CWInvalidInstanceType | NodePool | Fired when a Node Pool contains an invalid instance type. |
CWNodeAssigned | NodePool | Fired when a Node is assigned to a Node Pool. |
CWNodeDeleted | NodePool | Fired to indicate a Node has been deleted. |
CWNodeDeletionRequestSuccess | NodePool | Fired when Node Pool Operator receives a request to delete a Node. |
CWNodeDeliverFail | NodePool | Fired when Node allocation to a Node Pool fails due to misconfigured Node Pool settings or internal issues. |
CWNodePoolCreated | NodePool | Fired when a Node Pool is created. |
CWNodePoolDeleted | NodePool | Fired when a Node Pool is deleted. |
CWNodePoolDisabled | NodePool | Fired when a Node Pool is disabled because the Node is misconfigured and is causing too many boot failures. Contact Support to diagnose and resolve. |
CWNodePoolMetadata | NodePool | Fired when metadata is updated for a Node Pool. |
CWNodePoolNodesRemoved | NodePool | Fired when Nodes are removed from the Node Pool. |
CWNodePoolNodesRemoveError | NodePool | Fired when an error occurs during Node removal. |
CWNodePoolNodesRequestFailed | NodePool | Fired when an error is returned while updating the Node Pool.
CWNodePoolQuotaCheckFailed | NodePool | Fired when there is an internal error checking the quota. |
CWNodePoolRemoveNodes | NodePool | Fired when attempting to scale down a Node Pool. |
CWNodePoolScaledDown | NodePool | Fired when a Node Pool is scaled down. |
CWNodePoolScaledUp | NodePool | Fired when a Node Pool is scaled up. |
CWNodeRegistered | Node, NodePool | Fired when Node registration succeeds. |
CWNodeRegistrationFailed | Node | Fired when Node registration fails. See the message for additional details. |
CWNodeRequestQueued | NodePool | Fired when a request is submitted to add Nodes to a Node Pool. |
CWOverQuota | NodePool | Fired when the quota is insufficient for the Node Pool's targetNodes. |
CWNodeCordoned | Node, NodePool | Fired when a Node is being cordoned as part of internal automation. |
CWNodeUncordoned | Node, NodePool | Fired when a Node is being uncordoned as part of internal automation. |
CWNodeMarkedPrepareForTerminate | Node, NodePool | Fired when a Node has been sent the signal to prepare for termination. See event message for details. |
CWNodeDraining | Node, NodePool | Fired when a Node is being drained as part of internal automation. |
CWNodeDrainingPDBViolation | Node, NodePool | Fired when a Node is being drained and certain pods cannot be evicted due to a Pod Disruption Budget. |
CWNodeDrainingForceDelete | Node, NodePool | Fired when a Node is being drained as part of internal automation and pods were deleted ungracefully. |
CWNodePreparedForTerminate | Node, NodePool | Fired when a Node is prepared for termination. |
CWNodeTainted | Node, NodePool | Fired when a Node is tainted as part of internal automation. |
CWNodeUntainted | Node, NodePool | Fired when a Node is untainted as part of internal automation. |
CWNodePoolNodeConfigUpdated | NodePool | Fired when a NodePool's Node Config is updated. Nodes will need to be rebooted for the update to take effect. |
CWNodePoolNodeConfigUpdatePending | NodePool | Fired when a Node Config update is pending due to nodes being mid-delivery. The update will complete once Nodes have booted into the cluster. |
CWNodePoolNodeConfigUpdateFailed | NodePool | Fired when a Node Config Update fails due to an internal error. The update will retry immediately. |
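Because these events are attached to NodePool (and sometimes Node) objects, they can be inspected with standard kubectl field selectors. This is a sketch assuming a cluster context with Node Pools and the hypothetical name example-nodepool:

```
# List recent events emitted for NodePool objects
kubectl get events --field-selector involvedObject.kind=NodePool

# Narrow to a single Node Pool
kubectl get events --field-selector involvedObject.kind=NodePool,involvedObject.name=example-nodepool
```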
Node Alerts
| Alert Name | Description |
|---|---|
DCGMSRAMThresholdExceeded | This alert fires when the SRAM threshold has been exceeded on a GPU, indicating a memory issue that requires investigation by reliability teams. |
DPUContainerdThreadExhaustion | The DPUContainerdThreadExhaustion alert indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
DPUContainerdThreadExhaustionCPX | The DPUContainerdThreadExhaustion alert indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
DPULinkFlappingCPX | The DPULinkFlapping alert indicates that a DPU (Data Processing Unit) link has become unstable. It specifically triggers when a link on a DPU flaps (goes up and down) multiple times within a monitoring period. |
DPUNetworkFrameErrs | The DPUNetworkFrameErrs alert indicates frame errors occurring on DPU (Data Processing Unit) network interfaces. These errors typically indicate a problem with the physical network link. |
DPURouteCountMismatch | The DPURouteCountMismatch alert indicates an inconsistency in routes between what the DPU learns and has installed. A software component on the DPU will need to be restarted. |
DPURouteLoop | The DPURouteLoop alert indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
DPURouteLoopCPX | The DPURouteLoop alert indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
DPUUnexpectedPuntedRoutes | The DPUUnexpectedPuntedRoutes alert indicates a failure in offloading which can cause connectivity issues for the host. Node will be automatically reset to restore proper connectivity. |
DPUUnexpectedPuntedRoutesCPX | The DPUUnexpectedPuntedRoutes alert indicates a failure in offloading which can cause connectivity issues for the host. The issue typically occurs after a power reset (when the host reboots without the DPU rebooting). |
ECCDoubleVolatileErrors | ECCDoubleVolatileErrors is an alert that indicates when DCGM double-bit volatile ECC (Error Correction Code) errors are increasing over a 5-minute period on a GPU. |
GPUContainedECCError | GPU Contained ECC Error (Xid 94) indicates an uncorrectable memory error was encountered and contained. The workload has been impacted, but the node is generally healthy. No action needed. |
GPUECCUncorrectableErrorUncontained | GPU Uncorrectable Error Uncontained (Xid 95) indicates an uncorrectable memory error was encountered but not successfully contained. The workload has been impacted and the node will be restarted. |
GPUFallenOffBus | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
GPUFallenOffBusHGX | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
GPUNVLinkSWDefinedError | NVLink SW Defined Error (Xid 155) is triggered by link-down events that are flagged as intentional. The node will be reset. |
GPUPGraphicsEngineError | GPU Graphics Engine Error (Xid 69) has impacted the workload, but the node is generally healthy. No action needed. |
GPUPRowRemapFailure | GPU Row Remap Failure (Xid 64) is caused by an uncorrectable error resulting in a failed GPU memory remapping event. The node will immediately and automatically be taken out of service. |
GPUTimeoutError | GPU Timeout Error (Xid 46) indicates the GPU stopped processing, and the node will be restarted. |
GPUUncorrectableDRAMError | GPU Uncorrectable DRAM Error (Xid 171) provides complementary information to Xid 48. No action is needed. |
GPUUncorrectableSRAMError | GPU Uncorrectable SRAM Error (Xid 172) provides complementary information to Xid 48. No action is needed. |
GPUVeryHot | The GPUVeryHot alert triggers when a GPU's temperature exceeds 90°C. |
KubeNodeNotReady | The KubeNodeNotReady alert indicates when a node's status condition is not Ready in a Kubernetes cluster. This alert can be an indicator of critical system health issues. |
KubeNodeNotReadyHGX | The KubeNodeNotReadyHGX alert indicates that a node has been unready or offline for more than 15 minutes. |
ManyUCESingleBankH100 | The ManyUCESingleBankH100 alert triggers when there are two or more DRAM Uncorrectable Errors (UCEs) on the same row remapper bank of an H100 GPU. |
MetalDevRedfishError | The MetalDevRedfishError alert indicates an out-of-band action against a BMC failed. |
NVL72GPUHighFECCKS | The NVL72GPUHighFECCKS alert indicates that a GPU is observing a high rate of forward error correction indicating signal integrity issues. |
NVL72SwitchHighFECCKS | The NVL72SwitchHighFECCKS alert indicates that a NVSwitch is observing a high rate of forward error correction indicating signal integrity issues. |
NVLinkDomainFullyTriaged | NVLinkDomainFullyTriaged indicates the rack is entirely triaged. The rack should either be investigated for an unexpected rack-level event or returned to the fleet. |
NVLinkDomainProductionNodeCountLow | NVLinkDomainProductionNodeCountLow indicates the rack has fewer nodes in a production state than expected. The rack will need manual intervention to either restore capacity or be reclaimed for further triage. |
NodeBackendLinkFault | The NodeBackendLinkFault alert indicates that the backend bandwidth is degraded and the interface may be potentially lost. |
NodeBackendMisconnected | Node-to-leaf ports are either missing or incorrectly connected. |
NodeCPUHZThrottleLong | An extended period of CPU frequency throttling has occurred. CPU throttling most often occurs due to power delivery or thermal problems at the node level. The node will immediately and automatically be taken out of service and the job interrupted. |
NodeGPUNVLinkDown | The node is experiencing NVLink issues and will be automatically triaged. |
NodeGPUXID149NVSwitch | A GPU has experienced a fatal NVLink error. The node will be restarted to recover the GPU. |
NodeGPUXID149s4aLinkIssueFordPintoRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
NodeGPUXID149s4aLinkIssueLamboRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
NodeGPUXID149s4aLinkIssueNeedsUpgradeRepeated | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
NodeLoadAverageHigh | The NodeLoadAverageHigh alert triggers when a node's load average exceeds 1000 for more than 15 minutes. |
NodeMemoryError | The NodeMemoryError alert indicates that a node has one or more bad DIMM (memory) modules. |
NodeNetworkReceiveErrs | The NodeNetworkReceiveErrs alert indicates that a network interface has encountered receive errors exceeding a 1% threshold over a 2-minute period, sustained for 1 hour. |
NodePCIErrorH100GPU | The NodePCIErrorH100GPU alert indicates when a GPU is experiencing PCI bus communication errors. |
NodePCIErrorH100PLX | The NodePCIErrorH100PLX alert indicates a high rate of PCIe bus errors occurring on the PLX switch that connects H100 GPUs. |
NodeRepeatUCE | The NodeRepeatUCE alert indicates that a node has experienced frequent GPU Uncorrectable ECC (UCE) errors. |
NodeVerificationFailureNVFabric | The node is experiencing NVLink issues and will be automatically triaged. |
NodeVerificationMegatronDeadlock | HPC-Perftest failed megatron_lm test due to possible deadlock. Node should be triaged. |
PendingStateExtendedTime | The PendingStateExtendedTime alert indicates that a node has been in a pending state for an extended period of time. This alert helps identify nodes that need to be removed from their current state but are stuck for an extended time. |
PendingStateExtendedTimeLowGpuUtil | The PendingStateExtendedTimeLowGpuUtil alert triggers when a node has been in a pending state for more than 10 days and has had less than 1% GPU utilization in the last hour. This alert helps indicate if a node needs to be removed from its current state but has been stuck for an extended time. |
UnknownNVMLErrorOnContainerStart | The UnknownNVMLErrorOnContainerStart alert typically indicates that a GPU has fallen off the bus or is experiencing hardware issues. |