
Node Pool Reference

Node Pools use the following API schema definitions:

Node Pool

Node Pool is the schema for the Node Pools API.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `apiVersion` | string | The version of the Kubernetes API that the NodePool uses. | `compute.coreweave.com/v1alpha1` |
| `kind` | string | The type of resource being defined, such as `NodePool` or `NodePoolList`. | |
| `metadata` | ListMeta | See the Kubernetes API documentation for `metadata`. | |
| `spec` | NodePoolSpec | The desired state of the NodePool. | |
| `status` | NodePoolStatus | The observed state of the NodePool. See NodePoolStatus. | |

NodePoolSpec

NodePoolSpec defines the desired state of a NodePool.

| Field | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| `computeClass` | enum (`default`) | The type of Node Pool. `default` is used for Reserved and On-Demand instances. | `default` | Optional |
| `instanceType` | string | Instance Type for the Nodes in this Node Pool. Note: `instanceType` is immutable and cannot be changed after creation. | N/A | Required |
| `targetNodes` | integer | The quantity of desired Nodes in the Node Pool. | N/A | Required |
| `minNodes` | integer | The minimum value of `targetNodes` allowed by the autoscaler. | N/A | Optional |
| `maxNodes` | integer | The maximum value of `targetNodes` allowed by the autoscaler. | N/A | Optional |
| `nodeLabels` | object (keys: string, values: string) | Labels applied to the Nodes in the Node Pool. | N/A | Optional |
| `nodeAnnotations` | object (keys: string, values: string) | Annotations applied to the Nodes in the Node Pool. | N/A | Optional |
| `nodeTaints` | Taint array | Taints applied to the Nodes in the Node Pool. | N/A | Optional |
| `image` | Image | The image to use for the Nodes in the Node Pool. | N/A | Optional |
| `autoscaling` | boolean | Enable or disable cluster autoscaling. | `false` | Optional |
| `lifecycle.scaleDownStrategy` | enum (`IdleOnly`, `PreferIdle`) | How Nodes are removed when scaling down to reach `targetNodes`. `IdleOnly` removes only idle Nodes. `PreferIdle` removes idle Nodes first, then active Nodes if needed. | `IdleOnly` | Optional |
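Putting the spec fields together, a minimal NodePool manifest might look like the following sketch. The name and label values are placeholders, and the instance type is the one used in the example output later on this page; `apiVersion` and `kind` follow the schema table above.

```yaml
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool          # placeholder name
spec:
  instanceType: gd-8xa100-i128    # required; immutable after creation
  targetNodes: 1                  # required
  autoscaling: false              # optional; defaults to false
  nodeLabels:
    team: example                 # placeholder label
  lifecycle:
    scaleDownStrategy: IdleOnly   # default; PreferIdle also removes active Nodes
```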

Image

Image defines what boot image the NodePoolSpec uses.

Warning

Image should be omitted from NodePoolSpec unless directed by CoreWeave support.

| Field | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| `kernel` | string | Kernel version for this Node Pool. | | Optional |
| `name` | string | Name of the image for this Node Pool. | | Optional |
| `releaseTrain` | string | The release channel or track for the image. | `stable` | Optional. Enum: [`stable`, `latest`] |
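If CoreWeave support directs you to set an image, the fields above map onto `spec.image` in the NodePoolSpec. The sketch below is for illustration only; per the warning above, omit this field unless instructed:

```yaml
spec:
  image:
    releaseTrain: stable  # stable (default) or latest
    # name and kernel are set only with values provided by support
```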
NVSHMEM + GDRCopy support

If you need an image with NVSHMEM and GDRCopy support, contact support to request access to ncore-image v2.10.1. You'll also need to apply a patch to ibgda in your container and enable GDRCopy.

See NVSHMEM and GDRCopy for more detailed information.

NodePoolStatus

NodePoolStatus is the observed state of a NodePool.

| Field | Type | Description | Validation |
| --- | --- | --- | --- |
| `inProgress` | integer | Number of Nodes that have been assigned but have not yet fully booted into the cluster. | Optional, Minimum |
| `currentNodes` | integer | Number of Nodes for this Node Pool present in the cluster. | Optional, Minimum |
| `nodeProfile` | string | The string representation of the NodeProfile for this Node Pool. | Optional |
| `conditions` | Condition array | All conditions associated with the Node Pool. | Optional |

Conditions

Node Conditions

CoreWeave sets the following conditions on Nodes in a Node Pool.

| Condition Name | Type | Description |
| --- | --- | --- |
| `CWActive` | bool | `true` if the Node is active. If `false` and the Node Pool is being scaled down, the Node may be selected for removal. |
| `CWRegistered` | bool | `true` if the Node is registered. Nodes are registered when they enter a customer cluster as part of the Node lifecycle. |
| `CWNodeRemoval` | bool | `true` if the Node is pending removal. |

Node Pool Conditions

CoreWeave sets the following conditions on a Node Pool after the Node Pool resource has been created.

Condition: Validated

This condition answers the question: "Is the Node Pool configuration valid?" It has three possible statuses:

| Status | Description |
| --- | --- |
| `Valid` | The Node Pool configuration is correct. |
| `Invalid` | The Node Pool configuration has errors, such as an unsupported instance type or incorrect Node affinity. |
| `InternalError` | A system error occurred during validation, so the Node Pool couldn't be fully checked. |

Condition: AtTarget

This condition shows whether the Node Pool has the expected number of active Nodes. The AtTarget condition has five possible values:

| Status | Description |
| --- | --- |
| `TargetMet` | The Node Pool has exactly the number of Nodes specified in the target. |
| `PendingDelete` | The Node Pool is being deleted, and its Nodes will be removed. |
| `OverTarget` | More Nodes exist than the target. Extra Nodes will be removed using the `scaleDownStrategy`. |
| `UnderTarget` | Fewer Nodes exist than the target. New Nodes will be created if resources are available. |
| `InternalError` | A system error occurred while retrieving Node information for the Node Pool. |

Condition: NodesRemoved

The NodesRemoved condition is applied to a Node Pool when it is pending deletion and Nodes are in the process of being removed. Once all Nodes are removed, the Node Pool is deleted. The status is one of the following:

| Status | Meaning |
| --- | --- |
| `Complete` | All Nodes have been removed from the Node Pool, and the Node Pool's deletion is imminent. |
| `Pending` | Nodes are in the process of being removed from the Node Pool. |
| `InternalError` | An internal error occurred while trying to remove Nodes from the Node Pool. |

Condition: Capacity

The Capacity condition indicates whether enough capacity is available for the requested number of Nodes of the requested instance type. The status is one of the following:

| Status | Meaning |
| --- | --- |
| `Sufficient` | Sufficient capacity exists in the designated Region to fulfill the requested `targetNodes`. |
| `Partial` | Partial capacity exists in the designated Region to fulfill the request, but not to completely fulfill it. |
| `NoneAvailable` | No Nodes of the requested type are available in the given Region. |
| `NoneAvailableNodeAffinity` | No Nodes are available for the requested instance type due to Affinity constraints. |
| `PartialNodeAffinity` | Partial capacity is available to fulfill the requested `targetNodes`, but full capacity is not available due to Affinity constraints. |
| `InternalError` | An internal error occurred while attempting to check capacity. |

Condition: Quota

The Quota condition has four statuses:

| Status | Meaning |
| --- | --- |
| `Under` | The organization is under quota for the Node Pool's instance type. |
| `Over` | The organization is over quota for the Node Pool's instance type. |
| `NotSet` | The organization does not have a quota set for the Node Pool's instance type. |
| `InternalError` | An internal error occurred while attempting to check the quota. |
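Taken together, a Node Pool is in a healthy steady state when its four conditions report `Valid`, `Sufficient`, `Under` (or `NotSet`), and `TargetMet`. A minimal sketch of that check, written against the condition fields documented above (the `nodepool_healthy` helper itself is illustrative and not part of any CoreWeave SDK):

```python
# Healthy "reason" values per condition type, from the tables above.
HEALTHY_REASONS = {
    "Validated": {"Valid"},
    "Capacity": {"Sufficient"},
    "Quota": {"Under", "NotSet"},
    "AtTarget": {"TargetMet"},
}

def nodepool_healthy(conditions):
    """Return (healthy, problems) for a list of NodePool status conditions."""
    by_type = {c["type"]: c for c in conditions}
    problems = []
    for ctype, ok_reasons in HEALTHY_REASONS.items():
        cond = by_type.get(ctype)
        if cond is None:
            problems.append(f"{ctype}: condition missing")
        elif cond.get("reason") not in ok_reasons:
            problems.append(f"{ctype}: {cond.get('reason')} - {cond.get('message', '')}")
    return (len(problems) == 0, problems)

# Example: the conditions of a healthy Node Pool.
conds = [
    {"type": "Validated", "status": "True", "reason": "Valid"},
    {"type": "Capacity", "status": "True", "reason": "Sufficient"},
    {"type": "Quota", "status": "True", "reason": "Under"},
    {"type": "AtTarget", "status": "True", "reason": "TargetMet"},
]
healthy, problems = nodepool_healthy(conds)  # healthy is True, problems is []
```

Any `InternalError` reason, or a reason such as `UnderTarget` or `Over`, would surface in the `problems` list rather than counting as healthy.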

Healthy Node Pools

A new Node Pool in a healthy state looks like this when described:

```
$ kubectl describe nodepools example-nodepool
```

Example output:

```
Name:         example-nodepool
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  compute.coreweave.com/v1alpha1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2025-06-09T14:48:54Z
  Generation:          1
  Resource Version:    857370
  UID:                 9311678d-4064-45b8-8439-b943250e5852
Spec:
  Autoscaling:    false
  Instance Type:  gd-8xa100-i128
  Lifecycle:
    Disable Unhealthy Node Eviction:  true
    Scale Down Strategy:              IdleOnly
  Max Nodes:     0
  Min Nodes:     0
  Target Nodes:  1
Status:
  Conditions:
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool configuration is valid.
    Reason:                Valid
    Status:                True
    Type:                  Validated
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               Sufficient capacity available for the requested instance type.
    Reason:                Sufficient
    Status:                True
    Type:                  Capacity
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool is within instance type quota.
    Reason:                Under
    Status:                True
    Type:                  Quota
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool has reached its target node count.
    Reason:                TargetMet
    Status:                True
    Type:                  AtTarget
  Current Nodes:  1
  In Progress:    0
Events:           <none>
```

Events

| Event Name | Resource | Description |
| --- | --- | --- |
| `CWDrainNode` | NodePool | Fired when a Node is being drained. |
| `CWInstanceTypeNotInZone` | NodePool | Fired when a Node Pool has an instance type not in its Zone. |
| `CWInsufficientCapacity` | NodePool | Fired when there is not sufficient capacity for a Node Pool. |
| `CWInvalidInstanceType` | NodePool | Fired when a Node Pool contains an invalid instance type. |
| `CWNodeAssigned` | NodePool | Fired when a Node is assigned to a Node Pool. |
| `CWNodeDeleted` | NodePool | Fired to indicate a Node has been deleted. |
| `CWNodeDeletionRequestSuccess` | NodePool | Fired when the Node Pool Operator receives a request to delete a Node. |
| `CWNodeDeliverFail` | NodePool | Fired when Node allocation to a Node Pool fails due to misconfigured Node Pool settings or internal issues. |
| `CWNodePoolCreated` | NodePool | Fired when a Node Pool is created. |
| `CWNodePoolDeleted` | NodePool | Fired when a Node Pool is deleted. |
| `CWNodePoolDisabled` | NodePool | Fired when a Node Pool is disabled because the Node is misconfigured and is causing too many boot failures. Contact Support to diagnose and resolve. |
| `CWNodePoolMetadata` | NodePool | Fired when metadata is updated for a Node Pool. |
| `CWNodePoolNodesRemoved` | NodePool | Fired when Nodes are removed from the Node Pool. |
| `CWNodePoolNodesRemoveError` | NodePool | Fired when an error occurs during Node removal. |
| `CWNodePoolNodesRequestFailed` | NodePool | Fired when an error is returned when updating the Node Pool. |
| `CWNodePoolQuotaCheckFailed` | NodePool | Fired when there is an internal error checking the quota. |
| `CWNodePoolRemoveNodes` | NodePool | Fired when attempting to scale down a Node Pool. |
| `CWNodePoolScaledDown` | NodePool | Fired when a Node Pool is scaled down. |
| `CWNodePoolScaledUp` | NodePool | Fired when a Node Pool is scaled up. |
| `CWNodeRegistered` | Node, NodePool | Fired when Node registration succeeds. |
| `CWNodeRegistrationFailed` | Node | Fired when Node registration fails. See the message for additional details. |
| `CWNodeRequestQueued` | NodePool | Fired when a request is submitted to add Nodes to a Node Pool. |
| `CWOverQuota` | NodePool | Fired when the quota is insufficient for the Node Pool's `targetNodes`. |
| `CWNodeCordoned` | Node, NodePool | Fired when a Node is being cordoned as part of internal automation. |
| `CWNodeUncordoned` | Node, NodePool | Fired when a Node is being uncordoned as part of internal automation. |
| `CWNodeMarkedPrepareForTerminate` | Node, NodePool | Fired when a Node has been sent the signal to prepare for termination. See the event message for details. |
| `CWNodeDraining` | Node, NodePool | Fired when a Node is being drained as part of internal automation. |
| `CWNodeDrainingPDBViolation` | Node, NodePool | Fired when a Node is being drained and certain Pods cannot be evicted due to a Pod Disruption Budget. |
| `CWNodeDrainingForceDelete` | Node, NodePool | Fired when a Node is being drained as part of internal automation and Pods were deleted ungracefully. |
| `CWNodePreparedForTerminate` | Node, NodePool | Fired when a Node is prepared for termination. |
| `CWNodeTainted` | Node, NodePool | Fired when a Node is tainted as part of internal automation. |
| `CWNodeUntainted` | Node, NodePool | Fired when a Node is untainted as part of internal automation. |
| `CWNodePoolNodeConfigUpdated` | NodePool | Fired when a NodePool's Node Config is updated. Nodes will need to be rebooted for the update to take effect. |
| `CWNodePoolNodeConfigUpdatePending` | NodePool | Fired when a Node Config update is pending due to Nodes being mid-delivery. The update will complete once Nodes have booted into the cluster. |
| `CWNodePoolNodeConfigUpdateFailed` | NodePool | Fired when a Node Config update fails due to an internal error. The update will retry immediately. |

Node alerts

| Alert Name | Description |
| --- | --- |
| `DCGMSRAMThresholdExceeded` | Indicates that the SRAM threshold has been exceeded on a GPU. This indicates a memory issue and requires investigation by reliability teams. |
| `DPUContainerdThreadExhaustion` | Indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| `DPUContainerdThreadExhaustionCPX` | Indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| `DPULinkFlappingCPX` | Indicates that a DPU (Data Processing Unit) link has become unstable. It triggers when a link on a DPU flaps (goes up and down) multiple times within a monitoring period. |
| `DPUNetworkFrameErrs` | Indicates frame errors occurring on DPU network interfaces. These errors typically indicate a problem with the physical network link. |
| `DPURouteCountMismatch` | Indicates an inconsistency between the routes the DPU learns and those it has installed. A software component on the DPU will need to be restarted. |
| `DPURouteLoop` | Indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| `DPURouteLoopCPX` | Indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| `DPUUnexpectedPuntedRoutes` | Indicates a failure in offloading, which can cause connectivity issues for the host. The Node will be automatically reset to restore proper connectivity. |
| `DPUUnexpectedPuntedRoutesCPX` | Indicates a failure in offloading, which can cause connectivity issues for the host. The issue typically occurs after a power reset (when the host reboots without the DPU rebooting). |
| `ECCDoubleVolatileErrors` | Indicates that DCGM double-bit volatile ECC (Error Correction Code) errors are increasing over a 5-minute period on a GPU. |
| `GPUContainedECCError` | GPU Contained ECC Error (Xid 94) indicates an uncorrectable memory error was encountered and contained. The workload has been impacted, but the Node is generally healthy. No action needed. |
| `GPUECCUncorrectableErrorUncontained` | GPU Uncorrectable Error Uncontained (Xid 95) indicates an uncorrectable memory error was encountered but not successfully contained. The workload has been impacted, and the Node will be restarted. |
| `GPUFallenOffBus` | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The Node will immediately and automatically be taken out of service. |
| `GPUFallenOffBusHGX` | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The Node will immediately and automatically be taken out of service. |
| `GPUNVLinkSWDefinedError` | NVLink SW Defined Error (Xid 155) is triggered by link-down events that are flagged as intentional. The Node will be reset. |
| `GPUPGraphicsEngineError` | GPU Graphics Engine Error (Xid 69) has impacted the workload, but the Node is generally healthy. No action needed. |
| `GPUPRowRemapFailure` | GPU Row Remap Failure (Xid 64) is caused by an uncorrectable error resulting in a failed GPU memory remapping event. The Node will immediately and automatically be taken out of service. |
| `GPUTimeoutError` | GPU Timeout Error (Xid 46) indicates the GPU stopped processing, and the Node will be restarted. |
| `GPUUncorrectableDRAMError` | GPU Uncorrectable DRAM Error (Xid 171) provides complementary information to Xid 48. No action is needed. |
| `GPUUncorrectableSRAMError` | GPU Uncorrectable SRAM Error (Xid 172) provides complementary information to Xid 48. No action is needed. |
| `GPUVeryHot` | Triggers when a GPU's temperature exceeds 90°C. |
| `KubeNodeNotReady` | Indicates that a Node's status condition is not Ready in a Kubernetes cluster. This alert can be an indicator of critical system health issues. |
| `KubeNodeNotReadyHGX` | Indicates that a Node has been unready or offline for more than 15 minutes. |
| `ManyUCESingleBankH100` | Triggers when there are two or more DRAM Uncorrectable Errors (UCEs) on the same row remapper bank of an H100 GPU. |
| `MetalDevRedfishError` | Indicates an out-of-band action against a BMC failed. |
| `NVL72GPUHighFECCKS` | Indicates that a GPU is observing a high rate of forward error correction, indicating signal integrity issues. |
| `NVL72SwitchHighFECCKS` | Indicates that an NVSwitch is observing a high rate of forward error correction, indicating signal integrity issues. |
| `NVLinkDomainFullyTriaged` | Indicates the rack is entirely triaged. The rack should either be investigated for an unexpected rack-level event or returned to the fleet. |
| `NVLinkDomainProductionNodeCountLow` | Indicates the rack has fewer Nodes in a production state than expected. The rack will need manual intervention to either restore capacity or reclaim it for further triage. |
| `NodeBackendLinkFault` | Indicates that the backend bandwidth is degraded and the interface may potentially be lost. |
| `NodeBackendMisconnected` | Node-to-leaf ports are either missing or incorrectly connected. |
| `NodeCPUHZThrottleLong` | An extended period of CPU frequency throttling has occurred. CPU throttling most often occurs due to power delivery or thermal problems at the Node level. The Node will immediately and automatically be taken out of service and the job interrupted. |
| `NodeGPUNVLinkDown` | The Node is experiencing NVLink issues and will be automatically triaged. |
| `NodeGPUXID149NVSwitch` | A GPU has experienced a fatal NVLink error. The Node will be restarted to recover the GPU. |
| `NodeGPUXID149s4aLinkIssueFordPintoRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender, and automation will remove the Node from the cluster. |
| `NodeGPUXID149s4aLinkIssueLamboRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender, and automation will remove the Node from the cluster. |
| `NodeGPUXID149s4aLinkIssueNeedsUpgradeRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender, and automation will remove the Node from the cluster. |
| `NodeLoadAverageHigh` | Triggers when a Node's load average exceeds 1000 for more than 15 minutes. |
| `NodeMemoryError` | Indicates that a Node has one or more bad DIMM (memory) modules. |
| `NodeNetworkReceiveErrs` | Indicates that a network interface has encountered receive errors exceeding a 1% threshold over a 2-minute period for 1 hour. |
| `NodePCIErrorH100GPU` | Indicates that a GPU is experiencing PCI bus communication errors. |
| `NodePCIErrorH100PLX` | Indicates a high rate of PCIe bus errors occurring on the PLX switch that connects H100 GPUs. |
| `NodeRepeatUCE` | Indicates that a Node has experienced frequent GPU Uncorrectable ECC (UCE) errors. |
| `NodeVerificationFailureNVFabric` | The Node is experiencing NVLink issues and will be automatically triaged. |
| `NodeVerificationMegatronDeadlock` | HPC-Perftest failed the megatron_lm test due to a possible deadlock. The Node should be triaged. |
| `PendingStateExtendedTime` | Indicates that a Node has been in a pending state for an extended period of time. This alert helps identify Nodes that need to be removed from their current state but are stuck for an extended time. |
| `PendingStateExtendedTimeLowGpuUtil` | Triggers when a Node has been in a pending state for more than 10 days and has had less than 1% GPU utilization in the last hour. This alert helps indicate whether a Node needs to be removed from its current state but has been stuck for an extended time. |
| `UnknownNVMLErrorOnContainerStart` | Typically indicates that a GPU has fallen off the bus or is experiencing hardware issues. |