Node Pool Reference

Node Pools use the following API schema definitions:

NodePool

NodePool is the schema for the Node Pools API.

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `apiVersion` | string | The version of the Kubernetes API that the Node Pool uses. | `compute.coreweave.com/v1alpha1` |
| `kind` | string | The type of resource being defined, such as `NodePool` or `NodePoolList`. | |
| `metadata` | ListMeta | See the Kubernetes API documentation for `metadata`. | |
| `spec` | NodePoolSpec | The desired state of the Node Pool. | |
| `status` | NodePoolStatus | The observed state of the Node Pool. See NodePoolStatus. | |
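
For orientation, a minimal NodePool manifest combining these top-level fields might look like the following sketch. The name and spec values are illustrative placeholders, not defaults; the instance type is reused from the example output later on this page.

```yaml
# Minimal NodePool manifest (illustrative sketch; values are placeholders)
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: example-nodepool        # hypothetical name
spec:
  instanceType: gd-8xa100-i128  # see the example output below
  targetNodes: 1                # desired Node count
```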

NodePoolSpec

NodePoolSpec defines the desired state of a NodePool.

| Field | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| `computeClass` | enum (`default`) | The type of Node Pool. `default` is used for Reserved and On-Demand instances. | `default` | Optional |
| `instanceType` | string | Instance Type for the Nodes in this Node Pool. Note: `instanceType` is immutable and cannot be changed after creation. | N/A | Required |
| `targetNodes` | integer | The quantity of desired Nodes in the Node Pool. | N/A | Required |
| `minNodes` | integer | The minimum `targetNodes` allowed by the autoscaler. | N/A | Optional |
| `maxNodes` | integer | The maximum `targetNodes` allowed by the autoscaler. | N/A | Optional |
| `nodeLabels` | object (keys: string, values: string) | List of labels associated with the Nodes in the Node Pool. | N/A | Optional |
| `nodeAnnotations` | object (keys: string, values: string) | List of annotations associated with the Nodes in the Node Pool. | N/A | Optional |
| `nodeTaints` | Taint array | List of taints associated with the Nodes in the Node Pool. | N/A | Optional |
| `image` | Image | The image to use for the Nodes in the Node Pool. Note: `image` and `gpu` are mutually exclusive; if both are set, `gpu` will be ignored. For updates to take effect, perform a reconfigure reboot. | N/A | Optional |
| `gpu` | Gpu | GPU driver configuration for the Nodes in the Node Pool. Note: `gpu` and `image` are mutually exclusive; if both are set, `gpu` will be ignored. For updates to take effect, perform a reconfigure reboot. | N/A | Optional |
| `autoscaling` | boolean | Enable or disable cluster autoscaling. | `false` | Optional |
| `lifecycle.scaleDownStrategy` | enum (`IdleOnly`, `PreferIdle`) | How Nodes are removed when scaling down to reach `targetNodes`. `IdleOnly` removes only idle Nodes. `PreferIdle` removes idle Nodes first, then active Nodes if needed. | `IdleOnly` | Optional |
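
To show how the optional fields fit together, here is a hedged example spec with autoscaling enabled. All values (name, labels, taints, bounds) are placeholders chosen for illustration; the taint entries assume the standard Kubernetes key/value/effect shape.

```yaml
# Illustrative NodePool with autoscaling and scheduling metadata (placeholder values)
apiVersion: compute.coreweave.com/v1alpha1
kind: NodePool
metadata:
  name: autoscaled-pool          # hypothetical name
spec:
  instanceType: gd-8xa100-i128   # immutable after creation
  targetNodes: 2                 # starting desired count
  autoscaling: true              # enable cluster autoscaling for this pool
  minNodes: 1                    # autoscaler lower bound on targetNodes
  maxNodes: 4                    # autoscaler upper bound on targetNodes
  nodeLabels:
    team: research               # hypothetical label
  nodeTaints:
    - key: dedicated             # hypothetical taint (standard Kubernetes Taint shape)
      value: research
      effect: NoSchedule
  lifecycle:
    scaleDownStrategy: IdleOnly  # remove only idle Nodes when scaling down
```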

Image

Image defines what boot image the NodePoolSpec uses.

Warning

Image should be omitted from NodePoolSpec unless directed by CoreWeave support. The `image` and `gpu` fields are mutually exclusive; if both are set, `gpu` will be ignored and `image` will take precedence.

| Field | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| `kernel` | string | Kernel version for this Node Pool. | | Optional |
| `name` | string | Name of the image for this Node Pool. If `name` is set, `kernel` must be empty and `releaseTrain` must be `stable`. | | Optional |
| `releaseTrain` | string | The release channel or track for the image. | `stable` | Optional. Enum: [`stable`, `latest`] |
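
For reference only, and keeping the warning above in mind (omit `image` unless directed by CoreWeave support), an image override would take roughly this shape; the image name below is hypothetical.

```yaml
# Hypothetical image override (only when directed by CoreWeave support)
spec:
  image:
    name: example-image-name   # hypothetical; requires kernel unset and releaseTrain stable
    releaseTrain: stable
```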
NVSHMEM + GDRCopy support

If you need an image with NVSHMEM and GDRCopy support, you can request ncore-image v2.10.1: apply a patch to ibgda in your container, enable GDRCopy, then contact support to get access to the image.

See NVSHMEM and GDRCopy for more detailed information.

Gpu

The gpu field defines the GPU driver configuration for Nodes in the NodePoolSpec.

Warning

The `gpu` and `image` fields are mutually exclusive. If both are set, `gpu` will be ignored and `image` will take precedence.

| Field | Type | Description | Default | Validation |
| --- | --- | --- | --- | --- |
| `version` | string | The specific GPU driver version to use. Note that only the major version is specified. | | Optional |
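
A sketch of pinning the driver through this field; the value shown is a placeholder, since only the major version is specified.

```yaml
# Hypothetical GPU driver pin (gpu and image are mutually exclusive)
spec:
  gpu:
    version: "570"   # placeholder major version
```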

NodePoolStatus

NodePoolStatus is the observed state of a Node Pool.

| Field | Type | Description | Validation |
| --- | --- | --- | --- |
| `queuedNodes` | integer | Number of queued Nodes waiting to be assigned. | Optional, Minimum |
| `inProgress` | integer | Number of Nodes that have been assigned, but have not yet fully booted into the cluster. | Optional, Minimum |
| `currentNodes` | integer | Number of Nodes for this Node Pool present in the cluster. | Optional, Minimum |
| `nodeProfile` | string | The unique identifier for the active Node configuration. | Optional |
| `pendingNodeConfiguration` | object (`createdAt`: timestamp, `nodeProfile`: string, `summary`: string array) | Contains information about the staged Node configuration. This field is only present if NodePool spec changes were made that require a new Config, or if there are updates available, such as a new ncore image. | Optional |
| `nodeConfigurationRevisions` | nodeConfigurationRevision array | Maintains a list of Node configurations that were applied to the NodePool, along with data about each one. | Optional |
| `conditions` | Condition array | All conditions associated with the Node Pool. | Optional |
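
These status fields can be read with standard kubectl output options; for example, this sketch prints the Node counts for the example Node Pool used elsewhere on this page:

```
# Print current, in-progress, and queued Node counts for a Node Pool
kubectl get nodepools example-nodepool \
  -o jsonpath='{.status.currentNodes} {.status.inProgress} {.status.queuedNodes}'
```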

Pending Node Configuration

The pendingNodeConfiguration contains information about the configuration that is currently staged on the NodePool. This configuration will not become the active configuration without an explicit upgrade by the user.

This field is only present if there are configuration updates available for your NodePool. For example, modifying spec.image stages a new configuration with the desired image. Configurations can also be staged automatically by CKS if updates are available, such as a new ncore version, GPU driver version, or Kubernetes version. The summary field provides more information about the available updates. See Manage Node Pool Configuration for a guide on how to promote a pending configuration.

| Field | Type | Description |
| --- | --- | --- |
| `createdAt` | timestamp | When the Node configuration was created. |
| `nodeProfile` | string | The unique identifier for the configuration. |
| `summary` | string array | A summary of the features for the configuration, including the ncore version, GPU driver version, Kubernetes version, and so on. |
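
One way to check whether anything is staged is to query the field directly with jsonpath; the output is empty when no configuration is pending:

```
# Show the summary of the staged configuration, if one exists
kubectl get nodepools example-nodepool \
  -o jsonpath='{.status.pendingNodeConfiguration.summary}'
```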

Node Configuration Revisions

The revisions list holds the history of Node configurations that were applied to the NodePool, i.e., configurations that were at one point the status.nodeProfile.

The list is sorted by creation timestamp and can be used as a reference to roll back to a previous Node configuration if desired. See Manage Node Pool Configuration for a guide on how to roll back to a previous configuration.

| Field | Type | Description |
| --- | --- | --- |
| `activeNodes` | integer | The count of Nodes in the NodePool that are currently booted into this configuration. |
| `createdAt` | timestamp | The creation timestamp for the configuration. |
| `nodeProfile` | string | The unique identifier for the configuration. |
| `disabled` | boolean | Indicates if the configuration is blocked from being applied to more Nodes. A configuration gets disabled if it experiences successive boot failures. Contact support if this is `true`. |
| `summary` | string array | A summary of the features for the configuration, including the ncore version, GPU driver version, Kubernetes version, and so on. |
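
The revision history can be read the same way; this sketch prints each revision's identifier, creation time, and active Node count:

```
# List configuration revisions: profile, creation time, active Nodes
kubectl get nodepools example-nodepool \
  -o jsonpath='{range .status.nodeConfigurationRevisions[*]}{.nodeProfile}{"\t"}{.createdAt}{"\t"}{.activeNodes}{"\n"}{end}'
```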

Conditions

Node Conditions

CoreWeave sets the following conditions on Nodes in a Node Pool.

| Condition Name | Type | Description |
| --- | --- | --- |
| `CWActive` | bool | `true` if the Node is active. If `false` and the Node Pool is being scaled down, the Node may be selected for removal. |
| `CWRegistered` | bool | `true` if the Node is registered. Nodes are registered when they enter a customer cluster as part of the Node lifecycle. |
| `CWNodeRemoval` | bool | `true` if the Node is pending removal. |
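
To survey one of these conditions across Nodes, kubectl custom columns can pull a specific condition's status; for example:

```
# Show each Node's CWActive condition status
kubectl get nodes -o custom-columns='NAME:.metadata.name,CW_ACTIVE:.status.conditions[?(@.type=="CWActive")].status'
```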

Node Pool Conditions

CoreWeave sets the following conditions on a Node Pool after the Node Pool resource has been created.

Condition: Validated

This condition answers the question: "Is the Node Pool configuration valid?" It has three possible statuses:

| Status | Description |
| --- | --- |
| `Valid` | The Node Pool configuration is correct. |
| `Invalid` | The Node Pool configuration has errors, such as an unsupported instance type or incorrect Node affinity. |
| `InternalError` | A system error occurred during validation, so the Node Pool couldn't be fully checked. |
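
As the describe output later on this page shows, each status surfaces as the condition's reason, so a condition can be checked directly with jsonpath; swap the type to query AtTarget, Capacity, Quota, or the other conditions the same way:

```
# Print the reason for the Validated condition (Valid, Invalid, or InternalError)
kubectl get nodepools example-nodepool \
  -o jsonpath='{.status.conditions[?(@.type=="Validated")].reason}'
```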

Condition: AtTarget

This condition shows whether the Node Pool has the expected number of active Nodes. The AtTarget condition has five possible values:

| Status | Description |
| --- | --- |
| `TargetMet` | The Node Pool has exactly the number of Nodes specified in the target. |
| `PendingDelete` | The Node Pool is being deleted, and its Nodes will be removed. |
| `OverTarget` | More Nodes exist than the target. Extra Nodes will be removed using the `scaleDownStrategy`. |
| `UnderTarget` | Fewer Nodes exist than the target. New Nodes will be created if resources are available. |
| `InternalError` | A system error occurred while retrieving Node information for the Node Pool. |

Condition: NodesRemoved

The condition NodesRemoved is applied to a Node Pool when it is pending deletion and Nodes are in the process of being removed. Once all Nodes are removed, the Node Pool is deleted. The status indicates one of the following:

| Status | Meaning |
| --- | --- |
| `Complete` | All Nodes have been removed from the Node Pool, and the Node Pool's deletion is imminent. |
| `Pending` | Nodes are in the process of being removed from the Node Pool. |
| `InternalError` | An internal error has occurred while trying to remove Nodes from the Node Pool. |

Condition: Capacity

The Capacity condition indicates whether there is enough capacity available for the requested number of Nodes in the requested instance type. The status indicates one of the following:

| Status | Meaning |
| --- | --- |
| `Sufficient` | Sufficient capacity is available for the requested instance type. |
| `Partial` | Partial capacity exists in the designated Region, but not enough to completely fulfill the request. |
| `NoneAvailable` | No Nodes are available of the requested type in the given Region. |
| `NoneAvailableNodeAffinity` | No Nodes are available for the requested instance type due to Affinity constraints. |
| `PartialNodeAffinity` | Partial capacity is available to fulfill the requested `targetNodes`; full capacity is not available due to Affinity constraints. |
| `QueuedAwaitingCapacity` | The request for additional Nodes has been queued and is awaiting additional capacity to become available. |
| `InternalError` | An internal error has occurred while attempting to check capacity. |

Condition: Quota

The Quota condition has four possible statuses:

| Status | Meaning |
| --- | --- |
| `Under` | The organization is under quota for the Node Pool's instance type. |
| `Over` | The organization is over quota for the Node Pool's instance type. |
| `NotSet` | The organization does not have a quota set for the Node Pool's instance type. |
| `InternalError` | An internal error has occurred while attempting to check the quota. |

Condition: NodeReconfigurationRequired

This condition answers the question: "Do any of my Nodes need to be reconfigured to boot into the active config (status.nodeProfile)?" It has three possible statuses:

| Status | Description |
| --- | --- |
| `StagedNodeConfig` | The Node Pool has Nodes that need to be reconfigured to boot into the active config. |
| `AllNodesUpToDate` | All of the Nodes in the Node Pool are booted into the active config. |
| `InternalError` | A system error occurred while fetching the active configuration for Nodes. It's also possible there was a failure staging the Node Pool's active Node configuration onto outdated Nodes. See the condition message for more information. |

Healthy Node Pools

A new Node Pool in a healthy state looks like this when described:

```
kubectl describe nodepools example-nodepool
```

Example output:

```
Name:         example-nodepool
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  compute.coreweave.com/v1alpha1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2025-06-09T14:48:54Z
  Generation:          1
  Resource Version:    857370
  UID:                 9311678d-4064-45b8-8439-b943250e5852
Spec:
  Autoscaling:    false
  Instance Type:  gd-8xa100-i128
  Lifecycle:
    Disable Unhealthy Node Eviction:  true
    Scale Down Strategy:              IdleOnly
  Max Nodes:     0
  Min Nodes:     0
  Target Nodes:  1
Status:
  Conditions:
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool configuration is valid.
    Reason:                Valid
    Status:                True
    Type:                  Validated
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               Sufficient capacity available for the requested instance type.
    Reason:                Sufficient
    Status:                True
    Type:                  Capacity
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool is within instance type quota.
    Reason:                Under
    Status:                True
    Type:                  Quota
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               NodePool has reached its target node count.
    Reason:                TargetMet
    Status:                True
    Type:                  AtTarget
    Last Transition Time:  2025-05-30T17:36:00Z
    Message:               All nodes are on the current configuration.
    Reason:                AllNodesUpToDate
    Status:                False
    Type:                  NodeReconfigurationRequired
  Current Nodes:  1
  In Progress:    0
  Queued Nodes:   0
Events:           <none>
```

Events

| Event Name | Resource | Description |
| --- | --- | --- |
| `CWDrainNode` | NodePool | Fired when a Node is being drained. |
| `CWInstanceTypeNotInZone` | NodePool | Fired when a Node Pool has an instance type not in its Zone. |
| `CWInsufficientCapacity` | NodePool | Fired when there is not sufficient capacity for a Node Pool. |
| `CWInvalidInstanceType` | NodePool | Fired when a Node Pool contains an invalid instance type. |
| `CWNodeAssigned` | NodePool | Fired when a Node is assigned to a Node Pool. |
| `CWNodeDeleted` | NodePool | Fired to indicate a Node has been deleted. |
| `CWNodeDeletionRequestSuccess` | NodePool | Fired when the Node Pool Operator receives a request to delete a Node. |
| `CWNodeDeliverFail` | NodePool | Fired when Node allocation to a Node Pool fails due to misconfigured Node Pool settings or internal issues. |
| `CWNodePoolCreated` | NodePool | Fired when a Node Pool is created. |
| `CWNodePoolDeleted` | NodePool | Fired when a Node Pool is deleted. |
| `CWNodePoolDisabled` | NodePool | Fired when a Node Pool is disabled because the Node is misconfigured and is causing too many boot failures. Contact support to diagnose and resolve. |
| `CWNodePoolMetadata` | NodePool | Fired when metadata is updated for a Node Pool. |
| `CWNodePoolNodesRemoved` | NodePool | Fired when Nodes are removed from the Node Pool. |
| `CWNodePoolNodesRemoveError` | NodePool | Fired when an error occurs during Node removal. |
| `CWNodePoolNodesRequestFailed` | NodePool | Fired when an error is returned when updating the Node Pool. |
| `CWNodePoolQuotaCheckFailed` | NodePool | Fired when there is an internal error checking the quota. |
| `CWNodePoolRemoveNodes` | NodePool | Fired when attempting to scale down a Node Pool. |
| `CWNodePoolScaledDown` | NodePool | Fired when a Node Pool is scaled down. |
| `CWNodePoolScaledUp` | NodePool | Fired when a Node Pool is scaled up. |
| `CWNodeRegistered` | Node, NodePool | Fired when Node registration succeeds. |
| `CWNodeRegistrationFailed` | Node | Fired when Node registration fails. See the message for additional details. |
| `CWNodeRequestQueued` | NodePool | Fired when a request is submitted to add Nodes to a Node Pool. |
| `CWOverQuota` | NodePool | Fired when the quota is insufficient for the Node Pool's `targetNodes`. |
| `CWNodeCordoned` | Node, NodePool | Fired when a Node is being cordoned as part of internal automation. |
| `CWNodeUncordoned` | Node, NodePool | Fired when a Node is being uncordoned as part of internal automation. |
| `CWNodeMarkedPrepareForTerminate` | Node, NodePool | Fired when a Node has been sent the signal to prepare for termination. See the event message for details. |
| `CWNodeDraining` | Node, NodePool | Fired when a Node is being drained as part of internal automation. |
| `CWNodeDrainingPDBViolation` | Node, NodePool | Fired when a Node is being drained and certain pods cannot be evicted due to a Pod Disruption Budget. |
| `CWNodeDrainingForceDelete` | Node, NodePool | Fired when a Node is being drained as part of internal automation and pods were deleted ungracefully. |
| `CWNodePreparedForTerminate` | Node, NodePool | Fired when a Node is prepared for termination. |
| `CWNodeTainted` | Node, NodePool | Fired when a Node is tainted as part of internal automation. |
| `CWNodeUntainted` | Node, NodePool | Fired when a Node is untainted as part of internal automation. |
| `CWNodePoolNodeConfigUpdated` | NodePool | Fired when a NodePool's Node Config is updated. Nodes will need to be rebooted for the update to take effect. |
| `CWNodePoolNodeConfigUpdatePending` | NodePool | Fired when a Node Config update is pending due to Nodes being mid-delivery. The update will complete once the Nodes have booted into the cluster. |
| `CWNodePoolNodeConfigUpdated` | NodePool | Fired when the active Node Config (`status.nodeProfile`) for the NodePool has been updated. This might be the result of promoting the `pendingNodeConfiguration` or rolling back to a previous configuration. |
| `CWNodePoolNoPendingConfiguration` | NodePool | Fired when a user attempts to promote the `status.pendingNodeConfiguration` to become the active configuration (`status.nodeProfile`), but there is no staged configuration. |
| `CWNodePoolNodeConfigRollbackFailed` | NodePool | Fired when there was an error rolling back the active Node configuration for the NodePool to a previous config. |
| `CWNodePoolNodeConfigUpdateConflict` | NodePool | Fired when a user simultaneously attempts to upgrade to the pending Node Config and roll back to a previous one. |
| `CWNodeConfigStagingFailed` | Node | Fired when the Node Pool Operator attempted to stage the NodePool's active configuration on an existing Node in the cluster to prepare it for a reconfigure reboot, but was not able to do so. It will keep retrying until successful. |
| `CWManualUpgradeTrigger` | NodePool | Fired when a user upgrades a NodePool to the pending configuration. |
| `CWManualRollbackTrigger` | NodePool | Fired when a user rolls back a NodePool to a previous configuration. |
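
Because these events are recorded against the Node Pool (or Node) object, they appear in kubectl describe output and can also be filtered directly; for example:

```
# Events recorded against a specific Node Pool
kubectl get events --field-selector involvedObject.kind=NodePool,involvedObject.name=example-nodepool
```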

Node alerts

| Alert Name | Description |
| --- | --- |
| `BackendNodeFrameErrorOnLeaf` | An error rate between the node and the leaf switch on the backend network can lead to degraded performance. The node will be automatically taken out of service and replaced when free of workloads. |
| `BackendSlidingWindowBERLeaf` | An error rate between the node and the leaf switch on the backend network can lead to degraded performance. The node will be automatically taken out of service and replaced when free of workloads. |
| `DCGMSRAMThresholdExceeded` | Indicates that the SRAM threshold has been exceeded on a GPU. This indicates a memory issue and requires investigation by reliability teams. |
| `DCGMThrottleHWPowerBrake` | The GPU is receiving a power throttling signal from the motherboard, likely a power delivery issue. The node will be taken out of service for investigation after current workloads finish. |
| `DPUContainerdThreadExhaustion` | Indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| `DPUContainerdThreadExhaustionCPX` | Indicates that the containerd process has run out of threads on the DPU. This requires an update to the dpu-health container to patch. |
| `DPULinkFlappingCPX` | Indicates that a DPU (Data Processing Unit) link has become unstable. It specifically triggers when a link on a DPU flaps (goes up and down) multiple times within a monitoring period. |
| `DPUNetworkFrameErrs` | Indicates frame errors occurring on DPU (Data Processing Unit) network interfaces. These errors typically indicate a problem with the physical network link. |
| `DPURouteCountMismatch` | Indicates an inconsistency between the routes the DPU learns and the routes it has installed. A software component on the DPU will need to be restarted. |
| `DPURouteLoop` | Indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| `DPURouteLoopCPX` | Indicates that a route loop has been detected on the DPU. This can be caused by a miscabling issue in the data center. |
| `DPUUnexpectedPuntedRoutes` | Indicates a failure in offloading which can cause connectivity issues for the host. The node will be automatically reset to restore proper connectivity. |
| `DPUUnexpectedPuntedRoutesCPX` | Indicates a failure in offloading which can cause connectivity issues for the host. The issue typically occurs after a power reset (when the host reboots without the DPU rebooting). |
| `DPUUnexpectedPuntedRoutesNoReboot` | Indicates a failure in offloading which can cause connectivity issues for the host. The node will not be auto-rebooted and will need to be manually investigated. |
| `ECCDoubleVolatileErrors` | Indicates that DCGM double-bit volatile ECC (Error Correction Code) errors are increasing over a 5-minute period on a GPU. |
| `GPUContainedECCError` | GPU Contained ECC Error (Xid 94) indicates an uncorrectable memory error was encountered and contained. The workload has been impacted, but the node is generally healthy. No action needed. |
| `GPUECCUncorrectableErrorUncontained` | GPU Uncorrectable Error Uncontained (Xid 95) indicates an uncorrectable memory error was encountered but not successfully contained. The workload has been impacted and the node will be restarted. |
| `GPUFallenOffBus` | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
| `GPUFallenOffBusHGX` | GPU Fallen Off The Bus (Xid 79) indicates a fatal hardware error where the GPU shuts down and is completely inaccessible from the system. The node will immediately and automatically be taken out of service. |
| `GPUNVLinkSWDefinedError` | NVLink SW Defined Error (Xid 155): link-down events that are flagged as intentional trigger this Xid. The node will be reset. |
| `GPUPGraphicsEngineError` | GPU Graphics Engine Error (Xid 69) has impacted the workload, but the node is generally healthy. No action needed. |
| `GPURowRemapFailure` | GPU Row Remap Failure (Xid 64) is caused by an uncorrectable error resulting in a GPU memory remapping event that failed. The node will immediately and automatically be taken out of service. |
| `GPUTimeoutError` | GPU Timeout Error (Xid 46) indicates the GPU stopped processing; the node will be restarted. |
| `GPUUncorrectableDRAMError` | GPU Uncorrectable DRAM Error (Xid 171) provides complementary information to Xid 48. No action is needed. |
| `GPUUncorrectableSRAMError` | GPU Uncorrectable SRAM Error (Xid 172) provides complementary information to Xid 48. No action is needed. |
| `GPUVeryHot` | Triggers when a GPU's temperature exceeds 90°C. |
| `GPUXID149LinkIssueHGX` | A GPU has experienced a fatal loss of link across the internal NVLink domain. The node will be restarted to recover the GPU. |
| `KernelDeadlock` | A kernel deadlock has been detected on this node. This indicates a severe system issue where processes are hung. |
| `KernelHardlock` | A CPU hard lockup has been detected on this node via the NMI watchdog. |
| `KubeNodeNotReady` | Indicates that a node's status condition is not Ready in a Kubernetes cluster. This alert can be an indicator of critical system health issues. |
| `KubeNodeNotReadyHGX` | Indicates that a node has been unready or offline for more than 15 minutes. |
| `ManyUCESingleBankH100` | Triggers when there are two or more DRAM Uncorrectable Errors (UCEs) on the same row remapper bank of an H100 GPU. |
| `MetalDevRedfishError` | Indicates an out-of-band action against a BMC failed. |
| `NVL72GPUHighFECCKS` | Indicates that a GPU is observing a high rate of forward error correction, indicating signal integrity issues. |
| `NVL72SwitchHighFECCKS` | Indicates that an NVSwitch is observing a high rate of forward error correction, indicating signal integrity issues. |
| `NVLinkDomainDriverVersionMismatch` | The node has an outdated NVIDIA driver major/minor version compared to other nodes in its NVLink domain. The node will be rebooted to upgrade to the latest driver version. |
| `NVLinkDomainFullyTriaged` | Indicates the rack is entirely triaged. This rack should either be investigated for an unexpected rack-level event or returned to the fleet. |
| `NVLinkDomainProductionNodeCountLow` | Indicates the rack has fewer nodes in a production state than expected. This rack will need manual intervention to either restore capacity or reclaim for further triage. |
| `NVLinkMaskError` | GPUs are reporting link mask errors. This may indicate NVLink connectivity issues that could affect GPU communication. |
| `NVLinkXIDFatal` | A fatal NVSwitch XID error has been detected. XIDs 144-150 indicate NVLink hardware failures that prevent proper GPU communication. |
| `NodeBackendLinkFault` | Indicates that the backend bandwidth is degraded and the interface may potentially be lost. |
| `NodeBackendMisconnected` | Node-to-leaf ports are either missing or incorrectly connected. |
| `NodeCPUHZThrottleLong` | An extended period of CPU frequency throttling has occurred. CPU throttling most often occurs due to power-delivery or thermal problems at the node level. The node will immediately and automatically be taken out of service and the job interrupted. |
| `NodeGPUNVLBWLossLinkRecovery` | Bandwidth degradation detected on NVLink due to frequent recovery events. |
| `NodeGPUNVLBWLossRetransmitNode` | Bandwidth degradation detected on NVLink due to high transmit retry rates. |
| `NodeGPUNVLBWLossRetransmitSwitch` | Bandwidth degradation detected on NVLink due to high transmit retry rates. |
| `NodeGPUNVLinkDown` | The node is experiencing NVLink issues and will be automatically triaged. |
| `NodeGPUXID149NVSwitch` | A GPU has experienced a fatal NVLink error. The node will be restarted to recover the GPU. |
| `NodeGPUXID149s4aLinkIssueFordPintoRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| `NodeGPUXID149s4aLinkIssueLamboRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| `NodeGPUXID149s4aLinkIssueNeedsUpgradeRepeated` | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| `NodeGPUXID149s4aLinkIssueRepeatedGB300` | A GPU has experienced a fatal NVLink error. This is a frequent offender and automation will remove the node from the cluster. |
| `NodeLoadAverageHigh` | Triggers when a node's load average exceeds 1000 for more than 15 minutes. |
| `NodeMemoryError` | Indicates that a node has one or more bad DIMM (memory) modules. |
| `NodeNetworkReceiveErrs` | Indicates that a network interface has encountered receive errors exceeding a 1% threshold over a 2-minute period for 1 hour. |
| `NodePCIErrorH100GPU` | Indicates a GPU is experiencing PCI bus communication errors. |
| `NodePCIErrorH100PLX` | Indicates a high rate of PCIe bus errors occurring on the PLX switch that connects H100 GPUs. |
| `NodeRepeatUCE` | Indicates that a node has experienced frequent GPU Uncorrectable ECC (UCE) errors. |
| `NodeVerificationFailureNVFabric` | The node is experiencing NVLink issues and will be automatically triaged. |
| `NodeVerificationMegatronDeadlock` | HPC-Perftest failed the megatron_lm test due to a possible deadlock. The node should be triaged. |
| `PendingStateExtendedTime` | Indicates that a node has been in a pending state for an extended period of time. This alert helps identify nodes that need to be removed from their current state but are stuck for an extended time. |
| `PendingStateExtendedTimeLowGpuUtil` | Triggers when a node has been in a pending state for more than 10 days and has had less than 1% GPU utilization in the last hour. This alert helps indicate if a node needs to be removed from its current state but has been stuck for an extended time. |
| `PersistentStorageFault` | CephFS quota errors have been detected on this node. This may indicate issues with persistent storage mounts. |
| `ReadonlyFilesystem` | A filesystem on this node has been remounted as read-only. |
| `Sector0LocalDiskErrors` | I/O errors have been detected at sector 0 of the local NVMe storage. |
| `SuspectedLocalDiskErrors` | I/O errors have been detected on local NVMe storage at non-sector-0 locations. This may indicate a failing disk. |
| `UnknownNVMLErrorOnContainerStart` | Typically indicates that a GPU has fallen off the bus or is experiencing hardware issues. |