Node cordoning - CoreWeave Docs

This guide explains why Nodes in CKS become cordoned, what this means, and how you should respond.

What is Node cordoning?

In Kubernetes, a cordoned Node prevents new workloads from scheduling onto it. Existing workloads on the Node continue to run until you explicitly drain them or they complete. Cordoning typically indicates maintenance, troubleshooting, or reliability actions. CoreWeave cordons Nodes proactively to ensure reliability, maintain cluster health, and manage maintenance activities with minimal impact on workloads.
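In the Kubernetes API, a cordoned Node has `spec.unschedulable` set to `true`, and `kubectl get nodes` reports `SchedulingDisabled` in its STATUS column. As a minimal sketch, the helper below lists the cordoned Nodes in a cluster. The function name `list_cordoned_nodes` is illustrative and not part of CKS.

```shell
# List every Node that is currently cordoned (spec.unschedulable=true).
# list_cordoned_nodes is a hypothetical helper name used for illustration.
list_cordoned_nodes() {
  kubectl get nodes --field-selector spec.unschedulable=true
}
```

Call `list_cordoned_nodes` from a shell that already has access to your cluster.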

Reasons for Node cordoning

Nodes in CKS may be cordoned for several reasons:

Maintenance and updates

CoreWeave cordons Nodes that require hardware maintenance, kernel updates, or hardware replacement so that no new workloads are scheduled onto them. Maintenance can then proceed without disrupting workloads that are already running.

Health monitoring

CoreWeave monitors Node health and performance. If a Node exhibits issues that could affect workload reliability, CoreWeave may cordon it to prevent new workloads from being scheduled. These issues can include:

Hardware issues: GPU hardware errors trigger automatic Node cordoning and rebooting.
Kernel deadlocks: Severe kernel issues may also cause Nodes to reboot and remain cordoned until resolved.
InfiniBand or Ethernet link flaps: CoreWeave cordons Nodes with unstable network connections (intermittent connectivity) so workloads aren’t placed on unreliable Nodes.
Connectivity failures: If Kubernetes detects critical network issues, CoreWeave cordons the affected Nodes until stability is restored.

Time synchronization

CKS Nodes synchronize time to CoreWeave’s internal time infrastructure. CoreWeave monitors time on all Nodes, and a Node whose clock drifts beyond acceptable bounds may be cordoned for triage. Running any competing time daemon (including chrony, ntpd, and systemd-timesyncd) directly on a Node or in a container may cause clock drift and trigger this behavior. Overriding the time servers on CKS Nodes is not supported. Containers that include a time daemon must be configured to disable the daemon, prevent it from altering the kernel clock, or both.

Temporary issues

Temporary issues detected through Kubernetes health checks can lead to short-term Node cordoning. These usually resolve automatically, and Nodes become schedulable again without intervention.

Scheduled maintenance

CoreWeave intentionally cordons Nodes during scheduled maintenance.

Lifecycle management

CoreWeave manages Node lifecycle events, such as provisioning, decommissioning, reboots, and upgrades. Nodes may be cordoned temporarily while they move between operational states, which keeps the cluster stable and reliable during the transition.

Check why a Node is cordoned

When CoreWeave cordons a Node, it records the reason in the node.coreweave.cloud/cordonReason annotation. Check this annotation to understand whether the cordon is an expected maintenance action or a possible sign of poor Node health. Read the cordon reason for a Node:

kubectl get node [NODE-NAME] -o jsonpath='{.metadata.annotations.node\.coreweave\.cloud/cordonReason}'

Replace [NODE-NAME] with the name of your Node. For the Node’s full status, including whether it is unschedulable and its current conditions, use describe:

kubectl describe node [NODE-NAME]
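To review several Nodes at once, the same annotation can be printed as a column. The following is a sketch that assumes the annotation name given above. The helper name `cordon_overview` and the column headers are illustrative, and Nodes without the annotation show `<none>` in that column.

```shell
# Print each Node's name, unschedulable flag, and CoreWeave cordon reason.
# cordon_overview is a hypothetical helper name used for illustration.
# Dots inside the annotation key are escaped so they are not read as path separators.
cordon_overview() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable,CORDON_REASON:.metadata.annotations.node\.coreweave\.cloud/cordonReason'
}
```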

For lifecycle reasons such as NLCCPendingExitProduction, the matching Node condition message names the state the Node is moving to and the CoreWeave engineer who initiated the transition:

Example condition

NLCCPendingExitProduction   True   Reason: triage
Message: PendingPhaseState transition out of production to triage by user <username>
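Node conditions are stored under `.status.conditions` in the Node object, so they can also be printed one per line without the rest of the `describe` output. This is a sketch for reading conditions by hand. The helper name `show_node_conditions` is hypothetical.

```shell
# Print one line per Node condition: type, status, reason, message (tab-separated).
# show_node_conditions is a hypothetical helper; pass the Node name as $1.
show_node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

For example, `show_node_conditions [NODE-NAME]` prints every condition on that Node, including the standard Kubernetes conditions such as `Ready`.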

Common cordon and drain reasons

The following table lists the cordon reasons you are most likely to see on a CKS Node, what each one means, and whether it represents an expected maintenance action or a possible health issue.

Cordon reason	What it means	Maintenance or health signal	What to do
`NLCCPendingExitProduction`	CoreWeave’s Node lifecycle controller is removing the Node from production for a planned reason, such as maintenance, firmware testing, hardware investigation, a reboot, or a move to triage. The Node is cordoned, but existing workloads keep running until the Node is idle.	Expected. This is a planned lifecycle action, not a Node fault.	No action is required. Let your workloads finish, or move them when you are ready. Don’t uncordon the Node, because CoreWeave automation reapplies the cordon. Once the Node is idle, it leaves the cluster and CoreWeave delivers a replacement.
`NeedsTriage`	CoreWeave health monitoring detected a possible hardware or health issue and flagged the Node for investigation. The Node receives a `NoSchedule` taint so that new workloads aren’t scheduled onto it.	Health signal. CoreWeave investigates the Node and, if needed, remediates or replaces it.	Plan to move affected workloads off the Node. Once the Node is idle, CoreWeave moves it to triage and delivers a replacement. If a workload is failing on the Node, contact support with the Node name.
`AdminTemporaryFailure`	A manual cordon. A CoreWeave engineer may set it to isolate a Node for a short-term reason, or you may set it yourself. It prevents new scheduling but doesn’t, on its own, start a CoreWeave investigation.	Depends on who set it. It marks a manual action rather than an automatic health verdict.	If you didn’t set it and you are unsure why it is present, contact support with the Node name. If you set it to flag a problem, also notify CoreWeave so the Node is investigated.
Manual cordon (`SchedulingDisabled` with no CoreWeave cordon reason)	Someone ran `kubectl cordon` on the Node. The Node has no `node.coreweave.cloud/cordonReason` annotation.	Your own action, not a CoreWeave maintenance event.	Uncordon the Node with `kubectl uncordon [NODE-NAME]` when you are ready. A plain `kubectl cordon` can be cleared by CoreWeave automation, so coordinate any long-term isolation with support.

A cordon by itself prevents new workloads from scheduling onto a Node. It does not evict the workloads already running there. For most maintenance reasons, CoreWeave waits for the Node to become idle before continuing, which lets your workloads finish on their own schedule.
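The table can be condensed into a small lookup for use during manual triage. The sketch below maps a cordon reason string to a short label. The function name `classify_cordon_reason` and the label names are assumptions made for illustration, and only the three reasons listed in the table are recognized.

```shell
# Map a cordon reason to the kind of signal described in the table.
# classify_cordon_reason and its labels are hypothetical, not CoreWeave-defined.
classify_cordon_reason() {
  case "$1" in
    NLCCPendingExitProduction) echo "planned-lifecycle" ;;     # expected, no action required
    NeedsTriage)               echo "health-signal" ;;         # plan to move workloads
    AdminTemporaryFailure)     echo "manual-with-reason" ;;    # depends on who set it
    "")                        echo "manual-no-annotation" ;;  # plain kubectl cordon
    *)                         echo "unknown" ;;               # not covered by the table
  esac
}
```

Pass it the output of the annotation lookup from "Check why a Node is cordoned". An empty reason is only meaningful for a Node that is actually cordoned, so check the Node's schedulable state first.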

What to do when a Node is cordoned

If you notice that one of your Nodes is cordoned, the following actions can help you assess the situation and decide whether to intervene.

Check Node status

Confirm the Node state through your Grafana dashboards or Kubernetes tools to verify the reason for cordoning. If workloads remain running and stable, you don’t need to take immediate action.

Monitor workloads

Ensure your workloads have redundancy or are configured for high availability to tolerate temporary Node unavailability. If workloads become impacted, evaluate whether manual intervention is necessary, such as deleting affected Pods to trigger rescheduling.
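To see which workloads a cordoned Node is still running, you can list the Pods scheduled on it. This is a minimal sketch, and the helper name `pods_on_node` is hypothetical.

```shell
# List all Pods scheduled on a given Node, across all namespaces.
# pods_on_node is a hypothetical helper; pass the Node name as $1.
pods_on_node() {
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}
```

If you decide to move a Pod, `kubectl delete pod [POD-NAME] -n [NAMESPACE]` removes it. A Pod managed by a controller such as a Deployment is recreated, and because the cordoned Node accepts no new Pods, the replacement is scheduled elsewhere. A Pod without a controller is not recreated.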

Persistent issues

If a Node remains cordoned for an extended period or your workload is severely impacted, open a support ticket to inform CoreWeave about the issue. Clearly document your observations, such as the Node status and workload behavior, in your support request.
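One way to capture your observations is to save the Node description and its running Pods to a file before opening the ticket. The sketch below is an illustration only. The helper name `collect_node_report` and the report layout are assumptions, not a format that CoreWeave support requires.

```shell
# Save a Node's description and the Pods running on it to a file.
# collect_node_report is a hypothetical helper:
#   $1 = Node name, $2 = output file path.
collect_node_report() {
  {
    echo "Node: $1"
    kubectl describe node "$1"
    kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
  } > "$2"
}
```

For example, `collect_node_report [NODE-NAME] node-report.txt` writes the report to `node-report.txt` in the current directory.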

Important considerations

Keep the following considerations in mind when working with cordoned Nodes:

Don’t rely on Node conditions defined by CoreWeave for automation: The Node conditions that CoreWeave sets are intended solely for internal CoreWeave use. Don’t modify them or rely on them for your own automation or management. You can still add your own custom Node conditions.
Cordoning does not always indicate a serious or permanent Node issue: Nodes often recover automatically after temporary issues. If the issue is severe or persistent, CoreWeave proactively moves affected Nodes out of production into triage.

Support

If you need assistance or have questions regarding Node cordoning, contact CoreWeave support.
