This guide explains why Nodes in CKS become cordoned, what cordoning means for your workloads, and how you should respond.

What is Node cordoning?

In Kubernetes, a cordoned Node prevents new workloads from scheduling onto it. Existing workloads on the Node continue to run until you explicitly drain them or they complete. Cordoning typically indicates maintenance, troubleshooting, or reliability actions. CoreWeave cordons Nodes proactively to ensure reliability, maintain cluster health, and manage maintenance activities with minimal impact on workloads.
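As a rough illustration of what "cordoned" means at the API level: Kubernetes records cordoning in the Node's `spec.unschedulable` field. The minimal sketch below assumes you have saved the output of `kubectl get nodes -o json` and want to list the cordoned Nodes. The sample data, Node names, and function names are hypothetical and only stand in for real cluster output.

```python
import json

# Hypothetical sample shaped like the output of `kubectl get nodes -o json`,
# trimmed to the fields this sketch reads. The Node names are made up.
SAMPLE_NODES = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"}, "spec": {}},
    {"metadata": {"name": "node-b"}, "spec": {"unschedulable": true}}
  ]
}
""")


def is_cordoned(node: dict) -> bool:
    """Return True if the Node is marked unschedulable (cordoned)."""
    # Kubernetes omits `unschedulable` when it is false, so default to False.
    return bool(node.get("spec", {}).get("unschedulable", False))


def cordoned_node_names(node_list: dict) -> list[str]:
    """Return the names of all cordoned Nodes, in input order."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if is_cordoned(node)
    ]


if __name__ == "__main__":
    print(cordoned_node_names(SAMPLE_NODES))  # prints ['node-b']
```

Against a real cluster, you could replace `SAMPLE_NODES` with `json.load(sys.stdin)` and pipe the `kubectl` output into the script.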

Reasons for Node cordoning

Nodes in CKS may be cordoned for several reasons:

Maintenance and updates

CoreWeave cordons Nodes that require hardware maintenance, kernel updates, or hardware replacements so that no new workloads are scheduled onto them. Maintenance can then proceed without disrupting workloads that are already running.

Health monitoring

CoreWeave monitors Node health and performance. If a Node exhibits issues that could affect workload reliability, CoreWeave may cordon it to prevent new workloads from being scheduled. These issues can include:
  • Hardware issues: GPU hardware errors trigger automatic Node cordoning and rebooting.
  • Kernel deadlocks: Severe kernel issues may also cause Nodes to reboot and remain cordoned until resolved.
  • InfiniBand or Ethernet link flaps: CoreWeave cordons Nodes with unstable network connections (intermittent connectivity) so workloads aren’t placed on unreliable Nodes.
  • Connectivity failures: If Kubernetes detects critical network issues, CoreWeave cordons the affected Nodes until stability is restored.

Temporary issues

Temporary issues detected through Kubernetes health checks can lead to short-term Node cordoning. These usually resolve automatically, and Nodes become schedulable again without intervention.

Scheduled maintenance

CoreWeave intentionally cordons Nodes during scheduled maintenance.

Lifecycle management

CoreWeave manages Node lifecycle events, such as provisioning, decommissioning, reboots, and upgrades. Nodes may be cordoned temporarily while they transition between operational states, which keeps the cluster stable and reliable during these transitions.

What to do when a Node is cordoned

If you notice that one of your Nodes is cordoned, the following actions can help you assess the situation and decide whether to intervene.

Check Node status

Confirm the Node state through your Grafana dashboards or Kubernetes tools to verify the reason for cordoning. If workloads remain running and stable, you don’t need to take immediate action.
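If you prefer a quick scripted look over reading raw output, the sketch below condenses a single Node object (one entry from `kubectl get nodes -o json`) into the few fields that are usually relevant here: whether it is cordoned, its `Ready` condition, and its taints. The example Node and the `summarize_node` helper are hypothetical. The summary is meant for a person to read, not as input for automation.

```python
def summarize_node(node: dict) -> dict:
    """Condense a Node object into a small, human-readable status summary."""
    spec = node.get("spec", {})
    conditions = node.get("status", {}).get("conditions", [])

    # The standard Ready condition has status "True", "False", or "Unknown".
    ready = next(
        (c.get("status", "Unknown") for c in conditions if c.get("type") == "Ready"),
        "Unknown",
    )

    # Cordoned Nodes normally carry the node.kubernetes.io/unschedulable taint.
    taints = [f'{t["key"]}:{t.get("effect", "")}' for t in spec.get("taints", [])]

    return {
        "name": node["metadata"]["name"],
        "cordoned": bool(spec.get("unschedulable", False)),
        "ready": ready,
        "taints": taints,
    }


# Hypothetical Node, trimmed to the fields the summary reads.
EXAMPLE_NODE = {
    "metadata": {"name": "node-b"},
    "spec": {
        "unschedulable": True,
        "taints": [
            {"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"}
        ],
    },
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}

if __name__ == "__main__":
    for field, value in summarize_node(EXAMPLE_NODE).items():
        print(f"{field}: {value}")
```

A Node that is cordoned but still reports `Ready` as `"True"` matches the case described above: existing workloads keep running and no immediate action is needed.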

Monitor workloads

Ensure your workloads have redundancy or are configured for high availability so they can tolerate temporary Node unavailability. If workloads are affected, evaluate whether manual intervention is necessary, such as deleting the affected Pods to trigger rescheduling.
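One way to find candidates for manual intervention is to cross-reference Pods against the cordoned Nodes. The sketch below assumes a Pod list shaped like `kubectl get pods -A -o json` and a set of cordoned Node names you have already collected. It flags Pods on those Nodes whose phase is neither `Running` nor `Succeeded`. The sample data and the helper name are hypothetical, and Pod phase is only a coarse signal: a Pod can report `Running` while its containers are failing.

```python
HEALTHY_PHASES = {"Running", "Succeeded"}


def pods_needing_attention(pod_list: dict, cordoned: set[str]) -> list[dict]:
    """Return Pods on cordoned Nodes whose phase suggests a problem."""
    flagged = []
    for pod in pod_list.get("items", []):
        node_name = pod.get("spec", {}).get("nodeName")
        phase = pod.get("status", {}).get("phase", "Unknown")
        if node_name in cordoned and phase not in HEALTHY_PHASES:
            meta = pod["metadata"]
            flagged.append({
                "namespace": meta.get("namespace", "default"),
                "name": meta["name"],
                "node": node_name,
                "phase": phase,
                # A Pod without owner references is generally not recreated
                # after deletion, so check this before deleting anything.
                "has_controller": bool(meta.get("ownerReferences")),
            })
    return flagged


# Hypothetical Pods, trimmed to the fields this sketch reads.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "trainer-0", "namespace": "ml",
                         "ownerReferences": [{"kind": "StatefulSet", "name": "trainer"}]},
            "spec": {"nodeName": "node-b"},
            "status": {"phase": "Failed"},
        },
        {
            "metadata": {"name": "api-7d9f", "namespace": "web"},
            "spec": {"nodeName": "node-b"},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "batch-1", "namespace": "ml"},
            "spec": {"nodeName": "node-a"},
            "status": {"phase": "Failed"},
        },
    ]
}

if __name__ == "__main__":
    for pod in pods_needing_attention(SAMPLE_PODS, {"node-b"}):
        print(pod)
```

Only `trainer-0` is flagged in this sample: `api-7d9f` is still running, and `batch-1` is on a Node that is not cordoned. The sketch only reports candidates. Deleting a Pod remains a deliberate, manual decision.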

Persistent issues

If a Node remains cordoned for an extended period or your workload is severely impacted, open a support ticket to inform CoreWeave about the issue. Clearly document your observations, such as the Node status and workload behavior, in your support request.

Important considerations

Keep the following considerations in mind when working with cordoned Nodes:
  • Don’t rely on Node conditions defined by CoreWeave for automation: The Node conditions that CoreWeave sets are intended solely for internal CoreWeave use. Don’t modify them or rely on them for your own automation or management. You can still add your own custom Node conditions.
  • Cordoning does not always indicate a serious or permanent Node issue: Nodes often recover automatically after temporary issues. If the issue is severe or persistent, CoreWeave proactively moves affected Nodes out of production into triage.

Support

If you need assistance or have questions regarding Node cordoning, contact CoreWeave support.
Last modified on June 10, 2026