Node Cordoning
Understanding Node cordoning in CoreWeave Kubernetes Service (CKS)
This guide explains why Nodes in CKS become cordoned, what this means, and how you should respond.
What is Node cordoning?
In Kubernetes, a cordoned Node is marked unschedulable, so no new workloads are scheduled onto it. Existing workloads on the Node continue to run until they complete or the Node is explicitly drained. Cordoning typically indicates maintenance, troubleshooting, or reliability actions.
CoreWeave cordons Nodes proactively to ensure reliability, maintain cluster health, and manage maintenance activities with minimal impact on workloads.
Reasons for Node cordoning
Nodes in CKS may be cordoned for various reasons:
Maintenance and updates
Nodes that need hardware maintenance or replacement, or kernel updates, are cordoned so that no new workloads are scheduled onto them. This lets maintenance proceed without disrupting running workloads.
Health monitoring
CoreWeave monitors Node health and performance. If a Node exhibits issues that could affect workload reliability, it may be cordoned to prevent new workloads from being scheduled. These issues can include:
- Hardware issues: GPU hardware errors trigger automatic Node cordoning and rebooting.
- Kernel deadlocks: Severe kernel issues may also cause Nodes to reboot and remain cordoned until resolved.
- InfiniBand or Ethernet link flaps: Nodes with unstable network connections (intermittent connectivity) are cordoned to ensure workloads are not placed on unreliable Nodes.
- Connectivity failures: If Kubernetes detects critical network issues, the affected Nodes are cordoned until stability is restored.
Temporary issues
Temporary issues detected via Kubernetes health checks can lead to short-term Node cordoning. These are usually resolved automatically, and Nodes become schedulable again without intervention.
Scheduled maintenance
Nodes are intentionally cordoned during scheduled maintenance.
Lifecycle management
CoreWeave manages Node lifecycle events, such as provisioning, decommissioning, reboots, and upgrades. Nodes may be cordoned temporarily while they transition between operational states, which keeps the cluster stable and reliable during these events.
What to do when a Node is cordoned
When you observe that a Node is cordoned, consider the following actions:
Check Node status
Confirm the Node state via your Grafana dashboards or Kubernetes tools to verify the reason for cordoning. If workloads remain running and stable, no immediate action is required.
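As a minimal sketch of the "Kubernetes tools" check, the Python snippet below lists cordoned Nodes by reading `spec.unschedulable`, the standard Kubernetes field that is set when a Node is cordoned. It assumes `kubectl` is installed and configured for your CKS cluster. The function names are illustrative and not part of any CoreWeave tooling. The snippet reports only whether a Node is cordoned, not why.

```python
import json
import subprocess


def cordoned_node_names(node_list: dict) -> list[str]:
    """Return the sorted names of Nodes marked unschedulable (cordoned)."""
    return sorted(
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if node.get("spec", {}).get("unschedulable", False)
    )


def fetch_node_list() -> dict:
    """Fetch all Nodes as JSON. Assumes kubectl is on PATH and configured."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Offline example using hand-written data in the same shape as the
# `kubectl get nodes -o json` output (Node names are made up).
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "node-b"}, "spec": {}},
    ]
}
print(cordoned_node_names(sample_nodes))
```

Against a live cluster, `cordoned_node_names(fetch_node_list())` would return the same kind of list. The parsing is kept separate from the `kubectl` call so it can be checked without cluster access.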
Monitor workloads
Ensure your workloads have redundancy or are configured for high availability so they can tolerate temporary Node unavailability. If workloads are impacted, evaluate whether manual intervention is necessary, such as deleting affected Pods so that their controller (for example, a Deployment) recreates them on another Node.
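To see which workloads currently sit on cordoned Nodes, one option is to group Pods by the Node they run on. The sketch below assumes Node and Pod lists in the JSON shape returned by `kubectl get nodes -o json` and `kubectl get pods --all-namespaces -o json`. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def pods_on_cordoned_nodes(node_list: dict, pod_list: dict) -> dict[str, list[str]]:
    """Map each cordoned Node name to the 'namespace/name' of Pods assigned to it.

    Cordoned Nodes with no Pods are omitted from the result.
    """
    cordoned = {
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if node.get("spec", {}).get("unschedulable", False)
    }
    affected = defaultdict(list)
    for pod in pod_list.get("items", []):
        # spec.nodeName is unset for Pods that have not been scheduled yet.
        node_name = pod.get("spec", {}).get("nodeName")
        if node_name in cordoned:
            meta = pod["metadata"]
            affected[node_name].append(
                f"{meta.get('namespace', 'default')}/{meta['name']}"
            )
    return {node: sorted(pods) for node, pods in affected.items()}


# Offline example with made-up Node and Pod names.
sample_nodes = {
    "items": [
        {"metadata": {"name": "node-a"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "node-b"}, "spec": {}},
    ]
}
sample_pods = {
    "items": [
        {"metadata": {"namespace": "ns1", "name": "web-1"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"namespace": "ns1", "name": "web-2"}, "spec": {"nodeName": "node-b"}},
        {"metadata": {"namespace": "ns2", "name": "job-1"}, "spec": {"nodeName": "node-a"}},
        {"metadata": {"namespace": "ns2", "name": "pending-1"}, "spec": {}},
    ]
}
print(pods_on_cordoned_nodes(sample_nodes, sample_pods))
```

If every replica of a workload appears under cordoned Nodes, that workload has no capacity elsewhere and is the most exposed to a later drain or reboot.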
Persistent issues
If a Node remains cordoned for an extended period, or if your workload is severely impacted, open a support ticket to inform CoreWeave about the issue. Clearly document your observations, such as the Node status and workload behavior, in your support request.
Important considerations
- Avoid using Node conditions for automation: Node conditions are intended solely for internal CoreWeave use. Do not rely on them for custom automation or management.
- Cordoning does not always indicate a serious or permanent Node issue: Nodes often recover automatically after temporary issues. If the issue is severe or persistent, CoreWeave proactively moves affected Nodes out of production into triage.
Support
If you need assistance or have questions regarding Node cordoning, please contact CoreWeave support.