Node Pools
Manage groups of Nodes as a single entity with Node Pools
In CKS, hardware resources are represented as Kubernetes Nodes. Users manage groups of these Nodes as a single, customizable entity called a Node Pool.
A Node Pool is a logical grouping of Nodes of the same instance type, featuring the same labels, taints, and annotations. Using Node Pools, users can set the type and number of Nodes desired in a cluster. Once a Node Pool resource is deployed to a cluster, CKS continuously monitors the Node Pool to ensure that the number of running Nodes matches the number specified in the Node Pool manifest.
It is possible to deploy multiple Node Pools within a single cluster, where each Node Pool may contain any number of Nodes.
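The sketch below shows what a Node Pool manifest might look like. It is illustrative only: the apiVersion, kind, field names, pool name, instance type, and labels/taints are assumptions made for this example, not the authoritative CKS schema; consult the CKS API reference for the actual fields.

```yaml
# Illustrative sketch of a Node Pool manifest.
# apiVersion, kind, and all field names are assumptions, not the official schema.
apiVersion: compute.coreweave.com/v1alpha1   # assumed API group/version
kind: NodePool
metadata:
  name: gpu-workers                          # hypothetical pool name
spec:
  instanceType: example-gpu-instance         # hypothetical instance type identifier
  targetNodes: 4                             # desired Node count; CKS reconciles toward this number
  nodeLabels:
    workload: training                       # applied to every Node in the pool
  nodeTaints:
    - key: dedicated
      value: training
      effect: NoSchedule                     # keeps untolerated workloads off these Nodes
```

Once a manifest like this is applied to the cluster (for example, with kubectl apply -f), CKS continuously reconciles the number of running Nodes toward the count declared in the manifest.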
Nodes are currently added to clusters manually by CoreWeave. Please contact support for assistance.
Control Plane Node Pool
Each cluster is provisioned with an initial Node Pool of two CPU Nodes, which run Kubernetes Control Plane components such as the CSI, the CNI, cluster DNS, and metrics. This Node Pool, called cpu-control-plane, is created automatically when a CKS cluster is created, and appears in the Node Pool list on the Cloud Console once the cluster is in a Healthy state.
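Because every Node in a pool carries the same labels and taints, standard Kubernetes scheduling controls (nodeSelector, tolerations, and resource requests) can be used to keep workloads on the intended pool and off the CPU-only control-plane Nodes. A minimal sketch follows, assuming the hypothetical labels and taints from the Node Pool example above; substitute the values actually configured on your pools.

```yaml
# Minimal sketch: steering a Pod onto a hypothetical GPU Node Pool.
# The label, taint, and image are placeholders, not CKS-defined names.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    workload: training            # matches the label assumed on the GPU pool above
  tolerations:
    - key: dedicated
      value: training
      effect: NoSchedule          # tolerates the pool's taint so the Pod can land there
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8       # GPU request; CPU-only control-plane Nodes cannot satisfy it
```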
Node cordoning
In CKS, Nodes are sometimes cordoned in order to ensure that workloads are only scheduled to healthy Nodes. In most cases, cordoning is eventually removed, making the Node schedulable again. Node cordoning is managed entirely by CoreWeave.
Users should not rely on Node conditions for automation, as the conditions are intended for CoreWeave's internal use only. CoreWeave may cordon Nodes for maintenance or to resolve temporary issues, so conditions are not a reliable basis for building custom management automation.
There are several reasons why a Node may be cordoned:
- Maintenance: If a Node requires maintenance, updates, or hardware fixes, CKS cordons it to ensure no new workloads are placed on it during that time. This allows the Node Life Cycle controller to make necessary changes without disrupting running tasks.
- Node draining for removal: If a Node needs to be removed from the cluster, the Node will be cordoned prior to draining it. Workloads are automatically rescheduled onto healthy Nodes, and no new workloads will be scheduled to the cordoned Node.
- InfiniBand or Ethernet link flaps: Link flaps are intermittent, unpredictable up-down transitions in a network connection, which can result in networking or communication failures. If an InfiniBand or Ethernet link is flapping, a Node can experience inconsistent or unreliable connectivity. In this case, the Node is cordoned to ensure no workloads are scheduled to a Node with an unreliable network connection.
- Temporary health check failures: Kubernetes uses various health checks to assess the state of a user's system. A temporary check failure might indicate transient issues that could degrade Node performance. The Node is cordoned until the issue is resolved.
Users should not assume that cordoned Nodes have substantial or permanent issues. If a cordoned Node is deemed to have a fault that cannot be easily resolved, CKS will move it out of production and into triage.
Cordoning Nodes in these cases allows CKS to prevent disruptions. If you have questions about Node cordoning, or would like to manually cordon Nodes for another reason, please contact support.
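For reference, cordoning is reflected on the Node object itself: a cordoned Node has spec.unschedulable set to true, and kubectl get nodes reports it as SchedulingDisabled. A minimal sketch of the relevant fields is shown below; the Node name is hypothetical.

```yaml
# Abridged view of a cordoned Node; only the relevant fields are shown.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-a1b2c3            # hypothetical Node name
spec:
  unschedulable: true              # set while the Node is cordoned; cleared when it is uncordoned
```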