Node Life Cycle
CoreWeave's management of a Node's life cycle
Managing specialized infrastructure at CoreWeave's scale is complex. It requires a high degree of automation to efficiently set up infrastructure, validate Nodes before deployment, enhance their performance, and oversee their operation throughout their life cycle. Our Nodes are stateless and have no local data storage, so each Node must be given its specific configuration when it boots. Our automation applies this configuration and also goes much further, identifying and resolving issues before they impact our customers.
Learn why Node lifecycle management is critical for AI applications in our interview with Navarre Pratt, a Solutions Architect at CoreWeave.
CoreWeave manages and optimizes the full life cycle of each Node, from Day 0 to Day 2 and beyond:
- Day 0: Initially configuring a new Node at power-on.
- Day 1: Preparing the Node for its entry into the production fleet.
- Day 2+: Continually ensuring that the production fleet always operates within set specifications.
CoreWeave's advanced automation across the life cycle minimizes the time to bring new Nodes into the production fleet, maximizes the reliability of the Nodes in a fleet, and minimizes the disruption and downtime to the fleet if and when a Node ultimately fails.
Day 0: Initialization
Day 0 is when CoreWeave executes all the necessary initialization steps to prepare the Node for Day 1 activities.
After a Node powers on, it enters CoreWeave's management cluster, where it discovers essential details such as its boot image and network setup. It also fetches its cloud-init data, including the Kubernetes API server's IP address and the Node's join token. When these steps are complete, the Node automatically transitions to the Onboard state.
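The Day 0 hand-off described above can be pictured as a simple completeness check. The following is a minimal sketch only, not CoreWeave's implementation: the `BootstrapData` class, its field names, and the `Initializing` state name are hypothetical, chosen to mirror the items named in the text (boot image, network setup, API server IP address, and join token).

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Optional


@dataclass
class BootstrapData:
    """Hypothetical container for what a Node discovers on Day 0."""
    boot_image: Optional[str] = None
    network_config: Optional[dict] = None
    api_server_ip: Optional[str] = None  # delivered with the cloud-init data
    join_token: Optional[str] = None     # delivered with the cloud-init data


def missing_fields(data: BootstrapData) -> list[str]:
    """Return the names of Day 0 items that are absent or malformed."""
    missing = [name for name, value in vars(data).items() if not value]
    if data.api_server_ip:
        try:
            ip_address(data.api_server_ip)
        except ValueError:
            # Present but not a valid IP address, so treat it as missing.
            missing.append("api_server_ip")
    return missing


def next_state(data: BootstrapData) -> str:
    """Move to Onboard only when every Day 0 item has been discovered.

    "Initializing" is an assumed name for the pre-Onboard state.
    """
    return "Onboard" if not missing_fields(data) else "Initializing"
```

In this sketch, a Node that has its boot image but has not yet fetched its cloud-init data stays in the assumed `Initializing` state, and `missing_fields` reports which items are still outstanding.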
Day 1: Preparing for Production
Day 1 is the pre-production phase, where CoreWeave automatically runs Nodes through an intense battery of tests to prepare them for delivery to customers.
After reaching the Onboard state, the Node is moved through a series of stages that include firmware updates, rigorous validation testing, cable verification, and a suite of other reliability assessments. This process ensures that each Node meets CoreWeave's high standards for performance and reliability, and is prepared to join the production fleet.
Thanks to this automated process, CoreWeave seamlessly provisions Nodes around the clock, ensuring a constant state of readiness and operational excellence.
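As an illustration of the staged Day 1 flow, the sketch below runs a Node through an ordered list of stages and stops at the first failure. It is a simplified model under stated assumptions: the stage names paraphrase the text, the check functions are placeholders, and `run_onboarding` is a hypothetical helper rather than a CoreWeave API.

```python
from typing import Callable

# A stage is a (name, check) pair; check returns True when the Node passes.
Stage = tuple[str, Callable[[str], bool]]


def run_onboarding(node: str, stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (ready_for_production, names_of_stages_that_passed).
    """
    passed: list[str] = []
    for name, check in stages:
        if not check(node):
            return False, passed
        passed.append(name)
    return True, passed


def always_pass(node: str) -> bool:
    """Placeholder check; a real stage would run firmware or hardware tests."""
    return True


# Stage order follows the text; the real pipeline may contain more stages.
DAY1_STAGES: list[Stage] = [
    ("firmware_update", always_pass),
    ("validation_testing", always_pass),
    ("cable_verification", always_pass),
    ("reliability_assessment", always_pass),
]
```

Calling `run_onboarding("node-a", DAY1_STAGES)` returns a flag indicating whether the Node is ready for production, along with the stages it passed, which makes it easy to see where a failing Node stopped.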
Learn more about our Day 1 validation automation.
Day 2+: Continuous Production Monitoring
Day 2+ is the period when a Node is in production and available to a customer. CoreWeave ensures all Nodes are operating within set specifications, not only when the Nodes are delivered into production, but continuously afterward, so that customers always get the most value from their Node fleets.
Any deviation from our specifications automatically triggers a lifecycle event designed to rectify the identified issue, maintaining the fleet's integrity and performance and upholding our standard of reliability. If an issue cannot be resolved by the predetermined set of remediation strategies, the affected Node is transitioned out of production to prevent any impact on service quality.
Nodes that are removed from a customer's production cluster are automatically replaced with new Nodes, ensuring that the cluster remains at full capacity. Before a failed Node returns to the production fleet, it undergoes the full onboarding suite of tests. This process can take up to 48 hours to verify that the Node is ready to resume production workloads.
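The Day 2+ behavior described above (try the predetermined remediations, remove the Node from production if none succeeds, then backfill the cluster) can be sketched as follows. This is an illustrative model only: the function names, state strings, and the idea of a spare pool are assumptions made for the example.

```python
from typing import Callable, Iterable


def handle_deviation(node: str, remediations: Iterable[Callable[[str], bool]]) -> str:
    """Try each remediation in order; return the Node's resulting state.

    Each remediation returns True when it brings the Node back within spec.
    """
    for remediate in remediations:
        if remediate(node):
            return "in_production"
    # No predetermined strategy worked, so the Node leaves production.
    return "removed_from_production"


def replace_failed(cluster: list[str], failed: set[str], spares: list[str]) -> list[str]:
    """Return a new cluster list with failed Nodes swapped for spares.

    Assumption: if the spare pool runs out, the failed slot is left empty.
    """
    spare_iter = iter(spares)
    result: list[str] = []
    for node in cluster:
        if node not in failed:
            result.append(node)
            continue
        replacement = next(spare_iter, None)
        if replacement is not None:
            result.append(replacement)
    return result
```

In this model, a removed Node would then re-enter the onboarding tests before it could be used as a spare again, matching the text's requirement that failed Nodes pass the full onboarding suite before returning to production.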
Learn more about our Day 2+ validation automation.
Non-CoreWeave-managed Nodes
For Nodes not running CKS or SUNK (Slurm on Kubernetes), our comprehensive suite of lifecycle automation and validation services offers these essential features:
- Node life cycle management: Initial onboarding and the Zap process are available at first delivery. This includes automatic upgrades and configuration for components such as the BMC, BIOS, HMC, and GPUs. Upgrades for InfiniBand HCAs are not supported.
- Passive InfiniBand fault detection: Our system monitors InfiniBand fabric events, transceiver status, and fabric and Node topology. Node link flap events are tracked, but no automatic lifecycle actions are taken in response to these detections.
- InfiniBand layout and connectivity checks: Comprehensive validation for InfiniBand fabric, including leaf-to-Node cabling and overall topology integrity, is fully supported.
- Manual InfiniBand connectivity validation: We also perform manual, weekly validation checks to ensure continuous InfiniBand connectivity and performance.
These tailored services ensure that even Nodes outside the CoreWeave-managed ecosystem benefit from critical lifecycle and connectivity validations, maintaining operational excellence and reliability.
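To make the "passive" part of InfiniBand fault detection concrete, the sketch below tracks link flap events per Node and reports Nodes that exceed a threshold, without taking any lifecycle action. The `LinkFlapTracker` class, the one-hour window, and the threshold of three flaps are all hypothetical values chosen for illustration; the source does not state how flaps are counted.

```python
from collections import defaultdict, deque


class LinkFlapTracker:
    """Report-only tracker: records link flaps and never acts on them."""

    def __init__(self, window_seconds: float = 3600.0, threshold: int = 3):
        # Both defaults are assumptions for this example.
        self.window = window_seconds
        self.threshold = threshold
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, node: str, timestamp: float) -> None:
        """Record one flap; timestamps are assumed non-decreasing per Node."""
        events = self._events[node]
        events.append(timestamp)
        # Discard flaps that have aged out of the sliding window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()

    def flagged(self) -> list[str]:
        """Return Nodes whose recent flap count meets the threshold."""
        return sorted(
            node for node, events in self._events.items()
            if len(events) >= self.threshold
        )
```

Because the tracker only exposes `flagged()`, any follow-up in this sketch is left to an operator, which mirrors the text's point that detections on these Nodes do not trigger automatic lifecycle actions.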