Learn why Node lifecycle management is critical for AI applications in our interview with Navarre Pratt, a Solutions Architect at CoreWeave.
- Day 0: Initial configuration of a new Node at power-on.
- Day 1: Preparation of the Node for entry into the production fleet.
- Day 2+: Continual assurance that the production fleet always operates within set specifications.
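As a rough illustration, the three phases above can be modeled as a small state machine. This is a minimal sketch, not CoreWeave's implementation: the `Phase` enum, the `NEXT_PHASE` table, and the `checks_passed` flag are hypothetical names introduced here, and the rule that a Node advances only after its current phase succeeds is an assumption.

```python
from enum import Enum


class Phase(Enum):
    """Lifecycle phases named in the list above (hypothetical model)."""
    DAY_0 = "initial configuration"
    DAY_1 = "preparation for production"
    DAY_2_PLUS = "continuous production monitoring"


# Assumed transition table: each phase leads to the next, and Day 2+ is ongoing.
NEXT_PHASE = {
    Phase.DAY_0: Phase.DAY_1,
    Phase.DAY_1: Phase.DAY_2_PLUS,
    Phase.DAY_2_PLUS: Phase.DAY_2_PLUS,
}


def advance(phase: Phase, checks_passed: bool) -> Phase:
    """Return the next phase if the current phase's work succeeded, else stay put."""
    return NEXT_PHASE[phase] if checks_passed else phase
```

In this sketch, a Node whose checks fail stays in its current phase instead of moving forward.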
Day 0: Initialization
Day 0 is when CoreWeave executes all the initialization steps needed to prepare the Node for Day 1 activities. After a Node powers on, it enters CoreWeave’s management cluster, where it receives configuration details such as its boot image and network setup. It also fetches cloud-init data, including the Kubernetes API server’s IP address and the Node’s join token. When initialization completes, the Node automatically transitions to the Onboard state.
Day 1: Preparation for production
Day 1 is the pre-production phase, where CoreWeave automatically moves Nodes through a series of stages, including firmware updates, validation testing, cable verification, and reliability assessments. This process ensures each Node meets CoreWeave’s standards for performance and reliability before it joins the production fleet. Learn more about Day 1 validation automation.
Day 2+: Continuous production monitoring
Day 2+ is the period when a Node is in production and available to you. CoreWeave continuously verifies that Nodes operate within set specifications, combining active health checks, passive monitoring, and automated InfiniBand validation to keep fleets reliable and performant. Learn more about Day 2+ validation automation.
Non-CoreWeave-managed Nodes
This section describes which lifecycle automation and validation services are available for Nodes that aren’t managed by CoreWeave’s Kubernetes services. For Nodes not running CoreWeave’s Kubernetes services or SUNK, CoreWeave’s lifecycle automation and validation services offer these features:
- Node lifecycle management: Initial onboarding and the Zap process are available at first delivery. This includes automatic upgrades and configuration for components such as BMC, BIOS, HMC, and GPUs. However, InfiniBand HCA upgrades aren’t supported.
- Passive InfiniBand fault detection: The system monitors InfiniBand fabric events, transceiver status, and fabric and Node topology. Node link flap events are tracked, but no automatic lifecycle actions are taken in response.
- InfiniBand layout and connectivity checks: Validation of the InfiniBand fabric, including leaf-to-Node cabling and topology integrity, is supported.
- Manual InfiniBand connectivity validation: CoreWeave also performs manual, weekly validation checks to confirm continuous InfiniBand connectivity and performance.
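The passive fault detection item above tracks Node link flap events without acting on them. The sketch below shows one way such tracking might look, assuming a sliding time window and a flap-count threshold. The `LinkFlapTracker` class, its parameters, and the Node names are hypothetical and do not come from CoreWeave's tooling.

```python
from collections import defaultdict, deque


class LinkFlapTracker:
    """Counts link flap events per Node inside a sliding time window (hypothetical)."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # Node name -> recent event timestamps

    def record(self, node: str, timestamp: float) -> None:
        """Store one flap event; timestamps are assumed to arrive in increasing order."""
        events = self._events[node]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()

    def flagged(self) -> list[str]:
        """Nodes at or above the threshold. Reporting only: no lifecycle action is taken."""
        return sorted(
            node for node, events in self._events.items()
            if len(events) >= self.threshold
        )
```

The tracker only reports which Nodes crossed the threshold, mirroring the passive behavior described above. Because old events are pruned only when a new event is recorded for the same Node, a production version would likely also prune at query time.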