Managing specialized infrastructure at CoreWeave’s scale is complex. It requires a high degree of automation to efficiently set up infrastructure, validate Nodes before deployment, enhance their performance, and oversee their operation throughout their lifecycle. Our Nodes operate as stateless entities, without any local data storage. When they boot, the Nodes require programming to get their specific configurations. Our automation not only applies these configurations, but goes much further, preemptively identifying and resolving issues before they impact our customers.
Learn why Node lifecycle management is critical for AI applications in our interview with Navarre Pratt, a Solutions Architect at CoreWeave.
CoreWeave manages the Node lifecycle in three phases:
- Day 0: Initially configuring a new Node at power-on.
- Day 1: Preparing the Node for its entry into the production fleet.
- Day 2+: Continually ensuring that the production fleet always operates within set specifications.
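The three phases above can be pictured as a forward-only progression. The following Python sketch is illustrative only: the `Phase` enum, the `advance` function, and the `checks_passed` flag are hypothetical names chosen for this example, not CoreWeave APIs, and the rule that a Node stays in its current phase when checks fail is a simplifying assumption.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical labels for the three lifecycle phases listed above."""
    DAY_0 = "Day 0: initial configuration at power-on"
    DAY_1 = "Day 1: preparation for the production fleet"
    DAY_2_PLUS = "Day 2+: continuous production monitoring"


# Phases in the order a Node moves through them.
_ORDER = [Phase.DAY_0, Phase.DAY_1, Phase.DAY_2_PLUS]


def advance(phase: Phase, checks_passed: bool) -> Phase:
    """Return the next phase if the current phase's checks passed.

    Assumption for this sketch: a Node whose checks fail stays in its
    current phase, and Day 2+ has no successor because monitoring is ongoing.
    """
    if not checks_passed:
        return phase
    idx = _ORDER.index(phase)
    return _ORDER[min(idx + 1, len(_ORDER) - 1)]
```

For example, `advance(Phase.DAY_0, True)` returns `Phase.DAY_1`, while `advance(Phase.DAY_1, False)` leaves the Node in `Phase.DAY_1`.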
Day 0: Initialization
Day 0 is when CoreWeave executes all the necessary initialization steps to prepare the Node for Day 1 activities. After a Node powers on, it enters CoreWeave’s management cluster, where it discovers essential details such as its boot image and network setup. It also fetches vital cloud-init data, including the Kubernetes API server’s IP address and the Node’s join token. When initialization completes, the Node automatically transitions to the Onboard state.
Day 1: Preparing for production
Day 1 is the pre-production phase, where CoreWeave automatically moves Nodes through a series of stages including firmware updates, rigorous validation testing, cable verification, and reliability assessments. This process ensures each Node meets CoreWeave’s standards for performance and reliability before joining the production fleet. Learn more about Day 1 validation automation.
Day 2+: Continuous production monitoring
Day 2+ is the period when a Node is in production and available to you. CoreWeave continuously ensures that all Nodes operate within set specifications, combining active health checks, passive monitoring, and automated InfiniBand validation to keep fleets reliable and high-performing. Learn more about Day 2+ validation automation.
Non-CoreWeave-managed Nodes
For Nodes not running or SUNK, our comprehensive suite of lifecycle automation and validation services offers these essential features:
- Node lifecycle management: Initial onboarding and the Zap process are available at first delivery. This includes automatic upgrades and configurations for various components such as BMC, BIOS, HMC, and GPUs. However, upgrades for InfiniBand HCA are not supported.
- Passive InfiniBand fault detection: Our system monitors InfiniBand fabric events, transceiver status, and fabric and Node topology. Node link flap events are tracked, but no automatic lifecycle actions are performed in response to these detections.
- InfiniBand layout and connectivity checks: Comprehensive validation for InfiniBand fabric, including leaf-to-Node cabling and overall topology integrity, is fully supported.
- Manual InfiniBand connectivity validation: We also perform manual, weekly validation checks to ensure continuous InfiniBand connectivity and performance.
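To illustrate what tracking link flap events without acting on them could look like, here is a minimal Python sketch. It is not CoreWeave's implementation: the `LinkFlapTracker` class, its one-hour default window, and the Node names are all hypothetical, and it assumes events for each Node arrive in timestamp order.

```python
from collections import defaultdict, deque


class LinkFlapTracker:
    """Hypothetical passive tracker: records link flap events per Node.

    It only counts events inside a sliding time window; it never triggers
    any lifecycle action, mirroring the "tracked but not acted on" behavior
    described in the list above.
    """

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        # Maps Node name -> timestamps (in seconds) of recent flap events.
        self._events = defaultdict(deque)

    def record(self, node: str, timestamp: float) -> None:
        """Store one link flap event for the given Node."""
        self._events[node].append(timestamp)

    def count(self, node: str, now: float) -> int:
        """Return how many flaps the Node had within the window ending at `now`."""
        events = self._events[node]
        cutoff = now - self.window_seconds
        # Drop events that are older than the window.
        while events and events[0] < cutoff:
            events.popleft()
        return len(events)
```

An operator could read these counts during a manual review, since in this sketch nothing happens automatically when a count grows.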