Managing specialized infrastructure at CoreWeave’s scale is complex. It requires a high degree of automation to efficiently set up infrastructure, validate Nodes before deployment, enhance their performance, and oversee their operation throughout their lifecycle. Our Nodes are stateless and store no data locally, so each Node must receive its specific configuration when it boots. Our automation not only applies these configurations, but goes much further, preemptively identifying and resolving issues before they impact our customers.
Learn why Node lifecycle management is critical for AI applications in our interview with Navarre Pratt, a Solutions Architect at CoreWeave.
CoreWeave manages and optimizes the full lifecycle of each Node, from Day 0 to Day 2 and beyond:
  • Day 0: Initially configuring a new Node at power-on.
  • Day 1: Preparing the Node for its entry into the production fleet.
  • Day 2+: Continually ensuring that the production fleet always operates within set specifications.
The sections that follow cover each phase at a high level. Each Day 1 and Day 2+ section links to a dedicated page for deeper detail. CoreWeave’s advanced automation across the lifecycle minimizes the time to bring new Nodes into the production fleet, maximizes the reliability of the Nodes in a fleet, and minimizes the disruption and downtime to the fleet if and when a Node ultimately fails.

Day 0: Initialization

Day 0 is when CoreWeave executes all the necessary initialization steps to prepare the Node for Day 1 activities. After a Node powers on, it enters CoreWeave’s management cluster, where it discovers essential details such as its boot image and network setup. It also fetches vital cloud-init data, including the Kubernetes API server’s IP address and the Node’s join token. When initialization is complete, the Node is automatically transitioned to the Onboard state.
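As an illustration of the Day 0 handoff described above, here is a minimal Python sketch of checking the fetched bootstrap data before a Node moves on. The field names `api_server_ip` and `join_token`, the `Initializing` state name, and both functions are assumptions made for this example; only the Onboard state comes from the text, and this is not CoreWeave's actual implementation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class BootstrapData:
    """Hypothetical subset of the cloud-init data a Node fetches on Day 0."""
    api_server_ip: str
    join_token: str


def parse_bootstrap(data: dict) -> BootstrapData:
    """Validate the two fields named in the text; raise ValueError if unusable."""
    missing = [k for k in ("api_server_ip", "join_token") if not data.get(k)]
    if missing:
        raise ValueError(f"missing bootstrap fields: {', '.join(missing)}")
    # ip_address() raises ValueError when the string is not a valid IP address.
    ipaddress.ip_address(data["api_server_ip"])
    return BootstrapData(data["api_server_ip"], data["join_token"])


def next_state(data: dict) -> str:
    """Return 'Onboard' when bootstrap data is usable, else stay in a
    hypothetical 'Initializing' state (that name is an assumption)."""
    try:
        parse_bootstrap(data)
    except ValueError:
        return "Initializing"
    return "Onboard"
```

In this sketch a Node only advances once both values are present and the address parses, which mirrors the idea that the transition to Onboard happens after initialization completes.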

Day 1: Preparing for production

Day 1 is the pre-production phase, where CoreWeave automatically moves Nodes through a series of stages including firmware updates, rigorous validation testing, cable verification, and reliability assessments. This process ensures each Node meets CoreWeave’s standards for performance and reliability before joining the production fleet. Learn more about Day 1 validation automation.
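The staged progression described above can be pictured as a gate pipeline: a Node advances only while each stage passes. The Python sketch below is illustrative only. The stage names are taken from the text, but their ordering, the `run_day1` function, and the check callables are assumptions rather than CoreWeave's actual tooling.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; the check takes a Node name and returns
# True on success. Real checks would run firmware tools, tests, and so on.
Stage = Tuple[str, Callable[[str], bool]]


def run_day1(node: str, stages: List[Stage]) -> Optional[str]:
    """Run stages in order and return the name of the first failing stage,
    or None if the Node passed every stage."""
    for name, check in stages:
        if not check(node):
            return name
    return None


# Hypothetical usage with stub checks that always pass.
stages: List[Stage] = [
    ("firmware updates", lambda node: True),
    ("validation testing", lambda node: True),
    ("cable verification", lambda node: True),
    ("reliability assessments", lambda node: True),
]
failed_stage = run_day1("node-a", stages)
```

Stopping at the first failure keeps a Node that fails an early stage from reaching later ones, which is one simple way to ensure only fully validated Nodes join production.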

Day 2+: Continuous production monitoring

Day 2+ is the period when a Node is in production and available to you. CoreWeave continuously ensures all Nodes operate within set specifications, combining active health checks, passive monitoring, and automated InfiniBand validation to keep fleets reliable and high-performing. Learn more about Day 2+ validation automation.

Non-CoreWeave-managed Nodes

For Nodes not running or SUNK, our comprehensive suite of lifecycle automation and validation services offers these essential features:
  • Node lifecycle management: Initial onboarding and the Zap process are available at first delivery. This includes automatic upgrades and configurations for various components such as BMC, BIOS, HMC, and GPUs. However, InfiniBand HCA upgrades are not supported.
  • Passive InfiniBand fault detection: Our system monitors InfiniBand fabric events, transceiver status, and fabric and Node topology. Node link flap events are tracked, but no automatic lifecycle actions are taken in response to these detections.
  • InfiniBand layout and connectivity checks: Comprehensive validation for InfiniBand fabric, including leaf-to-Node cabling and overall topology integrity, is fully supported.
  • Manual InfiniBand connectivity validation: We also perform manual, weekly validation checks to ensure continuous InfiniBand connectivity and performance.
These tailored services ensure that even Nodes outside the CoreWeave-managed ecosystem benefit from critical lifecycle and connectivity validations, maintaining operational excellence and reliability.
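To make the "tracked, but not acted on" behavior of passive fault detection concrete, the following Python sketch tallies link flap events per Node and returns the counts without triggering anything. The event shape (`type` and `node` keys), the `link_flap` label, and the function name are assumptions for illustration and do not reflect CoreWeave's event format.

```python
from collections import Counter
from typing import Iterable


def count_link_flaps(events: Iterable[dict]) -> Counter:
    """Tally link flap events per Node. Purely observational: the result is
    returned for reporting and no lifecycle action is taken."""
    flaps: Counter = Counter()
    for event in events:
        if event.get("type") == "link_flap":
            flaps[event["node"]] += 1
    return flaps


# Hypothetical event stream mixing link flaps with other fabric events.
events = [
    {"type": "link_flap", "node": "node-a"},
    {"type": "transceiver_status", "node": "node-a"},
    {"type": "link_flap", "node": "node-b"},
    {"type": "link_flap", "node": "node-a"},
]
flap_counts = count_link_flaps(events)
```

An operator could review counts like these during the manual validation checks, since no automated response is taken for these Nodes.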
Last modified on April 13, 2026