Day 1
Day 1 fleet management of pre-production Nodes
CoreWeave's automated Day 1 operations move each Node through a sequence of states to ready it for production deployment.
Day 1 Node States
After Day 0, the Node transitions to the Onboard state, where a data center technician (DCT) conducts final physical inspections and manages the cabling. After the DCT certifies the Node, CoreWeave automatically initiates Day 1 operations, moving the Node through a sequence of states, starting with Seatrial, to ready it for production deployment.
Seatrial
The Seatrial phase serves as a critical observation period, during which the Node is scrutinized for potential issues. Continuous automated monitoring verifies:
- Cabling: Cables are properly connected to their respective adapters
- Power: All power supplies function within their specified parameters
- Inventory Validation: The correct GPUs, storage, memory, and other essential components are installed
After Seatrial, the Node progresses to the Zap state.
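The inventory validation step can be pictured as comparing what a Node is expected to contain against what it actually reports. The following is a minimal sketch only: the `Inventory` type, its field names, and the `inventory_mismatches` helper are hypothetical and do not describe CoreWeave's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inventory:
    """Hypothetical summary of a Node's hardware (field names are assumptions)."""
    gpus: int
    memory_gib: int
    storage_drives: int


def inventory_mismatches(expected: Inventory, observed: Inventory) -> dict[str, tuple[int, int]]:
    """Return {field: (expected, observed)} for every field that differs."""
    mismatches = {}
    for field in ("gpus", "memory_gib", "storage_drives"):
        want, got = getattr(expected, field), getattr(observed, field)
        if want != got:
            mismatches[field] = (want, got)
    return mismatches
```

An empty result means the observed hardware matches the expected configuration; any entry identifies a component that would need attention.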
Zap
During the Zap state, the Node undergoes a comprehensive firmware upgrade covering the GPU, PCI Retimer, BMC, and BIOS, among other components. This procedure typically takes one to two hours.
- Successful completion of the Zap state advances the Node to the Test state.
- Failure to pass, or a test delay exceeding 6 hours, moves the Node to the Zap Fail state for further analysis.
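The two outcomes above amount to a simple decision rule. The sketch below illustrates it under stated assumptions: the function name, its parameters, and the use of elapsed time as the measure of delay are hypothetical. Only the 6-hour limit and the state names come from the text.

```python
from datetime import timedelta

# The 6-hour limit is taken from the text; everything else is illustrative.
ZAP_DELAY_LIMIT = timedelta(hours=6)


def next_state_after_zap(upgrade_passed: bool, elapsed: timedelta) -> str:
    """Pick the state that follows Zap, per the two outcomes described above."""
    if not upgrade_passed or elapsed > ZAP_DELAY_LIMIT:
        return "Zap Fail"
    return "Test"
```

For example, a Node whose upgrade passes after 90 minutes would advance to Test, while one still pending after 7 hours would move to Zap Fail.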
Test
During the 24-hour Test state, the Node undergoes extensive testing designed to uncover any underlying hardware or software anomalies.
- Passing the Test state means the Node is ready for Production.
- Any issues detected during this phase move the Node to the Triage state.
Production
Nodes that reach the Production state are deemed ready for customer use. Their allocation and cluster assignments are managed by CKS. In the Production state, the Node remains under continuous proactive monitoring, which triggers further life cycle events if any issues are detected. To learn more, see Day 2+.
Triage and RMA
Nodes moved to the Triage state are temporarily sidelined from production. After Triage, the Node is directed either to the RMA state for vendor repairs or to the Debug state for in-depth troubleshooting. If the Node is determined to be ready for redeployment, it is moved to the Onboard state, where it begins a new life cycle.
Nodes that have been refurbished by the vendor in the RMA state are reintroduced into the Onboard state, where they begin a new life cycle.
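The states described on this page can be summarized as a transition table. This is a sketch for illustration, not CoreWeave's implementation: the `State` enum, the `TRANSITIONS` mapping, and `can_transition` are hypothetical names. Only transitions stated in the text are listed, so states whose exits are not described here (Zap Fail, Production) have none. The Debug to Onboard edge is one reading of the redeployment sentence above and is an assumption.

```python
from enum import Enum


class State(Enum):
    ONBOARD = "Onboard"
    SEATRIAL = "Seatrial"
    ZAP = "Zap"
    ZAP_FAIL = "Zap Fail"
    TEST = "Test"
    PRODUCTION = "Production"
    TRIAGE = "Triage"
    RMA = "RMA"
    DEBUG = "Debug"


# Only transitions named in the text; unlisted states have no modeled exits.
TRANSITIONS = {
    State.ONBOARD: {State.SEATRIAL},
    State.SEATRIAL: {State.ZAP},
    State.ZAP: {State.TEST, State.ZAP_FAIL},
    State.TEST: {State.PRODUCTION, State.TRIAGE},
    State.TRIAGE: {State.RMA, State.DEBUG},
    State.RMA: {State.ONBOARD},
    State.DEBUG: {State.ONBOARD},  # assumption: redeployment follows Debug
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if the table above allows moving from src to dst."""
    return dst in TRANSITIONS.get(src, set())
```

Encoding the life cycle as data makes it straightforward to check that a proposed move, such as Seatrial directly to Production, is not part of the described flow.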
Although customers do not interact with Nodes in the Triage and RMA states, CoreWeave's full life cycle automation, which includes these states, gives customers reliable, performant fleets without the time and effort of dealing with Nodes that fail.