
Day 2+

An automated lifecycle management and validation platform

CoreWeave has engineered Day 2+ monitoring, testing, and remediation to elevate the performance and reliability of our infrastructure. Automating Node validation, monitoring, and optimization is critical to keeping our fleet operating at peak efficiency from deployment through runtime. We run health assessments and continuous monitoring at both the Node and InfiniBand fabric levels, which enables quicker provisioning and more effective error detection.

By simplifying infrastructure setup and ongoing management, CoreWeave's Day 2+ automation enhances platform stability and performance. This efficiency empowers our customers to focus more on developing, training, and deploying their models, accelerating their time to market.

How it works

CoreWeave's Day 2+ automation takes a dual approach to monitoring: active health checks while Nodes are idle and passive monitoring while Nodes are in use. It examines in-band and out-of-band metrics, along with system logs, for anomalies. Any detected issue triggers a Node lifecycle event, which prompts immediate remediation. It also runs automated InfiniBand fabric tests to verify that the fabric remains dependable.
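The dual approach described above can be illustrated with a minimal Python sketch. It is not CoreWeave's implementation: the `Node` type, the metric names, the thresholds, and the shape of the emitted event are all hypothetical, chosen only to show how an idle Node might get an active check while a busy Node is only observed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class NodeState(Enum):
    IDLE = "idle"
    ACTIVE = "active"


@dataclass
class Node:
    name: str
    state: NodeState
    metrics: Dict[str, float]  # hypothetical in-band/out-of-band readings


# Hypothetical upper bounds; the source does not publish real thresholds.
THRESHOLDS = {"gpu_temp_c": 85.0, "ecc_errors": 0.0}


def run_active_checks(node: Node) -> List[str]:
    """Stand-in for an active test on an idle Node.

    A real check would exercise the hardware; this one only confirms
    that every expected metric is being reported.
    """
    return [f"missing metric: {key}" for key in THRESHOLDS if key not in node.metrics]


def scan_passively(node: Node) -> List[str]:
    """Inspect already-collected metrics without disturbing the workload."""
    findings = []
    for key, limit in THRESHOLDS.items():
        value = node.metrics.get(key)
        if value is not None and value > limit:
            findings.append(f"{key}={value} exceeds {limit}")
    return findings


def evaluate(node: Node) -> Optional[dict]:
    """Return a hypothetical lifecycle event if anything looks wrong."""
    if node.state is NodeState.IDLE:
        findings = run_active_checks(node)
    else:
        findings = scan_passively(node)
    if not findings:
        return None
    return {"node": node.name, "event": "lifecycle", "reasons": findings}
```

Calling `evaluate` on a busy Node whose `gpu_temp_c` reading is above the assumed limit returns an event dictionary that a remediation system could act on; a healthy Node returns `None`.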

  • Continuous Health Checks: CoreWeave Kubernetes Service (CKS) Nodes undergo continuous health assessments to ensure they're always ready for production.
  • Passive Monitoring: Beyond active testing, our automation continuously watches all metrics and system logs and responds to alerts or deviations by initiating Node lifecycle events to resolve the issue.
  • Automated InfiniBand Validation: A robust automated system validates the InfiniBand topology multiple times daily, ensuring all connections align with the network design. Detected discrepancies generate tickets for data center technicians to address.
  • Trend Analysis: Stored test outcomes facilitate trend analysis, enabling us to anticipate failures and fine-tune the system for enhanced performance and reliability.
  • Manual InfiniBand Testing: Each week, our InfiniBand team conducts thorough evaluations to identify irregular metrics and keep the fabric performing optimally.
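As an illustration of the automated InfiniBand validation step, the sketch below compares an observed set of links against the designed topology and produces one ticket description per discrepancy. The port-naming scheme, the input format, and the `diff_topology` function are assumptions made for this example; the source does not describe how the actual validation system is built.

```python
from typing import Iterable, List, Tuple

Link = Tuple[str, str]  # hypothetical "device:port" endpoint pair


def diff_topology(designed: Iterable[Link], observed: Iterable[Link]) -> List[str]:
    """Return ticket descriptions for links that differ from the design.

    Links are treated as undirected, so ("a", "b") equals ("b", "a").
    """
    designed_set = {frozenset(link) for link in designed}
    observed_set = {frozenset(link) for link in observed}

    tickets = []
    # Links the design expects but the fabric does not report.
    for link in sorted(designed_set - observed_set, key=sorted):
        tickets.append("missing link: " + " <-> ".join(sorted(link)))
    # Links the fabric reports but the design does not contain.
    for link in sorted(observed_set - designed_set, key=sorted):
        tickets.append("unexpected link: " + " <-> ".join(sorted(link)))
    return tickets


if __name__ == "__main__":
    designed = [("leaf1:p1", "node-a:ib0"), ("leaf1:p2", "node-b:ib0")]
    observed = [("node-a:ib0", "leaf1:p1"), ("leaf1:p2", "node-c:ib0")]
    for ticket in diff_topology(designed, observed):
        print(ticket)
```

Treating each link as an unordered pair keeps the comparison independent of which end of a cable is listed first, so only genuine cabling differences surface as tickets.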

Thanks to this automation, CoreWeave can provision Nodes more rapidly, identify issues sooner, and preempt performance bottlenecks, allowing our customers to dedicate more time to their core activities and bring their products to market more efficiently.