CoreWeave has developed advanced Day 2+ systems that enhance the reliability and performance of our infrastructure. Through automated validation, continuous monitoring, and rapid remediation, we ensure that every component, from compute Nodes to InfiniBand fabrics, operates efficiently from initial deployment through production.
How it works
Our systems perform ongoing health checks and fabric diagnostics, enabling faster cluster provisioning and early fault detection. This proactive approach keeps our infrastructure running at peak efficiency. To ensure your workloads remain reliable, CoreWeave leverages a combination of active testing and passive monitoring to identify and resolve issues, often before they impact your workloads. The following sections describe active health checks, passive monitoring and remediation, InfiniBand validation, and related practices.
Active health checks
To ensure Node reliability, CoreWeave runs periodic HPC Verification tests that use all GPUs on a Node for approximately 20 minutes. These tests execute once per hour and are visible in Grafana dashboards, where you’ll see a spike in GPU SM Utilization, CPU Utilization, and GPU Memory Utilization metrics. The GPU SM Utilization metric indicates how actively the GPU’s compute cores (streaming multiprocessors) are being used. High values reflect the compute-intensive activity during the test window. These spikes have a consistent pattern across all GPUs on each Node, as shown in the example below.
Some open-source schedulers, like Volcano, may not support automatic eviction of our verification tests. If you’re using a custom scheduler and have questions about active health check behavior, please contact CoreWeave Support.
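When reading the dashboards, it can help to tell a verification run apart from your own workload. The sketch below is one hypothetical way to flag such a window from per-GPU SM utilization samples: it looks for a stretch where every GPU on the Node is busy at the same time. The function name, the 90% threshold, the one-minute sample interval, and the input layout are all assumptions for illustration, not a CoreWeave API or the method our dashboards use.

```python
from typing import Dict, List, Tuple


def find_uniform_busy_windows(
    samples: Dict[str, List[float]],
    threshold: float = 90.0,  # assumed utilization cutoff, in percent
    min_samples: int = 15,    # assumed minimum run length (one sample per minute)
) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges where every GPU is at or above threshold.

    `samples` maps a GPU label to equally spaced SM utilization readings.
    `end` is exclusive. Runs shorter than `min_samples` are ignored.
    """
    if not samples:
        return []
    # Only compare indices that exist for every GPU.
    length = min(len(series) for series in samples.values())
    windows: List[Tuple[int, int]] = []
    start = None
    for i in range(length):
        all_busy = all(series[i] >= threshold for series in samples.values())
        if all_busy and start is None:
            start = i  # a candidate window opens
        elif not all_busy and start is not None:
            if i - start >= min_samples:
                windows.append((start, i))
            start = None  # the window closes
    # Handle a window that is still open at the end of the data.
    if start is not None and length - start >= min_samples:
        windows.append((start, length))
    return windows


# Example: 8 GPUs, one sample per minute, busy together from minute 10 to 30.
node = {f"gpu{i}": [5.0] * 10 + [98.0] * 20 + [5.0] * 30 for i in range(8)}
print(find_uniform_busy_windows(node))  # [(10, 30)]
```

Requiring all GPUs to cross the threshold together mirrors the consistent pattern described above. A workload that uses only some of the GPUs on a Node would not match.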
Passive monitoring
When workloads are active, we collect and analyze both in-band and out-of-band telemetry. We also monitor logs to detect anomalies. When issues arise, we trigger automated remediation through Node lifecycle events.
Automated remediation and Node replacement
Any deviation from specifications automatically triggers a lifecycle event designed to rectify the identified issues, maintaining the fleet’s integrity and performance. For situations where issues cannot be resolved through a predetermined set of remediation strategies, the affected Node is seamlessly transitioned out of production to prevent any potential impact on service quality. Nodes that are removed from your production cluster are automatically replaced with new Nodes, ensuring that the cluster remains at full capacity. Before a failed Node returns to the production fleet, it undergoes the full onboarding suite of tests. This process requires up to 48 hours to verify the Node is ready to resume production workloads.
Automated InfiniBand validation
We test the InfiniBand fabric multiple times daily. Any deviations from the intended topology raise automatic tickets for data center technicians.
Trend analysis
Historical test data is used to identify patterns, predict failures, and fine-tune performance over time.
Manual InfiniBand testing
Our network team also performs weekly manual inspections to catch rare or complex issues that automation might miss.
Why this matters
With CoreWeave’s automated infrastructure lifecycle:
- Node provisioning is faster.
- System health issues are caught earlier.
- Performance bottlenecks are minimized.
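To make the lifecycle concrete, the remediation flow described under "Automated remediation and Node replacement" can be pictured as a small state machine. The sketch below is purely illustrative: the state names, the `next_state` function, and its `healthy` input are hypothetical and do not reflect CoreWeave's internal implementation. In particular, sending a Node that fails onboarding back to the removed state is an assumption.

```python
from enum import Enum


class NodeState(Enum):
    PRODUCTION = "production"
    REMEDIATING = "remediating"  # a lifecycle event is trying to fix a deviation
    REMOVED = "removed"          # out of production; a replacement Node fills the gap
    ONBOARDING = "onboarding"    # full onboarding test suite (up to 48 hours)


def next_state(state: NodeState, healthy: bool) -> NodeState:
    """Advance one hypothetical lifecycle step.

    `healthy` means the latest check, remediation attempt, or test suite
    for the current state succeeded.
    """
    if state is NodeState.PRODUCTION:
        # Any deviation from specifications triggers remediation.
        return NodeState.PRODUCTION if healthy else NodeState.REMEDIATING
    if state is NodeState.REMEDIATING:
        # Unresolved issues move the Node out of production.
        return NodeState.PRODUCTION if healthy else NodeState.REMOVED
    if state is NodeState.REMOVED:
        # A failed Node must pass onboarding before it can return.
        return NodeState.ONBOARDING
    # ONBOARDING: return to the fleet only if the full suite passes (assumed).
    return NodeState.PRODUCTION if healthy else NodeState.REMOVED


# Example: a Node fails a check, remediation fails, then onboarding passes.
state = NodeState.PRODUCTION
for outcome in (False, False, True, True):
    state = next_state(state, outcome)
    print(state.value)
```

The example walks a Node through remediating, removed, onboarding, and back to production, which matches the order described in the text.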