HPC Verification is a proactive, on-Node testing framework that validates hardware and driver integrity beyond what passive monitoring can detect. It runs automatically on idle compute Nodes to catch issues such as silent data corruption, performance regressions, and thermal deficiencies before they impact customer workloads. HPC Verification is a fleet-wide safeguard that underpins our SLA by ensuring your AI workloads run on robust and reliable infrastructure. It includes checks for:
- Silent data corruption: Detects memory errors that do not trigger standard system alerts
- Numerical performance regression: Benchmarks common deep-learning kernels (FP8, FP16, BF16) to detect slowdowns
- Thermal deficiencies: Runs all streaming multiprocessors at 100% load for approximately 20 minutes to uncover thermal issues
If you have ever noticed an hpc-verification Pod in your cluster, it was preempted immediately to make way for your job.
Scheduling
At the top of each hour, if a Node is idle (no customer workloads are running), CoreWeave runs a 20–30 minute verification test. This test uses all GPUs and available InfiniBand resources to ensure the hardware is fully and uniformly exercised without interrupting your jobs. The hpc-verification Pods run at a low Kubernetes priority class to allow preemption and avoid blocking customer workloads. If the Node is not idle, the test is skipped and will retry at the next hour mark. This design ensures that the verification process does not interfere with your jobs.
If you are using a third-party scheduler to run your workloads, confirm that it respects Kubernetes PriorityClasses and preemption. Please contact CoreWeave Support or email [email protected] if you have questions about HPC Verification and how it interacts with your scheduler.
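To see how verification Pods are prioritized relative to your own workloads, you can inspect the cluster's PriorityClasses directly. A minimal sketch, assuming you have kubectl access to the cluster (output columns vary by kubectl version):

```shell
# List all PriorityClasses in the cluster; customer workloads typically
# run at value 0 or higher, while the verification class sits below them.
kubectl get priorityclass

# Inspect the verification class itself (name taken from this page).
kubectl get priorityclass cw-hpc-verification -o yaml
```

These are read-only commands run against a live cluster, so the exact output depends on your environment.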
Customer impact
HPC Verification jobs are not billable and do not appear in usage reports. They run only when Nodes are idle, so your workloads never share GPU resources with a test. If you launch a workload, the test stops instantly, and resources are released before your Pod starts running. Your job and the test never run at the same time.
Identifying the test
You may briefly see a Job named hpc-verification-* in kubectl get pods -A while the test runs. The test uses a Kubernetes PriorityClass of cw-hpc-verification with a value of -1, meaning it always runs at a lower priority than customer workloads.
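If you want to confirm this yourself, the commands below show one way to spot a running test and check its priority. This is a sketch: the Pod's namespace is not documented here, so substitute the namespace and Pod name you actually observe:

```shell
# Spot a running verification Pod, if any.
kubectl get pods -A | grep hpc-verification

# Confirm the Pod's PriorityClass (substitute the namespace and Pod name
# reported by the command above).
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}'
```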
While the test is running, you might notice short spikes in your Grafana dashboards for:
- GPU SM Utilization: Measures how heavily the GPU cores are being used
- CPU Utilization: Shows CPU usage during the test
- GPU Memory Utilization: Tracks GPU memory usage
Key points
- HPC Verification tests cannot be disabled. They are a core safeguard that supports our SLA.
- You may briefly see an hpc-verification-* Pod in kubectl get pods -A. This means the test is running on an idle Node. It will be preempted as soon as your workload starts.
- These tests are not billed and do not appear in usage reports.
- If you notice a Node that remains cordoned or unavailable after a test, contact CoreWeave Support or email [email protected]. Include the Cluster name, Node name (from kubectl describe node), and the approximate time. Our team will check and, if needed, uncordon or replace the Node to restore full capacity.
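When filing that report, the commands below are one convenient way to gather the requested details; <node-name> is a placeholder for the affected Node:

```shell
# Node details: the Node name, conditions, and any cordon/taint status
# appear near the top of the output.
kubectl describe node <node-name>

# Current kubectl context, which usually identifies the Cluster name.
kubectl config current-context

# Approximate time of the observation, in UTC.
date -u
```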