HPC Verification is a proactive, on-Node testing framework that validates hardware and driver integrity beyond what passive monitoring can detect. This system runs automatically on idle compute Nodes to catch and fix issues such as silent data corruption, performance regressions, and thermal deficiencies before they affect customer workloads.
HPC Verification is a fleet-wide safeguard that underpins the CoreWeave SLA by ensuring your AI workloads run on reliable infrastructure. It includes checks for:
- Silent data corruption: Detects memory errors that do not trigger standard system alerts.
- Numerical performance regression: Benchmarks common deep-learning kernels (FP8, FP16, BF16) to detect slowdowns.
- Thermal deficiencies: Runs all streaming multiprocessors at 100% load for approximately 20 minutes to uncover thermal issues.
These active tests complement CoreWeave’s passive telemetry collection by identifying and remediating issues before they affect customer workloads. They are a key part of CoreWeave’s infrastructure lifecycle management, helping maintain reliability and performance for your AI workloads. Because they run only on idle Nodes and stop as soon as a customer workload is scheduled, they don’t interfere with your jobs. If you see a transient hpc-verification Pod in your cluster, a test was running on an idle Node. Kubernetes preempts that Pod to make way for your job as soon as your workload is scheduled.
Test schedule
At the top of each hour, if a Node is idle (no customer workloads are running), CoreWeave runs a 20- to 30-minute verification test. This test uses all GPUs and available InfiniBand resources to exercise the hardware uniformly without interrupting your jobs. The hpc-verification Pods run at a low Kubernetes priority class to allow preemption and avoid blocking customer workloads. If the Node is not idle, CoreWeave skips the test and retries at the next hour mark.
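The hourly cadence described above can be modeled with a short Python sketch. It only illustrates the documented rule (start at an hour mark, and only on an idle Node); it is not CoreWeave's scheduling code, and the function names next_hour_mark and should_start_test are hypothetical.

```python
from datetime import datetime, timedelta


def next_hour_mark(now: datetime) -> datetime:
    """Return the next top-of-hour strictly after `now`."""
    top_of_current_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_current_hour + timedelta(hours=1)


def should_start_test(now: datetime, node_is_idle: bool) -> bool:
    """Model the documented rule: a test starts only at an hour mark on an idle Node.

    If the Node is busy at the hour mark, this returns False and the next
    opportunity is next_hour_mark(now).
    """
    at_hour_mark = now.minute == 0 and now.second == 0
    return at_hour_mark and node_is_idle
```

For example, a Node that is busy at 10:00 is skipped, and next_hour_mark gives 11:00 as the next time a test could start on it.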
If you are using a third-party scheduler to run your workloads, you must confirm that it respects Kubernetes PriorityClasses and preemption. Contact CoreWeave Support or email support@coreweave.com if you have questions about HPC Verification and how it interacts with your scheduler.
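As a first step toward that confirmation, it can help to see which schedulers actually place Pods in your cluster. The sketch below is a minimal example that assumes you have saved the output of kubectl get pods -A -o json and pass it in as a string. It reads the standard spec.schedulerName field of each Pod; the function names are hypothetical. It only reports which schedulers are in use. Whether a given scheduler honors PriorityClasses and preemption still has to be checked against that scheduler's own documentation.

```python
import json
from collections import Counter

# Kubernetes sets spec.schedulerName to this value when a Pod does not name a scheduler.
DEFAULT_SCHEDULER = "default-scheduler"


def schedulers_in_use(pod_list_json: str) -> Counter:
    """Count Pods per spec.schedulerName in a `kubectl get pods -A -o json` dump."""
    items = json.loads(pod_list_json).get("items", [])
    return Counter(
        pod.get("spec", {}).get("schedulerName", DEFAULT_SCHEDULER) for pod in items
    )


def third_party_schedulers(pod_list_json: str) -> list[str]:
    """Return the sorted names of schedulers other than the Kubernetes default."""
    counts = schedulers_in_use(pod_list_json)
    return sorted(name for name in counts if name != DEFAULT_SCHEDULER)
```

An empty list from third_party_schedulers suggests that every Pod in the dump was placed by the default Kubernetes scheduler.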
Customer impact
HPC Verification jobs are not billable and don’t appear in usage reports. They run only when Nodes are idle, so your workloads never share GPU resources with a test. If you launch a workload, the test stops, and CoreWeave releases resources before your Pod starts running. Your job and the test never run at the same time.
Identify the test
Use the following signals to confirm that an hpc-verification Pod or metric spike on your Node comes from HPC Verification rather than a customer workload.
You might briefly see a Pod named hpc-verification-* in kubectl get pods -A while the test runs. The test uses a Kubernetes PriorityClass of cw-hpc-verification with a value of -1, so it always runs at a lower priority than customer workloads.
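The two signals above (the hpc-verification-* name and the cw-hpc-verification PriorityClass) can be checked programmatically. The following is a minimal sketch, assuming the input is the JSON text produced by kubectl get pods -A -o json. The function name find_verification_pods and the shape of the returned records are hypothetical; the fields it reads (metadata.name, metadata.namespace, spec.priorityClassName, spec.nodeName) are standard Pod fields.

```python
import json

# Identifiers taken from the text above.
VERIFICATION_PREFIX = "hpc-verification-"
VERIFICATION_PRIORITY_CLASS = "cw-hpc-verification"


def find_verification_pods(pod_list_json: str) -> list[dict]:
    """Return Pods that match either HPC Verification signal.

    Each record notes which signal matched, so a Pod that matches only one
    of the two can be inspected more closely.
    """
    items = json.loads(pod_list_json).get("items", [])
    matches = []
    for pod in items:
        metadata = pod.get("metadata", {})
        spec = pod.get("spec", {})
        name_matches = metadata.get("name", "").startswith(VERIFICATION_PREFIX)
        class_matches = spec.get("priorityClassName") == VERIFICATION_PRIORITY_CLASS
        if name_matches or class_matches:
            matches.append({
                "namespace": metadata.get("namespace"),
                "name": metadata.get("name"),
                "node": spec.get("nodeName"),
                "name_matches": name_matches,
                "class_matches": class_matches,
            })
    return matches
```

Because the test Pods are short-lived, an empty result only means that no test was running at the moment the Pod list was captured.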
While the test is running, you might notice short spikes in your Grafana dashboards for:
- GPU SM Utilization: Measures how heavily the GPU cores are used.
- CPU Utilization: Shows CPU usage during the test.
- GPU Memory Utilization: Tracks GPU memory usage.
These metrics confirm that the test uniformly exercises every GPU on the Node.
Key points
- HPC Verification tests can’t be disabled. They are a core safeguard that supports the CoreWeave SLA.
- You might briefly see an hpc-verification-* Pod in kubectl get pods -A. This means the test is running on an idle Node. Kubernetes preempts it as soon as your workload starts.
- These tests aren’t billed and don’t appear in usage reports.
If you think HPC Verification has cordoned a Node in your cluster or delayed your job, email support@coreweave.com. Include the cluster name, Node name (from kubectl describe node), and the approximate time. The CoreWeave team checks and, if needed, uncordons or replaces the Node to restore capacity.

Last modified on June 4, 2026