HPC Verification
Active health checks for Node reliability
HPC Verification is a proactive, on-Node testing framework that validates hardware and driver integrity beyond what passive monitoring can detect. This system runs automatically on idle compute Nodes to catch and fix issues such as silent data corruption, performance regressions, and thermal deficiencies before they impact customer workloads.
HPC Verification is a fleet-wide safeguard that underpins our SLA by ensuring your AI workloads run on robust and reliable infrastructure. It identifies hardware and driver issues that passive monitoring alone cannot detect. It includes checks for:
- Silent data corruption: Detects memory errors that do not trigger standard system alerts
- Numerical performance regression: Benchmarks common deep-learning kernels (FP8, FP16, BF16) to detect slowdowns
- Thermal deficiencies: Runs all streaming multiprocessors at 100% load for approximately 20 minutes to uncover thermal issues
These active tests complement our passive telemetry collection by identifying and remediating issues before they affect customer workloads. They are a key part of our infrastructure lifecycle management, enabling CoreWeave to maintain high reliability and performance for your AI workloads. Because they run only on idle Nodes and terminate immediately when a customer workload is scheduled, they do not interfere with your jobs. If you see a transient `hpc-verification` Pod in your cluster, it ran on an otherwise idle Node and will be preempted immediately to make way for your job.
Scheduling
At the top of each hour, if a Node is idle (no customer workloads are running), CoreWeave runs a 20-30 minute verification test. This test utilizes all GPUs and available InfiniBand resources to ensure the hardware is fully and uniformly exercised. If any user Pods or Slurm jobs are present, the test is skipped and retried at the next hour mark, so the verification process never interferes with your jobs.
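Conceptually, the hourly gate reduces to "run only if no customer Pods are on the Node." The sketch below is a hypothetical illustration of that check; it operates on a simulated Pod listing rather than a live cluster, and it is not CoreWeave's actual scheduler logic.

```shell
# Hypothetical illustration of the hourly idleness gate (not CoreWeave's
# actual implementation). "pods" simulates "namespace pod-name" lines for
# one Node; on a real cluster this would come from:
#   kubectl get pods -A --field-selector spec.nodeName=<node>
pods='default   hpc-verification-x7k2q'

# Count Pods that are NOT the verification job itself.
customer_pods=$(printf '%s\n' "$pods" | grep -cv 'hpc-verification' || true)

if [ "$customer_pods" -eq 0 ]; then
  echo 'Node idle: verification test may start'
else
  echo "Skip: ${customer_pods} customer Pod(s) present; retry next hour"
fi
```

With any customer Pod present, the count is nonzero and the test is skipped until the next hour mark.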
Customer impact
HPC Verification jobs are not billable and do not appear in usage reports. They run only when Nodes are idle, so your workloads never share GPU resources with a test. If you launch a workload, the test stops instantly, and resources are released before your Pod starts running. Your job and the test never run at the same time.
Identifying the test
You may briefly see a Job named `hpc-verification-*` in `kubectl get pods -A` while the test runs. The test uses a Kubernetes PriorityClass of `cw-hpc-verification` with a value of `-1`, meaning it always runs at a lower priority than customer workloads.
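For reference, a negative-value PriorityClass looks like the following manifest. This is an illustrative sketch, not the exact object CoreWeave deploys: the name and value match the ones described above, but the remaining fields are assumptions.

```yaml
# Illustrative sketch only; fields other than name and value are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: cw-hpc-verification
value: -1            # below the default priority (0) of customer workloads
globalDefault: false
description: "Runs HPC Verification below customer workload priority."
```

Because customer Pods default to a higher priority, the Kubernetes scheduler preempts the verification Pod whenever your workload needs the Node.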
While the test is running, you might notice short spikes in your Grafana dashboards for:
- GPU SM Utilization: Measures how heavily the GPU cores are being used
- CPU Utilization: Shows CPU usage during the test
- GPU Memory Utilization: Tracks GPU memory usage
These metrics confirm that the test uniformly exercises every GPU on the Node.
Key points
- HPC Verification tests cannot be disabled. They are a core safeguard that supports our SLA.
- You may briefly see an `hpc-verification-*` Pod in `kubectl get pods -A`. This means the test is running on an idle Node; it will be preempted as soon as your workload starts.
- These tests are not billed and do not appear in usage reports.
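To spot the verification Pod quickly, you can filter the Pod list by name. The snippet below runs against simulated `kubectl get pods -A` output so it works anywhere; on a live cluster you would pipe the real command into the same filter. The namespace and Pod name suffix shown are hypothetical.

```shell
# Simulated `kubectl get pods -A` output; on a live cluster, replace the
# printf with the real command. Namespace and Pod suffix are hypothetical.
printf '%s\n' \
  'NAMESPACE     NAME                       READY   STATUS    AGE' \
  'kube-system   coredns-5d78c9869d-abcde   1/1     Running   9d' \
  'default       hpc-verification-x7k2q     1/1     Running   4m' |
  grep 'hpc-verification'
```

If the filter returns nothing, no verification test is currently running in the cluster.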
If you think HPC Verification has cordoned a Node in your cluster or delayed your job, email [email protected]. Include the Cluster name, Node name (from `kubectl describe node`), and the approximate time. Our team will check and, if needed, uncordon or replace the Node to restore full capacity.