CoreWeave Kubernetes Service (CKS) provides hardware transparency and observability with built-in telemetry. Hardware observability is available immediately on deployment and requires no setup or configuration. CKS includes the following hardware observability features:
- Managed telemetry platform: CoreWeave manages a high-performance platform for ingesting, storing, and exposing telemetry data.
- Purpose-built dashboards: CoreWeave provides curated dashboards informed by its experience managing large GPU supercomputing fleets.
- No additional cost: All observability features are included for CKS customers at no additional charge.
Hardware telemetry layers
The following sections describe the telemetry available across different layers of the hardware stack.
GPU chip telemetry
CKS provides chip-level metrics through NVIDIA Data Center GPU Manager (DCGM), which is pre-installed on all servers. You can query these metrics through the CoreWeave metrics query service or view them in the Node Details dashboard.
These metrics help identify failures, performance issues, and power inefficiencies. Use them to optimize job performance and resource utilization. CoreWeave provides pre-configured alerts for critical thresholds.
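As a rough illustration of querying a chip-level metric, the sketch below builds a Prometheus-style instant query and extracts per-GPU values from a response. It assumes the metrics query service accepts PromQL over a Prometheus-compatible HTTP API, which this page does not confirm. The base URL is a placeholder, the label names and the 75-degree threshold are illustrative, and `DCGM_FI_DEV_GPU_TEMP` is the GPU temperature metric name used by NVIDIA's dcgm-exporter, which may differ from what CKS exposes.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the real query service URL.
BASE_URL = "https://metrics.example.invalid"


def build_instant_query_url(promql: str, base_url: str = BASE_URL) -> str:
    """Return a Prometheus-style instant query URL for the given PromQL."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def gpu_values(response_body: str) -> dict:
    """Map (host, gpu) label pairs to the sampled value of a vector result."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    values = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Label names are assumptions; adjust to the labels your metrics carry.
        key = (labels.get("Hostname", "unknown"), labels.get("gpu", "unknown"))
        values[key] = float(series["value"][1])
    return values


# Made-up response in the Prometheus instant-query format.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"Hostname": "node-a", "gpu": "0"}, "value": [1700000000, "61"]},
        {"metric": {"Hostname": "node-a", "gpu": "1"}, "value": [1700000000, "78"]},
    ]},
})

# Keep only GPUs above an illustrative temperature threshold.
hot = {key: temp for key, temp in gpu_values(SAMPLE).items() if temp > 75}
print(build_instant_query_url("DCGM_FI_DEV_GPU_TEMP"))
print(hot)
```

The parsing step is kept separate from the HTTP call so that it can be exercised against a saved response before pointing it at a live endpoint.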
Server telemetry
CKS offers transparency at the physical server level, providing metrics and statuses for the entire server.
| Category | Details |
|---|---|
| IPMI Metrics | Electrical current; Power; Power consumption (server-wide); Voltage; Temperature; Fan speed; Fan speed ratio; System Event Log (SEL) free space; System Event Log (SEL) log count |
| IPMI Statuses | Chassis power state; Electrical current sensor state; Fan speed sensor state; Power sensor state; Cable/interconnect VGA cable presence; Power supply status; Serial cable status; PCIe slot critical interrupt; Physical security intrusion; Fan redundancy state; CMOS battery state; Temperature sensor state; Voltage sensor state |
| Node Problem and Lifecycle Conditions | A set of conditions from kernel logs and temperature metrics that offer insight into node problems and lifecycle stages. Examples include SlurmCordon, KernelDeadlock, InfiniBandLinkFault, GPUFallenOffBus, and DNSFailure. |
| Storage Performance Metrics | Read Bandwidth; Write Bandwidth; Average Request Time by Operation; Queue Time by Operation and VIP; Timeouts by Operation; Average Response Time by Operation; Transmissions; Retransmissions; Mean Queue Time by Mount Address |
| InfiniBand Interface Metrics | node_infiniband_port_data_received_bytes_total; node_infiniband_port_data_transmitted_bytes_total; node_infiniband_port_transmit_wait_total; node_infiniband_rate_bytes_per_second |
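The InfiniBand metric names ending in `_total` look like cumulative counters, so a throughput figure has to be derived from two samples. The sketch below shows one way to do that and to compare the result against the link rate. It assumes these metrics follow the Prometheus node_exporter convention (monotonically increasing byte counters, and `node_infiniband_rate_bytes_per_second` as the link capacity); the sample values and helper names are made up for illustration.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Average increase per second between two counter samples.

    Returns None when the interval is not positive or the counter went
    backwards, which usually indicates a reset (for example, a reboot).
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0 or curr_value < prev_value:
        return None
    return (curr_value - prev_value) / elapsed


def link_utilization(rate_bytes_per_s, link_rate_bytes_per_s):
    """Fraction of the link capacity used by the observed byte rate."""
    if link_rate_bytes_per_s <= 0:
        raise ValueError("link rate must be positive")
    return rate_bytes_per_s / link_rate_bytes_per_s


# Two made-up samples of node_infiniband_port_data_received_bytes_total, 30 s apart.
rx_rate = counter_rate(1_000_000_000, 0, 751_000_000_000, 30)

# Made-up value for node_infiniband_rate_bytes_per_second (50 GB/s link).
utilization = link_utilization(rx_rate, 50_000_000_000)
print(rx_rate, utilization)
```

Returning `None` on a counter reset, rather than a negative rate, keeps a rebooted node from showing up as a misleading throughput spike.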
HPC verification telemetry
CoreWeave’s high-performance computing (HPC) verification helps maintain high performance under sustained compute loads. HPC verification raises alerts for failure conditions such as the following:
- NodeVerificationFailure
- NodeVerificationFailureDCGMBadMemory
- NodeVerificationFailureGPUBlazeTLimit
- NodeVerificationFailureGPUBlazeThrottling
- NodeVerificationFailureGPUBurn
- NodeVerificationFailureGPUBurnTLimit
- NodeVerificationFailureMemBW
- NodeVerificationFailureNVFabric
- NodeVerificationFailureNVLink
- NodeVerificationFailureNonDeterminism
- NodeVerificationIncomplete
- NodeVerificationSRAMThresholdExceeded
Log telemetry from all HPC verification tests is also available to help you troubleshoot workload performance issues.
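Because the HPC verification alerts listed above share the `NodeVerification` prefix, they are easy to separate from other alerts when triaging. The sketch below groups them by node. It assumes alerts arrive as dictionaries with a `labels` mapping that contains `alertname` and `node` keys, in the style of Prometheus Alertmanager; that payload shape and the sample data are assumptions, not something this page specifies.

```python
from collections import defaultdict

HPC_PREFIX = "NodeVerification"


def hpc_alerts_by_node(alerts):
    """Group HPC verification alert names by the node they fired on."""
    by_node = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        name = labels.get("alertname", "")
        if name.startswith(HPC_PREFIX):
            # The "node" label name is an assumption; adjust to your payload.
            by_node[labels.get("node", "unknown")].append(name)
    return {node: sorted(names) for node, names in by_node.items()}


# Made-up alerts: two HPC verification failures and one unrelated alert.
sample_alerts = [
    {"labels": {"alertname": "NodeVerificationFailureNVLink", "node": "node-a"}},
    {"labels": {"alertname": "NodeVerificationFailureGPUBurn", "node": "node-a"}},
    {"labels": {"alertname": "DiskAlmostFull", "node": "node-b"}},
]

grouped = hpc_alerts_by_node(sample_alerts)
print(grouped)
```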
Together, these tools help you optimize applications, diagnose issues, and improve resource utilization on CKS.
Last modified on June 4, 2026