
Hardware Observability

Understand hardware observability features

CKS provides deep hardware transparency and observability with out-of-the-box telemetry. Hardware observability is available immediately upon deployment, requires no customer setup or configuration, and includes the following:

  • Managed Telemetry Platform: CoreWeave manages a high-performance platform for ingesting, storing, and exposing telemetry data.
  • Purpose-Built Dashboards: CoreWeave provides curated dashboards built on its expertise in managing large GPU supercomputing fleets.
  • Cost-Free Access: All observability features are included at no additional charge for CKS customers.

Hardware telemetry layers

This section details the telemetry available across different layers of the hardware stack.

GPU chip telemetry

CKS provides chip-level metrics via NVIDIA's Data Center GPU Manager (DCGM), which is pre-installed on all servers. Developers can query these metrics through CoreWeave's metrics query service or view them in the Node Details dashboard.

These metrics help identify failures, performance issues, and power inefficiencies. Use them to optimize job performance and resource utilization. CoreWeave provides pre-configured alerts for critical thresholds.
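As an illustration of querying chip-level metrics, the following is a minimal sketch, not CoreWeave's documented client. It assumes the metrics query service accepts Prometheus-style instant queries at `/api/v1/query` and that GPU temperature is exposed under the DCGM exporter name `DCGM_FI_DEV_GPU_TEMP` with `Hostname` and `gpu` labels. The base URL, the 85 °C limit, and the sample payload are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the query URL for your environment.
BASE_URL = "https://metrics.example.invalid"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL in the Prometheus HTTP API style."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def gpus_over_limit(response: dict, limit_celsius: float) -> list:
    """Return (hostname, gpu, temperature) for each sample above the limit."""
    hot = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # sample values arrive as strings
        temperature = float(value)
        if temperature > limit_celsius:
            hot.append((labels.get("Hostname", "?"), labels.get("gpu", "?"), temperature))
    return hot


# Made-up payload shaped like a Prometheus instant-vector response.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-a", "gpu": "0"}, "value": [1700000000, "64"]},
            {"metric": {"Hostname": "node-a", "gpu": "1"}, "value": [1700000000, "91"]},
        ],
    },
}

if __name__ == "__main__":
    print(build_instant_query_url(BASE_URL, "DCGM_FI_DEV_GPU_TEMP"))
    print(gpus_over_limit(SAMPLE_RESPONSE, limit_celsius=85.0))
```

The parsing step is kept separate from the HTTP call so it can be exercised against a saved response; authentication and the actual request are left out because they depend on the environment.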

Server telemetry

CKS offers transparency at the physical server level, providing metrics and statuses for the entire server.

Category | Details
IPMI Metrics
  • Electrical current
  • Power
  • Power consumption (server-wide)
  • Voltage
  • Temperature
  • Fan speed
  • Fan speed ratio
  • System Event Log (SEL) free space
  • System Event Log (SEL) log count
IPMI Statuses
  • Chassis power state
  • Electrical current sensor state
  • Fan speed sensor state
  • Power sensor state
  • Cable/interconnect VGA cable presence
  • Power supply status
  • Serial cable status
  • PCIe slot critical interrupt
  • Physical security intrusion
  • Fan redundancy state
  • CMOS battery state
  • Temperature sensor state
  • Voltage sensor state
Node Problem and Lifecycle Conditions
  • A set of conditions from kernel logs and temperature metrics that offer insight into node problems and lifecycle stages. Examples include SlurmCordon, KernelDeadlock, InfiniBandLinkFault, GPUFallenOffBus, and DNSFailure.
Storage Performance Metrics
  • Read Bandwidth
  • Write Bandwidth
  • Average Request Time by Operation
  • Queue Time by Operation and VIP
  • Timeouts by Operation
  • Average Response Time by Operation
  • Transmissions
  • Retransmissions
  • Mean Queue Time by Mount Address
InfiniBand Interface Metrics
  • node_infiniband_port_data_received_bytes_total
  • node_infiniband_port_data_transmitted_bytes_total
  • node_infiniband_port_transmit_wait_total
  • node_infiniband_rate_bytes_per_second
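The InfiniBand counters above can be turned into a link utilization figure. The sketch below assumes, based on the metric names, that the `*_bytes_total` series are monotonically increasing byte counters and that `node_infiniband_rate_bytes_per_second` reports the port's maximum rate. The sample values are made up.

```python
def counter_rate(start: float, end: float, window_seconds: float) -> float:
    """Average per-second increase of a monotonic counter over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if end < start:
        # The counter was reset inside the window (for example, an exporter
        # restart), so only the post-reset value is known to be new traffic.
        return end / window_seconds
    return (end - start) / window_seconds


def link_utilization(throughput_bytes_per_s: float, link_rate_bytes_per_s: float) -> float:
    """Fraction of the port's maximum rate in use (0.0 to 1.0)."""
    if link_rate_bytes_per_s <= 0:
        raise ValueError("link rate must be positive")
    return throughput_bytes_per_s / link_rate_bytes_per_s


if __name__ == "__main__":
    # Two made-up samples of node_infiniband_port_data_received_bytes_total, 60 s apart.
    rx_rate = counter_rate(1_000_000_000_000, 2_500_000_000_000, 60)
    # Made-up node_infiniband_rate_bytes_per_second value (a 400 Gb/s port is 50 GB/s).
    print(link_utilization(rx_rate, 50_000_000_000))
```

In a Prometheus-compatible query language, the same idea would likely be written as `rate(node_infiniband_port_data_received_bytes_total[5m]) / node_infiniband_rate_bytes_per_second`, assuming both series carry matching labels.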

HPC verification telemetry

CoreWeave's High Performance Computing (HPC) Verification ensures high performance under sustained compute loads. Customers receive alerts for failure conditions detected during HPC verification, such as:

  • NodeVerificationFailure
  • NodeVerificationFailureDCGMBadMemory
  • NodeVerificationFailureGPUBlazeTLimit
  • NodeVerificationFailureGPUBlazeThrottling
  • NodeVerificationFailureGPUBurn
  • NodeVerificationFailureGPUBurnTLimit
  • NodeVerificationFailureMemBW
  • NodeVerificationFailureNVFabric
  • NodeVerificationFailureNVLink
  • NodeVerificationFailureNonDeterminism
  • NodeVerificationIncomplete
  • NodeVerificationSRAMThresholdExceeded

Additionally, log telemetry from all HPC verification tests is available, aiding in troubleshooting workload performance issues.
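Because most of the alert names listed above share the `NodeVerificationFailure` prefix, firing alerts can be grouped by node to see which checks failed. This is a minimal sketch: the alert records are hypothetical dictionaries, and the `alertname` and `node` keys are assumptions about how alerts are labeled, not a documented schema.

```python
from collections import defaultdict

FAILURE_PREFIX = "NodeVerificationFailure"


def failed_check(alert_name: str):
    """Return the check suffix for a failure alert, or None for other alerts."""
    if not alert_name.startswith(FAILURE_PREFIX):
        return None
    suffix = alert_name[len(FAILURE_PREFIX):]
    # The bare "NodeVerificationFailure" alert names no specific check.
    return suffix or "Unspecified"


def failures_by_node(alerts: list) -> dict:
    """Map each node to the sorted list of verification checks it failed."""
    grouped = defaultdict(list)
    for alert in alerts:
        check = failed_check(alert["alertname"])
        if check is not None:
            grouped[alert["node"]].append(check)
    return {node: sorted(checks) for node, checks in grouped.items()}


if __name__ == "__main__":
    # Hypothetical alert records; node names are made up.
    alerts = [
        {"alertname": "NodeVerificationFailureNVLink", "node": "node-a"},
        {"alertname": "NodeVerificationFailureGPUBurn", "node": "node-a"},
        {"alertname": "NodeVerificationIncomplete", "node": "node-b"},
        {"alertname": "NodeVerificationFailure", "node": "node-c"},
    ]
    print(failures_by_node(alerts))
```

Alerts without the failure prefix, such as NodeVerificationIncomplete, are skipped here; a real triage script would probably track them separately.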

Together, these hardware observability tools help developers optimize their applications, diagnose issues, and maximize utilization and efficiency on CoreWeave Kubernetes Service (CKS).