How do I monitor GPU usage per Pod?

CKS collects GPU metrics automatically through the dcgm-exporter cluster component, which gathers NVIDIA DCGM metrics from every GPU Node. The Pod Inspector dashboard in Managed Grafana shows per-Pod CPU, memory, network, and logs, alongside GPU Usage and GPU Memory Usage panels that reflect the GPU on the Pod’s Node. You can filter the dashboard to a Pod in any namespace. The GPU panels show no data for SUNK Slurm slurmd Pods, because Slurm jobs run in separate Node cgroups; for GPU usage on Slurm jobs, use the Slurm Job Metrics dashboard instead. To query the underlying DCGM series directly with PromQL (for example, DCGM_FI_DEV_GPU_UTIL for utilization and DCGM_FI_DEV_FB_USED for framebuffer memory), use the metrics API.
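As a rough illustration of a direct PromQL query, the sketch below builds a per-Pod selector for a DCGM series and sends it as an instant query. It rests on several assumptions that are not confirmed above: that the metrics API exposes a Prometheus-compatible /api/v1/query path, that it accepts a bearer token, and that the DCGM series carry namespace and pod labels. The endpoint URL, token, and Pod name are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholders: substitute your own metrics API endpoint and token.
METRICS_API_URL = "https://metrics.example.invalid"
API_TOKEN = "REPLACE_ME"


def build_query(metric: str, namespace: str, pod: str) -> str:
    """Build a PromQL expression averaging `metric` across one Pod's GPUs.

    Assumption: the series are labeled with `namespace` and `pod`.
    """
    return f'avg({metric}{{namespace="{namespace}", pod="{pod}"}})'


def build_url(base_url: str, query: str) -> str:
    """Build an instant-query URL.

    Assumption: the API follows the Prometheus HTTP API path /api/v1/query.
    """
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url: str, token: str, query: str, timeout: float = 10.0) -> list:
    """Run the query and return the result list from a Prometheus-style response."""
    request = urllib.request.Request(
        build_url(base_url, query),
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is an assumption
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # "default" and "trainer-0" are illustrative names only.
    for metric in ("DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_FB_USED"):
        promql = build_query(metric, "default", "trainer-0")
        for sample in instant_query(METRICS_API_URL, API_TOKEN, promql):
            # Each sample's "value" is a [timestamp, "value-as-string"] pair.
            print(metric, sample["value"][1])
```

If the series in your cluster use different label names, adjust the selector in build_query accordingly. Averaging is only one choice: dropping avg() returns one series per GPU attributed to the Pod.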

Administrator