Running GPU workloads on Kubernetes is common for AI, ML, and deep learning applications, but monitoring those workloads isn’t always straightforward. If you’re using NVIDIA GPUs in your Kubernetes cluster, you’ll likely want to track utilization, memory, temperature, and power consumption in Prometheus.
This guide walks you through how to expose NVIDIA GPU metrics with DCGM Exporter and integrate them into the kube-prometheus-stack.
👉 Note: DCGM Exporter does not support time-sliced GPUs. Make sure to disable GPU time-slicing if you want accurate metrics.
Why Monitor NVIDIA GPU Metrics on Kubernetes?
When training or serving ML models, GPUs are often the most expensive resource in your cluster. Without visibility, you risk underutilization, bottlenecks, or overheating. By exporting GPU metrics to Prometheus, you can:
- Track GPU utilization and memory pressure
- Identify copy-bound workloads
- Monitor power usage and temperature for stability
- Set up Grafana dashboards and alerts for proactive monitoring
This ensures you’re getting the most out of your GPUs and keeping costs under control.
🔧 Prerequisites
Before starting, make sure your cluster has:
- Prometheus installed via kube-prometheus-stack
- NVIDIA GPU Operator installed, with GPUs available on nodes
- kubectl and helm configured for your Kubernetes cluster
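Optionally, sanity-check that the GPU Operator is healthy and that your nodes actually advertise GPU resources before going further. A quick sketch, assuming the operator was installed into the gpu-operator namespace (adjust if yours differs):

```
# Show node names alongside any nvidia.com/gpu capacity they advertise
kubectl describe nodes | grep -E "Name:|nvidia.com/gpu"

# Confirm the GPU Operator components are running (namespace is an assumption)
kubectl get pods -n gpu-operator
```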
⚡ Step 1: Install DCGM Exporter
There are two ways to deploy the exporter.
Option A: Install via Helm (recommended)
```
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts \
  && helm repo update \
  && helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```

Option B: Apply the upstream manifest (quick test)

```
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
```
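Before moving on, it's worth confirming the exporter pods are running and serving metrics. A minimal check, assuming the install above created a Service named dcgm-exporter in the current namespace:

```
# Confirm the exporter pods are up on your GPU nodes
kubectl get pods -l app.kubernetes.io/name=dcgm-exporter

# Port-forward the Service and fetch the raw metrics locally
kubectl port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```

If the curl returns DCGM_FI_DEV_GPU_UTIL samples, the exporter side is working and any remaining issues are on the Prometheus side.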
⚡ Step 2: Create a ServiceMonitor for Prometheus
Prometheus doesn't automatically scrape GPU metrics; you need a ServiceMonitor to tell it where to look.
First, inspect the dcgm-exporter Service:
```
kubectl get svc dcgm-exporter -o yaml
```

You should see something like:
```
...
ports:
  - name: metrics
    port: 9400
    targetPort: 9400
selector:
  app.kubernetes.io/name: dcgm-exporter
```

Now create a ServiceMonitor that matches your Prometheus release:
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-service-monitor
  namespace: default
  labels:
    release: prometheus # Must match your kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - default
```

👉 If your Prometheus Helm release name isn't prometheus, update the release label accordingly.
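Save the manifest (for example as dcgm-servicemonitor.yaml, a filename used here purely for illustration), apply it, and double-check which labels your Prometheus instance actually selects on:

```
kubectl apply -f dcgm-servicemonitor.yaml

# Verify the ServiceMonitor exists
kubectl get servicemonitor dcgm-exporter-service-monitor -n default

# Show the serviceMonitorSelector of each Prometheus instance in the cluster
kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.serviceMonitorSelector}{"\n"}{end}'
```

The release label in the ServiceMonitor must satisfy that selector, otherwise Prometheus will silently ignore it.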
⚡ Step 3: Validate GPU Metrics in Prometheus
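If your Prometheus UI isn't exposed outside the cluster, a port-forward is the quickest way in. The prometheus-operated Service is created by the Prometheus Operator; adjust the namespace to wherever your kube-prometheus-stack runs:

```
kubectl port-forward svc/prometheus-operated 9090:9090
# Then open http://localhost:9090 in your browser
```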
- Open the Prometheus UI → Status → Targets and confirm the dcgm-exporter target is UP.
- Run a few test queries:

```
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_GPU_TEMP
```

Key metrics explained:
- DCGM_FI_DEV_GPU_UTIL → GPU utilization % (how busy the GPU cores are)
- DCGM_FI_DEV_MEM_COPY_UTIL → Memory copy engine utilization %
- DCGM_FI_DEV_FB_USED → GPU memory used (MiB)
- DCGM_FI_DEV_POWER_USAGE → Power draw in watts
- DCGM_FI_DEV_GPU_TEMP → GPU temperature (°C)
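These metrics are reported per GPU. If you run DCGM Exporter with its Kubernetes pod mapping enabled, the series can also carry pod-level labels, which lets you attribute usage to workloads. The label names below are assumptions that depend on your exporter and scrape configuration, so check the labels on your own metrics first:

```
# GPU memory used per pod (label names vary by setup; these are assumed)
sum by (exported_namespace, exported_pod) (DCGM_FI_DEV_FB_USED)
```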
Example PromQL aggregations:
```
# Average GPU utilization by node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

# Total framebuffer memory used by node (MiB)
sum by (instance) (DCGM_FI_DEV_FB_USED)

# Cluster-wide average GPU utilization
avg(DCGM_FI_DEV_GPU_UTIL)
```
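Since the usual end goal is dashboards and alerts, here is a sketch of a PrometheusRule that fires when a GPU runs hot. The rule name, the 85°C threshold, and the release: prometheus label are assumptions to adapt to your own stack:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-alerts
  namespace: default
  labels:
    release: prometheus # Must match your kube-prometheus-stack release name
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GpuHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature above 85°C on {{ $labels.instance }}"
```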
⚡ Troubleshooting Tips
- Exporter target is DOWN → Check that the ServiceMonitor labels match your Prometheus release.
- No metrics available → Inspect the exporter pod logs and verify GPU Operator health (see the commands below).
- Namespace mismatch → Ensure your Prometheus Operator watches the correct namespace.
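A few commands that tend to surface the root cause quickly (the gpu-operator namespace is an assumption; adjust to your installation):

```
# Exporter logs often reveal driver or DCGM initialization problems
kubectl logs -l app.kubernetes.io/name=dcgm-exporter --tail=50

# Verify the GPU Operator components are healthy
kubectl get pods -n gpu-operator

# List ServiceMonitors across namespaces to spot label or namespace mismatches
kubectl get servicemonitors -A --show-labels
```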
⚡ Cleanup
If you need to remove the exporter:
```
helm uninstall dcgm-exporter

# Or if deployed via manifest
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

kubectl delete servicemonitor dcgm-exporter-service-monitor -n default
```
✅ Final Thoughts
With DCGM Exporter integrated into kube-prometheus-stack, you now have full visibility into NVIDIA GPU usage inside your Kubernetes cluster.
This setup allows you to build Grafana dashboards, configure alerts for overheating or low utilization, and ensure you’re getting maximum ROI from your GPU investments.
If you’re running ML or AI workloads at scale, this monitoring stack is a must-have for reliability and cost optimization.
