Running GPU workloads on Kubernetes is common for AI, ML, and deep learning applications, but monitoring those workloads isn’t always straightforward. If you’re using NVIDIA GPUs in your Kubernetes cluster, you’ll likely want to track utilization, memory, temperature, and power consumption in Prometheus.
This guide walks you through how to expose NVIDIA GPU metrics with DCGM Exporter and integrate them into the kube-prometheus-stack.
👉 Note: DCGM Exporter does not support time-sliced GPUs. Make sure to disable GPU time-slicing if you want accurate metrics.
Why Monitor NVIDIA GPU Metrics on Kubernetes?
When training or serving ML models, GPUs are often the most expensive resource in your cluster. Without visibility, you risk underutilization, bottlenecks, or overheating. By exporting GPU metrics to Prometheus, you can:
- Track GPU utilization and memory pressure
- Identify copy-bound workloads
- Monitor power usage and temperature for stability
- Set up Grafana dashboards and alerts for proactive monitoring
This ensures you’re getting the most out of your GPUs and keeping costs under control.
🔧 Prerequisites
Before starting, make sure your cluster has:
- Prometheus installed via kube-prometheus-stack
- NVIDIA GPU Operator installed, with GPUs available on nodes
- kubectl and helm configured for your Kubernetes cluster
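Optionally, sanity-check that the GPU Operator is healthy and that your nodes actually advertise GPU resources before going further. A quick sketch, assuming the operator was installed into the gpu-operator namespace (adjust if yours differs):

```
# Show node names alongside any nvidia.com/gpu capacity they advertise
kubectl describe nodes | grep -E "Name:|nvidia.com/gpu"

# Confirm the GPU Operator components are running (namespace is an assumption)
kubectl get pods -n gpu-operator
```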
⚡ Step 1: Install DCGM Exporter
There are two ways to deploy the exporter.
Option A: Install via Helm (recommended)
```
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts \
  && helm repo update \
  && helm install dcgm-exporter gpu-helm-charts/dcgm-exporter
```

Option B: Apply the upstream manifest (quick test)

```
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
```
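Before moving on, it's worth confirming the exporter pods are running and serving metrics. A minimal check, assuming the install above created a Service named dcgm-exporter in the current namespace:

```
# Confirm the exporter pods are up on your GPU nodes
kubectl get pods -l app.kubernetes.io/name=dcgm-exporter

# Port-forward the Service and fetch the raw metrics locally
kubectl port-forward svc/dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```

If the curl returns DCGM_FI_DEV_GPU_UTIL samples, the exporter side is working and any remaining issues are on the Prometheus side.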
⚡ Step 2: Create a ServiceMonitor for Prometheus
Prometheus doesn't automatically scrape GPU metrics; you need a ServiceMonitor to tell it where to look.
First, inspect the dcgm-exporter Service:
```
kubectl get svc dcgm-exporter -o yaml
```

You should see something like:
```
...
ports:
  - name: metrics
    port: 9400
    targetPort: 9400
selector:
  app.kubernetes.io/name: dcgm-exporter
```

Now create a ServiceMonitor that matches your Prometheus release:
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-service-monitor
  namespace: default
  labels:
    release: prometheus # Must match your kube-prometheus-stack release name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
  endpoints:
    - port: metrics
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - default
```

👉 If your Prometheus Helm release name isn't prometheus, update the release label accordingly.
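Save the manifest (for example as dcgm-servicemonitor.yaml, a filename used here purely for illustration), apply it, and double-check which labels your Prometheus instance actually selects on:

```
kubectl apply -f dcgm-servicemonitor.yaml

# Verify the ServiceMonitor exists
kubectl get servicemonitor dcgm-exporter-service-monitor -n default

# Show the serviceMonitorSelector of each Prometheus instance in the cluster
kubectl get prometheus -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.serviceMonitorSelector}{"\n"}{end}'
```

The release label in the ServiceMonitor must satisfy that selector, otherwise Prometheus will silently ignore it.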
⚡ Step 3: Validate GPU Metrics in Prometheus
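If your Prometheus UI isn't exposed outside the cluster, a port-forward is the quickest way in. The prometheus-operated Service is created by the Prometheus Operator; adjust the namespace to wherever your kube-prometheus-stack runs:

```
kubectl port-forward svc/prometheus-operated 9090:9090
# Then open http://localhost:9090 in your browser
```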
- Open the Prometheus UI → Status → Targets and confirm the dcgm-exporter target is UP.
- Run a few test queries:

```
DCGM_FI_DEV_GPU_UTIL
DCGM_FI_DEV_MEM_COPY_UTIL
DCGM_FI_DEV_FB_USED
DCGM_FI_DEV_POWER_USAGE
DCGM_FI_DEV_GPU_TEMP
```

Key metrics explained:
- DCGM_FI_DEV_GPU_UTIL → GPU utilization % (how busy the GPU cores are)
- DCGM_FI_DEV_MEM_COPY_UTIL → Memory copy engine utilization %
- DCGM_FI_DEV_FB_USED → GPU memory used (MiB)
- DCGM_FI_DEV_POWER_USAGE → Power draw in watts
- DCGM_FI_DEV_GPU_TEMP → GPU temperature (°C)
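These metrics are reported per GPU. If you run DCGM Exporter with its Kubernetes pod mapping enabled, the series can also carry pod-level labels, which lets you attribute usage to workloads. The label names below are assumptions that depend on your exporter and scrape configuration, so check the labels on your own metrics first:

```
# GPU memory used per pod (label names vary by setup; these are assumed)
sum by (exported_namespace, exported_pod) (DCGM_FI_DEV_FB_USED)
```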
Example PromQL aggregations:
```
# Average GPU utilization by node
avg by (instance) (DCGM_FI_DEV_GPU_UTIL)

# Total framebuffer memory used by node (MiB)
sum by (instance) (DCGM_FI_DEV_FB_USED)

# Cluster-wide average GPU utilization
avg(DCGM_FI_DEV_GPU_UTIL)
```
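Since the usual end goal is dashboards and alerts, here is a sketch of a PrometheusRule that fires when a GPU runs hot. The rule name, the 85°C threshold, and the release: prometheus label are assumptions to adapt to your own stack:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-alerts
  namespace: default
  labels:
    release: prometheus # Must match your kube-prometheus-stack release name
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GpuHighTemperature
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU temperature above 85°C on {{ $labels.instance }}"
```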
⚡ Troubleshooting Tips
- Exporter target is DOWN → Check that the ServiceMonitor labels match your Prometheus release.
- No metrics available → Inspect the exporter pod logs and verify GPU Operator health (see the commands below).
- Namespace mismatch → Ensure your Prometheus Operator watches the correct namespace.
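A few commands that tend to surface the root cause quickly (the gpu-operator namespace is an assumption; adjust to your installation):

```
# Exporter logs often reveal driver or DCGM initialization problems
kubectl logs -l app.kubernetes.io/name=dcgm-exporter --tail=50

# Verify the GPU Operator components are healthy
kubectl get pods -n gpu-operator

# List ServiceMonitors across namespaces to spot label or namespace mismatches
kubectl get servicemonitors -A --show-labels
```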
⚡ Cleanup
If you need to remove the exporter:
```
helm uninstall dcgm-exporter

# Or if deployed via manifest
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

kubectl delete servicemonitor dcgm-exporter-service-monitor -n default
```
✅ Final Thoughts
With DCGM Exporter integrated into kube-prometheus-stack, you now have full visibility into NVIDIA GPU usage inside your Kubernetes cluster.
This setup allows you to build Grafana dashboards, configure alerts for overheating or low utilization, and ensure you’re getting maximum ROI from your GPU investments.
If you’re running ML or AI workloads at scale, this monitoring stack is a must-have for reliability and cost optimization.
