
Conversation

@surajssd (Member) commented Jan 14, 2026

Add DCGM Exporter Support for GPU Metrics Collection

This PR adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a configurable scrape target to collect GPU metrics from Kubernetes nodes in AKS clusters.

Summary

  • New scrape target: Added dcgm-exporter as a default scrape target for GPU metrics collection
  • Configurable: Disabled by default; can be enabled via ConfigMap settings (see the sketch after this list)
  • Optimized for dashboards: Includes cardinality reduction by dropping unused high-cardinality labels
  • Node affinity fix: Corrected invalid labelSelector syntax in DaemonSet configuration
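
A minimal sketch of how enabling might look in the metrics settings ConfigMap, assuming the existing `ama-metrics-settings-configmap` / `default-scrape-settings-enabled` convention used by the other default targets (the exact section and key names added by this PR may differ):

```yaml
# Hypothetical excerpt: enabling dcgm-exporter scraping via the settings ConfigMap.
# The ConfigMap name, section name, and key name follow the existing
# default-scrape-settings convention and are assumptions, not necessarily
# the exact keys introduced by this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  # dcgm-exporter scraping is disabled (false) by default; set to true to enable.
  default-scrape-settings-enabled: |-
    dcgmexporter = true
```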

Key Changes

  1. DCGM Exporter Configuration
    • Added dcgmExporterDefault.yml with Prometheus scrape config
    • Targets nodes with label kubernetes.azure.com/dcgm-exporter=enabled on port 19400
    • Default scrape interval: 30s when enabled
    • Minimal ingestion profile pattern: DCGM_.*
  2. Cardinality Optimization
    • Drops high-cardinality labels not used in default Grafana dashboards:
      • cluster, device, hostname, modelName, pci_bus_id, uuid, dcgm_fi_driver_version, microsoft.resourceid
    • Preserves essential labels: instance, gpu, job
  3. Configuration Integration
    • Updated all ConfigMap templates to include dcgmexporter settings
    • Added support for custom scrape intervals and metrics regex overrides
    • Integrated with existing telemetry and configuration parsing logic
    • Added Ginkgo e2e test coverage
  4. Bug Fix
    • Fixed invalid labelSelector field in DaemonSet nodeAffinity configuration
    • Moved matchExpressions to the correct location under nodeSelectorTerms (see the sketch below)
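
To make the bug fix in item 4 concrete, here is a sketch of the invalid versus corrected nodeAffinity structure (the match expression shown is illustrative only; the DaemonSet's actual expressions are not reproduced here):

```yaml
# Before (invalid): matchExpressions nested under a labelSelector field,
# which is not a valid field of nodeSelectorTerms entries.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - labelSelector:                  # invalid field
            matchExpressions:
              - key: kubernetes.io/os     # illustrative expression only
                operator: In
                values: ["linux"]

# After (valid): matchExpressions sits directly under each nodeSelectorTerm.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os       # illustrative expression only
              operator: In
              values: ["linux"]
```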

Testing

  • Added comprehensive e2e tests for custom scrape intervals and regex overrides (example override settings are sketched below)
  • Verified integration with the existing configuration processing pipeline
  • Updated test expectations to account for the new scrape target (11 jobs instead of 10)
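
For reference, the kind of overrides those tests exercise might look like the following in the settings ConfigMap (the section names follow the existing scrape-interval and keep-list conventions and, like the values shown, are assumptions rather than the PR's exact contents):

```yaml
# Hypothetical override excerpts for the dcgmexporter target.
data:
  # Override the default 30s scrape interval.
  default-targets-scrape-interval-settings: |-
    dcgmexporter = "60s"
  # Restrict ingestion to an explicit keep-list regex instead of DCGM_.*.
  default-targets-metrics-keep-list: |-
    dcgmexporter = "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED"
```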

DCGM Exporter Metrics Documentation

DCGM Exporter Metrics Collection

When DCGM exporter is enabled on GPU-enabled nodes (nodes labeled with kubernetes.azure.com/dcgm-exporter=enabled), the following metrics are collected from NVIDIA GPUs:
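
A hedged sketch of what the node-level scrape configuration in dcgmExporterDefault.yml could look like, based only on the node label, port, default interval, and keep regex described in this PR; the actual relabeling rules (for example, how the node address is rewritten) may differ:

```yaml
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Keep only nodes that opted in via the dcgm-exporter node label.
      - source_labels: [__meta_kubernetes_node_label_kubernetes_azure_com_dcgm_exporter]
        regex: enabled
        action: keep
      # Scrape the dcgm-exporter port on the node's internal IP (assumed relabeling).
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        regex: (.+)
        target_label: __address__
        replacement: $1:19400
    metric_relabel_configs:
      # Minimal ingestion profile: keep only DCGM series.
      - source_labels: [__name__]
        regex: DCGM_.*
        action: keep
```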

Performance and Utilization Metrics

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (MHz) - indicates GPU compute unit clock speed |
| DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (MHz) - indicates GPU memory subsystem speed |
| DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization percentage - overall GPU usage |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization percentage - memory bandwidth usage |
| DCGM_FI_DEV_ENC_UTIL | gauge | Encoder utilization percentage - video encoding engine usage |
| DCGM_FI_DEV_DEC_UTIL | gauge | Decoder utilization percentage - video decoding engine usage |

Temperature Monitoring

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (Celsius) - thermal monitoring for throttling prevention |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (Celsius) - memory thermal monitoring |

Power and Energy

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_POWER_USAGE | gauge | Instantaneous power draw (Watts) - current power consumption |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | Total energy consumed since boot (millijoules) - cumulative energy tracking |

Memory Usage

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_FB_FREE | gauge | Framebuffer memory free (MiB) - available GPU memory |
| DCGM_FI_DEV_FB_USED | gauge | Framebuffer memory used (MiB) - allocated GPU memory |
| DCGM_FI_DEV_FB_RESERVED | gauge | Framebuffer memory reserved (MiB) - system-reserved GPU memory |

Hardware Health and Errors

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_XID_ERRORS | gauge | Last XID error code - GPU hardware/software error indicator |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | PCIe retry count - PCIe link stability indicator |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | counter | Uncorrectable ECC error remapped rows - critical memory errors |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | counter | Correctable ECC error remapped rows - recoverable memory errors |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | Row remapping failure status - memory repair failure indicator |

NVLink Interconnect (Multi-GPU systems)

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | Total NVLink bandwidth across all lanes - GPU-to-GPU communication throughput |

Datacenter Profiling (DCP) Metrics

Supported on NVIDIA Volta and newer datacenter GPUs

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | Graphics engine active ratio - compute pipeline utilization |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | Tensor core active ratio - AI/ML workload utilization |
| DCGM_FI_PROF_DRAM_ACTIVE | gauge | Memory interface active ratio - memory bandwidth utilization |
| DCGM_FI_PROF_PCIE_TX_BYTES | gauge | PCIe transmit throughput (bytes/sec) - host-to-GPU data transfer rate |
| DCGM_FI_PROF_PCIE_RX_BYTES | gauge | PCIe receive throughput (bytes/sec) - GPU-to-host data transfer rate |

vGPU Licensing (Virtualized environments)

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | gauge | vGPU license status - license validity for virtual GPU instances |

Static Metadata

| Metric | Type | Purpose |
|--------|------|---------|
| dcgm_fi_driver_version | label | NVIDIA driver version - appears as a label on other metrics for version tracking |

Note

The dcgm_fi_driver_version label is intentionally dropped in the default configuration to reduce cardinality, as driver version information can be obtained through other means.

Label Configuration

All DCGM metrics are collected with minimal labels to control cardinality:

Retained Labels:

  • instance - The Kubernetes node name where the GPU is located
  • gpu - The GPU device index on the node (0, 1, 2, etc.)
  • job - The Prometheus job name (dcgm-exporter)

Dropped Labels (for cardinality control):

  • uuid - Unique GPU identifier (extremely high cardinality)
  • device - Redundant with gpu label
  • modelName - GPU model name (static metadata)
  • hostname - Redundant with instance label
  • cluster - Cluster identifier
  • pci_bus_id - PCIe bus location
  • microsoft.resourceid - Azure-specific resource identifier
  • dcgm_fi_driver_version - Driver version (static metadata)
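
In Prometheus terms, this kind of label dropping is typically expressed as a labeldrop rule in metric_relabel_configs; a sketch is below. The actual rules in this PR may be split or ordered differently, and the microsoft.resourceid label is written here with an underscore since dots are not valid characters in Prometheus label names.

```yaml
metric_relabel_configs:
  # Drop labels that the default Grafana dashboards do not use.
  - action: labeldrop
    regex: cluster|device|hostname|modelName|pci_bus_id|uuid|dcgm_fi_driver_version|microsoft_resourceid
```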

Cardinality Analysis

Base Cardinality: Low to Medium

| Dimension | Scaling Factor |
|-----------|----------------|
| Nodes with GPUs | O(gpu_nodes) |
| GPUs per node | O(gpus_per_node) |
| Metrics per GPU | 27 metrics (32 with DCP metrics on newer GPUs) |

Total Time Series Calculation:
time_series = gpu_nodes × gpus_per_node × metrics_per_gpu

Example Scenarios:

| Cluster Size | GPU Nodes | GPUs/Node | Total Time Series |
|--------------|-----------|-----------|-------------------|
| Small | 5 | 1 | ~160 |
| Medium | 20 | 4 | ~2,560 |
| Large | 100 | 8 | ~25,600 |
| Extra Large | 500 | 8 | ~128,000 |
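
Worked example for the Medium row, using the DCP-inclusive metric count: 20 GPU nodes × 4 GPUs per node × 32 metrics per GPU = 2,560 time series.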

Cardinality Control Strategies Applied

  1. Label Reduction: Only 3 labels retained (instance, gpu, job) instead of 11+ available labels
  2. No High-Cardinality Identifiers: UUID and PCI bus ID labels are dropped
  3. No Dynamic Labels: Static metadata (model name, driver version) excluded from metric labels
  4. Node-Level Scraping: Metrics scraped per node, not per pod, reducing scrape targets

@rashmichandrashekar (Contributor) commented:

/azp run

@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch 4 times, most recently from b551d95 to 4de6e7a Compare January 17, 2026 00:55
@rashmichandrashekar (Contributor) commented Jan 21, 2026

Please update the test configmaps like in this PR - https://github.com/Azure/prometheus-collector/pull/1320/files#diff-1d1a1afb8777d533903a1c048f796d868b8182f9f321ec9eb617edde2af2669e

and update the target count in the tests in this file - https://github.com/Azure/prometheus-collector/blob/main/otelcollector/test/ginkgo-e2e/configprocessing/config_processing_test.go

For example, the following assertions should expect 11 instead of 10:

`Expect(len(prometheusConfig.ScrapeConfigs)).To(BeNumerically("==", 10))`

`Expect(len(prometheusConfig.ScrapeConfigs)).To(BeNumerically("==", 10))`

Your job should also be added in the arrays below.

@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 9eb8f41 to cfc742f Compare January 21, 2026 18:33
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from b7d2a06 to 892cce1 Compare January 22, 2026 19:14
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 892cce1 to cdcb034 Compare January 22, 2026 22:15
Remove incorrect `labelSelector` field from `nodeAffinity` configuration. The
`matchExpressions` should be directly under `nodeSelectorTerms`, not under a
`labelSelector` field.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a new default
scrape target to collect GPU metrics from Kubernetes nodes.

- Add `dcgmExporterDefault.yml` with Prometheus scrape config targeting nodes
  with label `kubernetes.azure.com/dcgm-exporter=enabled` on port 19400
- Update all configmap templates to include `dcgmexporter` settings (enabled
  flag, metrics keep list regex, scrape interval)
- Add minimal ingestion profile pattern (`DCGM_.*`) for `dcgmexporter` metrics
- Add Ginkgo e2e test coverage for `dcgmexporter` custom scrape intervals and
  regex overrides
- Default configuration: disabled by default, 30s scrape interval when enabled

This enables GPU metrics collection from DCGM exporters running on GPU-enabled
nodes in AKS clusters, following the same configuration patterns as the
existing default scrape targets.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Remove unused labels (`cluster`, `device`, `hostname`, `modelName`,
`pci_bus_id`, `uuid`, `dcgm_fi_driver_version`, `microsoft.resourceid`) to
reduce cardinality while preserving dashboard functionality that only requires
`instance` and `gpu` labels.

All the panels in the default DCGM Grafana dashboard use only the `instance`,
`gpu`, and `job` labels.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 56fca99 to 8a61e41 Compare January 23, 2026 21:15