
Conversation

@surajssd (Member) commented Jan 14, 2026

Add DCGM Exporter Support for GPU Metrics Collection

This PR adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a configurable scrape target to collect GPU metrics from Kubernetes nodes in AKS clusters.

Summary

  • New scrape target: Added dcgm-exporter as a default scrape target for GPU metrics collection
  • Configurable: Disabled by default; can be enabled via ConfigMap settings (see the sketch after this list)
  • Optimized for dashboards: Includes cardinality reduction by dropping unused high-cardinality labels
  • Node affinity fix: Corrected invalid labelSelector syntax in DaemonSet configuration
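
A minimal sketch of how enabling might look in the metrics settings ConfigMap, assuming the existing `ama-metrics-settings-configmap` / `default-scrape-settings-enabled` convention used by the other default targets (the exact section and key names added by this PR may differ):

```yaml
# Hypothetical excerpt: enabling dcgm-exporter scraping via the settings ConfigMap.
# The ConfigMap name, section name, and key name follow the existing
# default-scrape-settings convention and are assumptions, not necessarily
# the exact keys introduced by this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  # dcgm-exporter scraping is disabled (false) by default; set to true to enable.
  default-scrape-settings-enabled: |-
    dcgmexporter = true
```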

Key Changes

  1. DCGM Exporter Configuration
    • Added dcgmExporterDefault.yml with Prometheus scrape config
    • Targets nodes with label kubernetes.azure.com/dcgm-exporter=enabled on port 19400
    • Default scrape interval: 30s when enabled
    • Minimal ingestion profile pattern: DCGM_.*
  2. Cardinality Optimization
    • Drops high-cardinality labels not used in default Grafana dashboards:
      • cluster, device, hostname, modelName, pci_bus_id, uuid, dcgm_fi_driver_version, microsoft.resourceid
    • Preserves essential labels: instance, gpu, job
  3. Configuration Integration
    • Updated all ConfigMap templates to include dcgmexporter settings
    • Added support for custom scrape intervals and metrics regex overrides
    • Integrated with existing telemetry and configuration parsing logic
    • Added Ginkgo e2e test coverage
  4. Bug Fix
    • Fixed invalid labelSelector field in DaemonSet nodeAffinity configuration
    • Moved matchExpressions to the correct location under nodeSelectorTerms (see the sketch below)
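
To make the bug fix in item 4 concrete, here is a sketch of the invalid versus corrected nodeAffinity structure (the match expression shown is illustrative only; the DaemonSet's actual expressions are not reproduced here):

```yaml
# Before (invalid): matchExpressions nested under a labelSelector field,
# which is not a valid field of nodeSelectorTerms entries.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - labelSelector:                  # invalid field
            matchExpressions:
              - key: kubernetes.io/os     # illustrative expression only
                operator: In
                values: ["linux"]

# After (valid): matchExpressions sits directly under each nodeSelectorTerm.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os       # illustrative expression only
              operator: In
              values: ["linux"]
```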

Testing

  • Added comprehensive e2e tests for custom scrape intervals and regex overrides (example override settings are sketched below)
  • Verified integration with the existing configuration processing pipeline
  • Updated test expectations to account for the new scrape target (11 jobs instead of 10)
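
For reference, the kind of overrides those tests exercise might look like the following in the settings ConfigMap (the section names follow the existing scrape-interval and keep-list conventions and, like the values shown, are assumptions rather than the PR's exact contents):

```yaml
# Hypothetical override excerpts for the dcgmexporter target.
data:
  # Override the default 30s scrape interval.
  default-targets-scrape-interval-settings: |-
    dcgmexporter = "60s"
  # Restrict ingestion to an explicit keep-list regex instead of DCGM_.*.
  default-targets-metrics-keep-list: |-
    dcgmexporter = "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED"
```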

DCGM Exporter Metrics Documentation

DCGM Exporter Metrics Collection

When DCGM exporter is enabled on GPU-enabled nodes (nodes labeled with kubernetes.azure.com/dcgm-exporter=enabled), the following metrics are collected from NVIDIA GPUs:
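
A hedged sketch of what the node-level scrape configuration in dcgmExporterDefault.yml could look like, based only on the node label, port, default interval, and keep regex described in this PR; the actual relabeling rules (for example, how the node address is rewritten) may differ:

```yaml
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Keep only nodes that opted in via the dcgm-exporter node label.
      - source_labels: [__meta_kubernetes_node_label_kubernetes_azure_com_dcgm_exporter]
        regex: enabled
        action: keep
      # Scrape the dcgm-exporter port on the node's internal IP (assumed relabeling).
      - source_labels: [__meta_kubernetes_node_address_InternalIP]
        regex: (.+)
        target_label: __address__
        replacement: $1:19400
    metric_relabel_configs:
      # Minimal ingestion profile: keep only DCGM series.
      - source_labels: [__name__]
        regex: DCGM_.*
        action: keep
```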

Performance and Utilization Metrics

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (MHz) - indicates GPU compute unit clock speed |
| DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (MHz) - indicates GPU memory subsystem speed |
| DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization percentage - overall GPU usage |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization percentage - memory bandwidth usage |
| DCGM_FI_DEV_ENC_UTIL | gauge | Encoder utilization percentage - video encoding engine usage |
| DCGM_FI_DEV_DEC_UTIL | gauge | Decoder utilization percentage - video decoding engine usage |

Temperature Monitoring

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (Celsius) - thermal monitoring for throttling prevention |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | Memory temperature (Celsius) - memory thermal monitoring |

Power and Energy

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_POWER_USAGE | gauge | Instantaneous power draw (Watts) - current power consumption |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | Total energy consumed since boot (millijoules) - cumulative energy tracking |

Memory Usage

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_FB_FREE | gauge | Framebuffer memory free (MiB) - available GPU memory |
| DCGM_FI_DEV_FB_USED | gauge | Framebuffer memory used (MiB) - allocated GPU memory |
| DCGM_FI_DEV_FB_RESERVED | gauge | Framebuffer memory reserved (MiB) - system-reserved GPU memory |

Hardware Health and Errors

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_XID_ERRORS | gauge | Last XID error code - GPU hardware/software error indicator |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | counter | PCIe retry count - PCIe link stability indicator |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | counter | Uncorrectable ECC error remapped rows - critical memory errors |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | counter | Correctable ECC error remapped rows - recoverable memory errors |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | gauge | Row remapping failure status - memory repair failure indicator |

NVLink Interconnect (Multi-GPU systems)

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | counter | Total NVLink bandwidth across all lanes - GPU-to-GPU communication throughput |

Datacenter Profiling (DCP) Metrics

Supported on NVIDIA Volta and newer datacenter GPUs

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | Graphics engine active ratio - compute pipeline utilization |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | Tensor core active ratio - AI/ML workload utilization |
| DCGM_FI_PROF_DRAM_ACTIVE | gauge | Memory interface active ratio - memory bandwidth utilization |
| DCGM_FI_PROF_PCIE_TX_BYTES | gauge | PCIe transmit throughput (bytes/sec) - host-to-GPU data transfer rate |
| DCGM_FI_PROF_PCIE_RX_BYTES | gauge | PCIe receive throughput (bytes/sec) - GPU-to-host data transfer rate |

vGPU Licensing (Virtualized environments)

| Metric | Type | Purpose |
|--------|------|---------|
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | gauge | vGPU license status - license validity for virtual GPU instances |

Static Metadata

| Metric | Type | Purpose |
|--------|------|---------|
| dcgm_fi_driver_version | label | NVIDIA driver version - appears as a label on other metrics for version tracking |

Note

The dcgm_fi_driver_version label is intentionally dropped in the default configuration to reduce cardinality, as driver version information can be obtained through other means.

Label Configuration

All DCGM metrics are collected with minimal labels to control cardinality:

Retained Labels:

  • instance - The Kubernetes node name where the GPU is located
  • gpu - The GPU device index on the node (0, 1, 2, etc.)
  • job - The Prometheus job name (dcgm-exporter)

Dropped Labels (for cardinality control):

  • uuid - Unique GPU identifier (extremely high cardinality)
  • device - Redundant with gpu label
  • modelName - GPU model name (static metadata)
  • hostname - Redundant with instance label
  • cluster - Cluster identifier
  • pci_bus_id - PCIe bus location
  • microsoft.resourceid - Azure-specific resource identifier
  • dcgm_fi_driver_version - Driver version (static metadata)
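
In Prometheus terms, this kind of label dropping is typically expressed as a labeldrop rule in metric_relabel_configs; a sketch is below. The actual rules in this PR may be split or ordered differently, and the microsoft.resourceid label is written here with an underscore since dots are not valid characters in Prometheus label names.

```yaml
metric_relabel_configs:
  # Drop labels that the default Grafana dashboards do not use.
  - action: labeldrop
    regex: cluster|device|hostname|modelName|pci_bus_id|uuid|dcgm_fi_driver_version|microsoft_resourceid
```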

Cardinality Analysis

Base Cardinality: Low to Medium

| Dimension | Scaling Factor |
|-----------|----------------|
| Nodes with GPUs | O(gpu_nodes) |
| GPUs per node | O(gpus_per_node) |
| Metrics per GPU | 27 metrics (32 with DCP metrics on newer GPUs) |

Total Time Series Calculation:
time_series = gpu_nodes × gpus_per_node × metrics_per_gpu

Example Scenarios:

| Cluster Size | GPU Nodes | GPUs/Node | Total Time Series |
|--------------|-----------|-----------|-------------------|
| Small | 5 | 1 | ~160 |
| Medium | 20 | 4 | ~2,560 |
| Large | 100 | 8 | ~25,600 |
| Extra Large | 500 | 8 | ~128,000 |
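
Worked example for the Medium row, using the DCP-inclusive metric count: 20 GPU nodes × 4 GPUs per node × 32 metrics per GPU = 2,560 time series.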

Cardinality Control Strategies Applied

  1. Label Reduction: Only 3 labels retained (instance, gpu, job) instead of 11+ available labels
  2. No High-Cardinality Identifiers: UUID and PCI bus ID labels are dropped
  3. No Dynamic Labels: Static metadata (model name, driver version) excluded from metric labels
  4. Node-Level Scraping: Metrics scraped per node, not per pod, reducing scrape targets

@rashmichandrashekar (Contributor) commented:

/azp run

@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch 4 times, most recently from b551d95 to 4de6e7a Compare January 17, 2026 00:55
@rashmichandrashekar (Contributor) commented Jan 21, 2026

Please update the test configmaps like in this PR - https://github.com/Azure/prometheus-collector/pull/1320/files#diff-1d1a1afb8777d533903a1c048f796d868b8182f9f321ec9eb617edde2af2669e

and update the target count in the tests in this file - https://github.com/Azure/prometheus-collector/blob/main/otelcollector/test/ginkgo-e2e/configprocessing/config_processing_test.go

For example, the following assertions should expect 11 instead of 10:

`Expect(len(prometheusConfig.ScrapeConfigs)).To(BeNumerically("==", 10))`

`Expect(len(prometheusConfig.ScrapeConfigs)).To(BeNumerically("==", 10))`

Your job should also be added in the arrays below.

@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 9eb8f41 to cfc742f Compare January 21, 2026 18:33
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from b7d2a06 to 892cce1 Compare January 22, 2026 19:14
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 892cce1 to cdcb034 Compare January 22, 2026 22:15
Remove incorrect `labelSelector` field from `nodeAffinity` configuration. The
`matchExpressions` should be directly under `nodeSelectorTerms`, not under a
`labelSelector` field.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a new default
scrape target to collect GPU metrics from Kubernetes nodes.

- Add `dcgmExporterDefault.yml` with Prometheus scrape config targeting nodes
  with label `kubernetes.azure.com/dcgm-exporter=enabled` on port 19400
- Update all configmap templates to include `dcgmexporter` settings (enabled
  flag, metrics keep list regex, scrape interval)
- Add minimal ingestion profile pattern (`DCGM_.*`) for `dcgmexporter` metrics
- Add Ginkgo e2e test coverage for `dcgmexporter` custom scrape intervals and
  regex overrides
- Default configuration: disabled by default, 30s scrape interval when enabled

This enables GPU metrics collection from DCGM exporters running on GPU-enabled
nodes in AKS clusters, following the same configuration patterns as the
existing default scrape targets.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Remove unused labels (`cluster`, `device`, `hostname`, `modelName`,
`pci_bus_id`, `uuid`, `dcgm_fi_driver_version`, `microsoft.resourceid`) to
reduce cardinality while preserving dashboard functionality that only requires
`instance` and `gpu` labels.

All the panels in the default DCGM Grafana dashboard use only the `instance`,
`gpu`, and `job` labels.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/add-support-to-scrape-dcgm branch from 56fca99 to 8a61e41 Compare January 23, 2026 21:15