Add DCGM exporter support for GPU metrics collection #1391
base: main
/azp run
...ollector/deploy/addon-chart/azure-monitor-metrics-addon/templates/ama-metrics-daemonset.yaml
otelcollector/configmapparser/default-prom-configs/dcgmExporterDefault.yml
b551d95 to 4de6e7a
otelcollector/shared/configmap/mp/tomlparser-default-scrape-settings.go
Outdated
otelcollector/shared/configmap/mp/tomlparser-default-scrape-settings.go
Outdated
otelcollector/shared/configmap/mp/tomlparser-default-targets-metrics-keep-list.go
Please update the test configmaps as in this PR - https://github.com/Azure/prometheus-collector/pull/1320/files#diff-1d1a1afb8777d533903a1c048f796d868b8182f9f321ec9eb617edde2af2669e and update the target count in the tests in this file - https://github.com/Azure/prometheus-collector/blob/main/otelcollector/test/ginkgo-e2e/configprocessing/config_processing_test.go. For example, the following should have 11 instead of 10:
prometheus-collector/otelcollector/test/ginkgo-e2e/configprocessing/config_processing_test.go Line 301 in c03fbba
prometheus-collector/otelcollector/test/ginkgo-e2e/configprocessing/config_processing_test.go Line 389 in c03fbba
Your job should also be added to the arrays below.
9eb8f41 to cfc742f
b7d2a06 to 892cce1
892cce1 to cdcb034
3ce1b50 to 56fca99
Remove incorrect `labelSelector` field from `nodeAffinity` configuration. The `matchExpressions` should be directly under `nodeSelectorTerms`, not under a `labelSelector` field. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a new default scrape target to collect GPU metrics from Kubernetes nodes.
- Add `dcgmExporterDefault.yml` with Prometheus scrape config targeting nodes with label `kubernetes.azure.com/dcgm-exporter=enabled` on port 19400
- Update all configmap templates to include `dcgmexporter` settings (enabled flag, metrics keep list regex, scrape interval)
- Add minimal ingestion profile pattern (`DCGM_.*`) for `dcgmexporter` metrics
- Add Ginkgo e2e test coverage for `dcgmexporter` custom scrape intervals and regex overrides
- Default configuration: disabled by default, 30s scrape interval when enabled
This enables GPU metrics collection from DCGM exporters running on GPU-enabled nodes in AKS clusters, following the same configuration patterns.
Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Remove unused labels (`cluster`, `device`, `hostname`, `modelName`, `pci_bus_id`, `uuid`, `dcgm_fi_driver_version`, `microsoft.resourceid`) to reduce cardinality while preserving dashboard functionality. All the panels in the default DCGM Grafana dashboard use only the `instance`, `gpu` and `job` labels.
Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
56fca99 to 8a61e41
Add DCGM Exporter Support for GPU Metrics Collection
This PR adds support for NVIDIA DCGM (Data Center GPU Manager) exporter as a configurable scrape target to collect GPU metrics from Kubernetes nodes in AKS clusters.
Summary
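- Adds `dcgm-exporter` as a default scrape target for GPU metrics collection
- Fixes `labelSelector` syntax in DaemonSet configuration
Key Changes
- Add `dcgmExporterDefault.yml` with Prometheus scrape config
- Target nodes labeled `kubernetes.azure.com/dcgm-exporter=enabled` on port 19400
- Minimal ingestion profile pattern: `DCGM_.*`
- Dropped labels: `cluster`, `device`, `hostname`, `modelName`, `pci_bus_id`, `uuid`, `dcgm_fi_driver_version`, `microsoft.resourceid`
- Retained labels: `instance`, `gpu`, `job`
- Update configmap templates with `dcgmexporter` settings
- Remove incorrect `labelSelector` field in DaemonSet `nodeAffinity` configuration
- Move `matchExpressions` to correct location under `nodeSelectorTerms`
Testing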
DCGM Exporter Metrics Documentation
DCGM Exporter Metrics Collection
When DCGM exporter is enabled on GPU-enabled nodes (nodes labeled with `kubernetes.azure.com/dcgm-exporter=enabled`), the following metrics are collected from NVIDIA GPUs:
Performance and Utilization Metrics
- `DCGM_FI_DEV_SM_CLOCK`
- `DCGM_FI_DEV_MEM_CLOCK`
- `DCGM_FI_DEV_GPU_UTIL`
- `DCGM_FI_DEV_MEM_COPY_UTIL`
- `DCGM_FI_DEV_ENC_UTIL`
- `DCGM_FI_DEV_DEC_UTIL`
Temperature Monitoring
- `DCGM_FI_DEV_GPU_TEMP`
- `DCGM_FI_DEV_MEMORY_TEMP`
Power and Energy
- `DCGM_FI_DEV_POWER_USAGE`
- `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION`
Memory Usage
- `DCGM_FI_DEV_FB_FREE`
- `DCGM_FI_DEV_FB_USED`
- `DCGM_FI_DEV_FB_RESERVED`
Hardware Health and Errors
- `DCGM_FI_DEV_XID_ERRORS`
- `DCGM_FI_DEV_PCIE_REPLAY_COUNTER`
- `DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS`
- `DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS`
- `DCGM_FI_DEV_ROW_REMAP_FAILURE`
NVLink Interconnect (Multi-GPU systems)
- `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL`
Datacenter Profiling (DCP) Metrics
Supported on NVIDIA datacenter Volta GPUs and newer
- `DCGM_FI_PROF_GR_ENGINE_ACTIVE`
- `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`
- `DCGM_FI_PROF_DRAM_ACTIVE`
- `DCGM_FI_PROF_PCIE_TX_BYTES`
- `DCGM_FI_PROF_PCIE_RX_BYTES`
vGPU Licensing (Virtualized environments)
- `DCGM_FI_DEV_VGPU_LICENSE_STATUS`
Static Metadata
- `dcgm_fi_driver_version`
Note
The `dcgm_fi_driver_version` label is intentionally dropped in the default configuration to reduce cardinality, as driver version information can be obtained through other means.
Label Configuration
All DCGM metrics are collected with minimal labels to control cardinality:
Retained Labels:
- `instance` - The Kubernetes node name where the GPU is located
- `gpu` - The GPU device index on the node (0, 1, 2, etc.)
- `job` - The Prometheus job name (`dcgm-exporter`)
Dropped Labels (for cardinality control):
- `uuid` - Unique GPU identifier (extremely high cardinality)
- `device` - Redundant with `gpu` label
- `modelName` - GPU model name (static metadata)
- `hostname` - Redundant with `instance` label
- `cluster` - Cluster identifier
- `pci_bus_id` - PCIe bus location
- `microsoft.resourceid` - Azure-specific resource identifier
- `dcgm_fi_driver_version` - Driver version (static metadata)
Cardinality Analysis
Base Cardinality: Low to Medium
Total Time Series Calculation:
time_series = gpu_nodes × gpus_per_node × metrics_per_gpu
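As a rough worked example of the formula above, the Go sketch below computes the series count for a few cluster sizes. The cluster shapes are hypothetical and not taken from this PR, and `metricsPerGPU = 25` is an assumption obtained by counting the `DCGM_*` metrics listed in this description; the real number depends on which metrics the exporter actually emits on a given GPU.

```go
package main

import "fmt"

// timeSeries applies the formula from the description:
// time_series = gpu_nodes x gpus_per_node x metrics_per_gpu.
func timeSeries(gpuNodes, gpusPerNode, metricsPerGPU int) int {
	return gpuNodes * gpusPerNode * metricsPerGPU
}

func main() {
	// Assumption: 25 is the count of DCGM_* metrics listed in this description.
	const metricsPerGPU = 25

	// Hypothetical cluster shapes, used only for illustration.
	scenarios := []struct {
		name        string
		gpuNodes    int
		gpusPerNode int
	}{
		{"small", 2, 1},
		{"medium", 10, 4},
		{"large", 50, 8},
	}

	for _, s := range scenarios {
		fmt.Printf("%-6s %5d series\n", s.name,
			timeSeries(s.gpuNodes, s.gpusPerNode, metricsPerGPU))
	}
}
```

With these inputs the sketch reports 50, 1000, and 10000 series. Because only `instance` and `gpu` vary per series, no further multiplier applies.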
Example Scenarios:
Cardinality Control Strategies Applied
- Only 3 labels retained (`instance`, `gpu`, `job`) instead of 11+ available labels
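To illustrate the effect of keeping only the three retained labels, here is a minimal Go sketch that filters a label set down to `instance`, `gpu`, and `job`. It is not how the collector performs the drop (the PR does that in the scrape configuration); the helper `keepRetained` and all sample label values are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// retained lists the label names kept by the default configuration,
// as described in the Label Configuration section.
var retained = map[string]bool{"instance": true, "gpu": true, "job": true}

// keepRetained (hypothetical helper) returns a copy of labels that
// contains only the retained label names.
func keepRetained(labels map[string]string) map[string]string {
	out := make(map[string]string, len(retained))
	for name, value := range labels {
		if retained[name] {
			out[name] = value
		}
	}
	return out
}

func main() {
	// Sample values are made up for illustration.
	sample := map[string]string{
		"instance":   "aks-gpu-node-0",
		"gpu":        "0",
		"job":        "dcgm-exporter",
		"uuid":       "GPU-example",
		"device":     "nvidia0",
		"hostname":   "aks-gpu-node-0",
		"pci_bus_id": "00000001:00:00.0",
	}

	kept := keepRetained(sample)

	// Sort names so the output order is stable.
	names := make([]string, 0, len(kept))
	for name := range kept {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names) // prints [gpu instance job]
}
```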