@ec-jt ec-jt commented Jun 10, 2025

Fix exporter crash with CUDA 12.9 by sanitising invalid metric names (e.g. fields with [us]) to ensure Prometheus compatibility.
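For readers following along, here is a minimal sketch of what sanitising a field name could look like. It is not the code from this PR: `sanitizeMetricName` is a hypothetical helper, and `example.counter [us]` is a made-up header used only to show the `[us]` case. It relies on Prometheus metric names having to match `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeMetricName is a hypothetical helper (not the exporter's actual code).
// It lowercases the field, replaces every run of characters outside [a-z0-9]
// with a single underscore, and trims leading and trailing underscores.
func sanitizeMetricName(field string) string {
	var b strings.Builder
	lastUnderscore := false
	for _, r := range strings.ToLower(strings.TrimSpace(field)) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
			lastUnderscore = false
			continue
		}
		// Collapse consecutive invalid characters into one underscore.
		if !lastUnderscore {
			b.WriteByte('_')
			lastUnderscore = true
		}
	}
	name := strings.Trim(b.String(), "_")
	if name == "" {
		return "_"
	}
	// A Prometheus metric name must not start with a digit.
	if name[0] >= '0' && name[0] <= '9' {
		name = "_" + name
	}
	return name
}

func main() {
	// "example.counter [us]" is an invented header; the second name is from this thread.
	for _, f := range []string{"example.counter [us]", "inforom.checksum_validation"} {
		fmt.Printf("%q -> %q\n", f, sanitizeMetricName(f))
	}
}
```

With this sketch, `example.counter [us]` becomes `example_counter_us` and `inforom.checksum_validation` becomes `inforom_checksum_validation`.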

@ec-jt ec-jt requested a review from utkuozdemir as a code owner June 10, 2025 17:56
@utkuozdemir
Owner

I thought I already handled the metrics ending with [us] in this PR - can you share one of the metric names that fails here?

@Harry-zklcdc

The cause does not seem to be the '[us]' suffix.

I noticed that the metric inforom.checksum_validation can cause longer printing times, leading to crashes.

The metric was added in NVIDIA Linux Driver 575.

This is how nvidia-smi --help-query-gpu describes the metric:

"inforom.checksum_validation"
Inforom Checksum Validation information.

@utkuozdemir
Owner

Should this be a separate issue then? Because it doesn't seem to be connected to the OP's issue.

Harry-zklcdc commented Oct 26, 2025

I raised it here because I ran into this problem when using NVIDIA Linux Driver 575.x.x (CUDA 12.9) and NVIDIA Linux Driver 580.x.x (CUDA 13.0) on 4090D and A10 GPUs. It can cause NvidiaGpuExporter to crash.

I found that NvidiaGpuExporter works fine after removing this metric.

Actually, this problem occurs when the GPU is under high load.
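As a rough illustration of the workaround, the sketch below drops one field from a list before building the `--query-gpu` argument for nvidia-smi. `excludeFields` is a hypothetical helper, and the field list is only an example; how the exporter actually assembles its query fields is an assumption here.

```go
package main

import (
	"fmt"
	"strings"
)

// excludeFields is a hypothetical helper (not the exporter's actual code).
// It returns the fields that are non-empty and not listed in excluded,
// preserving their original order.
func excludeFields(fields []string, excluded ...string) []string {
	skip := make(map[string]struct{}, len(excluded))
	for _, e := range excluded {
		skip[strings.TrimSpace(e)] = struct{}{}
	}
	kept := make([]string, 0, len(fields))
	for _, f := range fields {
		f = strings.TrimSpace(f)
		if _, drop := skip[f]; drop || f == "" {
			continue
		}
		kept = append(kept, f)
	}
	return kept
}

func main() {
	// Example field list; only inforom.checksum_validation comes from this thread.
	fields := []string{"name", "temperature.gpu", "inforom.checksum_validation", "utilization.gpu"}
	kept := excludeFields(fields, "inforom.checksum_validation")
	// Build the argument that would be passed to nvidia-smi.
	fmt.Println("--query-gpu=" + strings.Join(kept, ","))
}
```

This only illustrates the filtering step; it does not explain why querying the field is slow under load.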
