
Add a metric and an alert for conditional risk conditions #1318

Open
hongkailiu wants to merge 2 commits into openshift:main from hongkailiu:cluster_version_risk_condition

Conversation

@hongkailiu
Member

Follow-up to #1284 (comment)

The new metric is cluster_version_risk_conditions{name, condition, risk}, and the alert fires if some risk applies to the cluster.

@coderabbitai

coderabbitai bot commented Feb 15, 2026

Walkthrough

Adds a new PrometheusRule alert (RiskApplies) and a new Prometheus GaugeVec metric cluster_version_risk_conditions with collection logic and tests to expose cluster version conditional update risk conditions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Service Monitor Configuration<br>install/0000_90_cluster-version-operator_02_servicemonitor.yaml | Inserted a new PrometheusRule alert RiskApplies in the cluster-version group that fires when cluster_version_risk_conditions{condition="Applies"} is true for 10m; includes summary, description, severity, and runbook_url. |
| Metric Implementation<br>pkg/cvo/metrics.go | Added clusterVersionRiskConditions *prometheus.GaugeVec to operatorMetrics, initialized it, exposed its descriptor in Describe, added collectConditionalUpdateRisks to convert ConditionalUpdateRisk entries into labeled 0/1 metrics, and wired the collector into the per-ClusterVersion and main Collect flows. |
| Metric Tests<br>pkg/cvo/metrics_test.go | Added Test_collectConditionalUpdateRisks to validate metric emission for multiple ConditionalUpdateRisk scenarios (no risks, unknown, Applies true/false/unknown). Note: the test was added twice (duplicate definitions present). |
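
The conversion described in the Metric Implementation summary can be pictured with a minimal, standard-library-only sketch. It is not the PR's code: the `Risk`, `Condition`, and `Sample` types and the `riskSamples` function are hypothetical stand-ins, the real implementation uses a `*prometheus.GaugeVec`, and the mapping of "False" and "Unknown" to 0 is an assumption rather than something the walkthrough confirms.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes-style condition.
type Condition struct {
	Type   string // e.g. "Applies"
	Status string // "True", "False", or "Unknown"
}

// Risk is a simplified, hypothetical stand-in for a ConditionalUpdateRisk entry.
type Risk struct {
	Name       string
	Conditions []Condition
}

// Sample is one labeled gauge sample of cluster_version_risk_conditions.
type Sample struct {
	Name, Condition, Risk string
	Value                 float64
}

// riskSamples emits one sample per (risk, condition) pair.
// Assumption: only Status == "True" maps to 1; "False" and "Unknown" map to 0.
func riskSamples(clusterVersionName string, risks []Risk) []Sample {
	var out []Sample
	for _, r := range risks {
		for _, c := range r.Conditions {
			v := 0.0
			if c.Status == "True" {
				v = 1
			}
			out = append(out, Sample{Name: clusterVersionName, Condition: c.Type, Risk: r.Name, Value: v})
		}
	}
	return out
}

func main() {
	// "RiskA" and "RiskB" are made-up risk names for illustration.
	risks := []Risk{
		{Name: "RiskA", Conditions: []Condition{{Type: "Applies", Status: "True"}}},
		{Name: "RiskB", Conditions: []Condition{{Type: "Applies", Status: "Unknown"}}},
	}
	for _, s := range riskSamples("version", risks) {
		fmt.Printf("cluster_version_risk_conditions{name=%q, condition=%q, risk=%q} %v\n",
			s.Name, s.Condition, s.Risk, s.Value)
	}
}
```

A table-driven test over inputs like these (no risks, unknown status, Applies true/false) would cover the scenarios listed in the Metric Tests summary.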

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

No actionable comments were generated in the recent review. 🎉



@openshift-ci
Contributor

openshift-ci bot commented Feb 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongkailiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 15, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.

In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.

Comment on lines +86 to +95
- alert: RiskApplies
  annotations:
    summary: The cluster has been exposed to the conditional update risk for 10 minutes.
    description: The conditional update risk {{ "{{ $labels.risk }}" }} applies to the cluster, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
  expr: |
    max by (namespace, name, risk) (cluster_version_risk_conditions{job="cluster-version-operator", name="version", condition="Applies"} == 1)
  for: 10m
  labels:
    severity: warning
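
To make the alert expression's semantics concrete, here is a rough standard-library Go sketch of what `max by (namespace, name, risk) (... == 1)` computes: the `== 1` filter drops series that do not apply, and the aggregation collapses the remaining series to one result per (namespace, name, risk). The `Series` and `groupKey` types, the `firingRisks` function, and the sample label values are illustrative assumptions, not Prometheus or CVO code.

```go
package main

import "fmt"

// Series is a simplified sample: a label set plus a value.
type Series struct {
	Labels map[string]string
	Value  float64
}

// groupKey holds the labels kept by `max by (namespace, name, risk)`.
type groupKey struct{ Namespace, Name, Risk string }

// firingRisks mimics the alert expression: keep series that match the
// label selectors and have value 1, then take the max per group.
func firingRisks(in []Series) map[groupKey]float64 {
	out := map[groupKey]float64{}
	for _, s := range in {
		l := s.Labels
		if l["job"] != "cluster-version-operator" || l["name"] != "version" || l["condition"] != "Applies" {
			continue // label matchers
		}
		if s.Value != 1 {
			continue // the `== 1` comparison filters out non-matching series
		}
		k := groupKey{l["namespace"], l["name"], l["risk"]}
		if v, ok := out[k]; !ok || s.Value > v {
			out[k] = s.Value
		}
	}
	return out
}

func main() {
	// Label values below are illustrative; "RiskA" and "RiskB" are made-up risk names.
	base := func(risk string) map[string]string {
		return map[string]string{
			"job": "cluster-version-operator", "namespace": "openshift-cluster-version",
			"name": "version", "condition": "Applies", "risk": risk,
		}
	}
	in := []Series{
		{Labels: base("RiskA"), Value: 1},
		{Labels: base("RiskB"), Value: 0},
	}
	for k, v := range firingRisks(in) {
		fmt.Printf("%s/%s risk=%s value=%v\n", k.Namespace, k.Name, k.Risk, v)
	}
}
```

Because the result keeps the `risk` label, the `{{ $labels.risk }}` template in the description can name the specific risk.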

@coderabbitai coderabbitai bot Feb 15, 2026


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Check the actual file content around lines 86-95
cat -n install/0000_90_cluster-version-operator_02_servicemonitor.yaml | sed -n '80,100p'

Repository: openshift/cluster-version-operator

Length of output: 1817


🏁 Script executed:

# Look at all runbook_url entries in the file to understand the pattern
rg "runbook_url:" install/0000_90_cluster-version-operator_02_servicemonitor.yaml -A 1 -B 1

Repository: openshift/cluster-version-operator

Length of output: 2565


🏁 Script executed:

# Check the structure around line 47 to compare ClusterVersionOperatorDown
cat -n install/0000_90_cluster-version-operator_02_servicemonitor.yaml | sed -n '40,55p'

Repository: openshift/cluster-version-operator

Length of output: 1279


🏁 Script executed:

# Try to access the runbooks repository to check file existence
# Since we can't clone, we'll use a different approach to check GitHub

# Check if the cluster-monitoring-operator path exists
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-monitoring-operator/RiskApplies.md" | head -1

# Check if the cluster-version-operator path exists  
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-version-operator/RiskApplies.md" | head -1

Repository: openshift/cluster-version-operator

Length of output: 102


🏁 Script executed:

# Verify if the ClusterVersionOperatorDown runbook exists (which is referenced at line 47)
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-version-operator/ClusterVersionOperatorDown.md" | head -1

# Also check the cluster-monitoring-operator directory structure
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-monitoring-operator/ClusterOperatorDown.md" | head -1

Repository: openshift/cluster-version-operator

Length of output: 102


The runbook URL references a non-existent file.

The runbook_url at line 90 points to a non-existent path. Neither cluster-monitoring-operator/RiskApplies.md nor cluster-version-operator/RiskApplies.md exists in the runbooks repository.

Since RiskApplies is a cluster-version-operator-specific alert (it monitors cluster_version_risk_conditions), it should follow the pattern of other CVO alerts like ClusterVersionOperatorDown at line 47, whose runbooks live under alerts/cluster-version-operator/. For RiskApplies that would be:
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md

Either create the runbook at that location or update the URL to reference an existing runbook.


Member Author


I am working on it.


@hongkailiu, thank you for confirming! I'll wait for the update to the runbook URL.

@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch from 05082f4 to 3b6c582 Compare February 15, 2026 13:51
@hongkailiu hongkailiu force-pushed the cluster_version_risk_condition branch from 3b6c582 to 129bc80 Compare February 15, 2026 13:52
@hongkailiu
Member Author

Testing with 129bc80:

launch 4.22,openshift/cluster-version-operator#1318 aws,techpreview

Then

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://fauxinnati-fauxinnati.apps.ota-stage.q2z4.p1.openshiftapps.com/api/upgrades_info/graph"}}' --type=merge

$ oc adm upgrade channel risks-always

The alert is Pending.

[Screenshots: 2026-02-15 at 10 46 12 and 2026-02-15 at 10 46 27]

@openshift-ci
Contributor

openshift-ci bot commented Feb 15, 2026

@hongkailiu: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hongkailiu
Member Author

/hold

This feature has to be gated by TechPreview (TP).

@openshift-ci openshift-ci bot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Feb 16, 2026

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
