Add a metric and an alert for conditional risk condtions#1318
Add a metric and an alert for conditional risk condtions#1318hongkailiu wants to merge 2 commits intoopenshift:mainfrom
Conversation
WalkthroughAdds a new PrometheusRule alert (RiskApplies) and a new Prometheus GaugeVec metric Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongkailiu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml`:
- Around line 86-95: The runbook_url for the alert RiskApplies points to a
non-existent path; update the runbook_url field in the RiskApplies alert stanza
to use the cluster-version-operator runbook path used by similar alerts (e.g.,
match the URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.
In `@pkg/cvo/metrics_test.go`:
- Line 976: The test function named Test_collectConditionalUpdates is testing
collectConditionalUpdateRisks (not collectConditionalUpdates), which is
misleading and conflicts with TestCollectUnknownConditionalUpdates; rename the
test to something like Test_collectConditionalUpdateRisks (or
TestCollectConditionalUpdateRisks) and update any references so the test name
clearly matches the function under test (collectConditionalUpdateRisks) and does
not collide with the existing TestCollectUnknownConditionalUpdates that targets
collectConditionalUpdates.
| - alert: RiskApplies | ||
| annotations: | ||
| summary: The cluster has been exposed to the conditional update risk for 10 minutes. | ||
| description: The conditional update risk {{ "{{ $labels.risk }}" }} applies to the cluster, and the cluster update to a version exposed to the risk is not recommended. For more information refer to 'oc adm upgrade'. | ||
| runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md | ||
| expr: | | ||
| max by (namespace, name, risk) (cluster_version_risk_conditions{job="cluster-version-operator", name="version", condition="Applies"} == 1) | ||
| for: 10m | ||
| labels: | ||
| severity: warning |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Check the actual file content around lines 86-95
cat -n install/0000_90_cluster-version-operator_02_servicemonitor.yaml | sed -n '80,100p'Repository: openshift/cluster-version-operator
Length of output: 1817
🏁 Script executed:
# Look at all runbook_url entries in the file to understand the pattern
rg "runbook_url:" install/0000_90_cluster-version-operator_02_servicemonitor.yaml -A 1 -B 1Repository: openshift/cluster-version-operator
Length of output: 2565
🏁 Script executed:
# Check the structure around line 47 to compare ClusterVersionOperatorDown
cat -n install/0000_90_cluster-version-operator_02_servicemonitor.yaml | sed -n '40,55p'Repository: openshift/cluster-version-operator
Length of output: 1279
🏁 Script executed:
# Try to access the runbooks repository to check file existence
# Since we can't clone, we'll use a different approach to check GitHub
# Check if the cluster-monitoring-operator path exists
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-monitoring-operator/RiskApplies.md" | head -1
# Check if the cluster-version-operator path exists
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-version-operator/RiskApplies.md" | head -1Repository: openshift/cluster-version-operator
Length of output: 102
🏁 Script executed:
# Verify if the ClusterVersionOperatorDown runbook exists (which is referenced at line 47)
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-version-operator/ClusterVersionOperatorDown.md" | head -1
# Also check the cluster-monitoring-operator directory structure
curl -s -I "https://raw.githubusercontent.com/openshift/runbooks/master/alerts/cluster-monitoring-operator/ClusterOperatorDown.md" | head -1Repository: openshift/cluster-version-operator
Length of output: 102
The runbook URL references a non-existent file.
The runbook_url at line 90 points to a non-existent path. Neither cluster-monitoring-operator/RiskApplies.md nor cluster-version-operator/RiskApplies.md exist in the runbooks repository.
Since RiskApplies is a cluster-version-operator-specific alert (it monitors cluster_version_risk_conditions), it should follow the pattern of other CVO alerts like ClusterVersionOperatorDown at line 47, which uses:
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
Either create the runbook at that location or update the URL to reference an existing runbook.
🤖 Prompt for AI Agents
In `@install/0000_90_cluster-version-operator_02_servicemonitor.yaml` around lines
86 - 95, The runbook_url for the alert RiskApplies points to a non-existent
path; update the runbook_url field in the RiskApplies alert stanza to use the
cluster-version-operator runbook path used by similar alerts (e.g., match the
URL used by ClusterVersionOperatorDown) — i.e., replace
https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/RiskApplies.md
with
https://github.com/openshift/runbooks/blob/master/alerts/cluster-version-operator/RiskApplies.md
or alternatively create the missing runbook at the original path so the current
URL remains valid.
There was a problem hiding this comment.
@hongkailiu, thank you for confirming! I'll wait for the update to the runbook URL.
05082f4 to
3b6c582
Compare
3b6c582 to
129bc80
Compare
|
Testing with 129bc80: Then The alert is
and
|
|
@hongkailiu: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/hold This feature has to be gated by TP. |


Follow up #1284 (comment)
The new metic is
cluster_version_risk_conditions{name, condition, risk}and the alert is firing if some risk applies to the cluster.