rhdh: add cluster health check and increase login timeout for hibernating clusters by gustavolira · Pull Request #76597 · openshift/release

gustavolira · 2026-03-20T12:29:36Z

Summary

Increase oc login timeout from 5 to 10 minutes to handle clusters resuming from hibernation
Increase retry interval from 20s to 30s to reduce log noise
Add a cluster health check (oc get nodes) after login to verify that the API server is fully responsive before proceeding with service account creation
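The three changes above can be sketched as a small retry helper followed by a post-login check. This is a minimal illustration, not the code from the PR: the PR wraps a heredoc in `timeout --foreground`, whereas this sketch uses a deadline based on bash's `SECONDS` variable. The function names `retry_until_timeout` and `login_and_check` and their argument order are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run a command every <interval> seconds until it
# succeeds or <timeout> seconds have elapsed. The command line is never
# echoed, so a token passed as an argument does not end up in the log.
retry_until_timeout() {
  local timeout_s="$1" interval_s="$2"
  shift 2
  local deadline=$(( SECONDS + timeout_s ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "ERROR: command still failing after ${timeout_s}s" >&2
      return 1
    fi
    echo "Attempt failed, retrying in ${interval_s}s..." >&2
    sleep "$interval_s"
  done
}

# Hypothetical wrapper: 10-minute login window, 30-second retry interval,
# then a request that only succeeds if the API server actually answers.
login_and_check() {
  local server_url="$1" token="$2"
  retry_until_timeout 600 30 \
    oc login --token="$token" --server="$server_url" >/dev/null || return 1
  echo "========== Cluster Health Check =========="
  oc get nodes
}
```

A step script might call this as `login_and_check "$CLUSTER_URL" "$CLUSTER_TOKEN"`; both variable names are placeholders for whatever the job actually provides.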

Context

The e2e-ocp-v4-20-helm-nightly job failed because the ephemeral OCP cluster was resuming from hibernation. The 5-minute login timeout was insufficient, and there was no validation that the API server was functional after login succeeded.

Most cluster pools (v4.16, v4.17, v4.19, v4.20, v4.21) have no runningCount, so clusters hibernate between jobs. Increasing runningCount was rejected because of the AWS cost: nightly jobs run only 3x/week.

Build log: https://storage.googleapis.com/test-platform-results/logs/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-20-helm-nightly/2034140407427764224/build-log.txt

Ref: RHIDP-12736

Changes

Applied to all 3 OCP helm step scripts:

ocp/helm/nightly (nightly jobs)
ocp/helm (PR checks)
ocp/helm/upgrade/nightly (upgrade jobs)

Test plan

Verify nightly jobs pass on next scheduled run
Confirm health check logs appear in build output (Cluster API server is ready)

…ting clusters

Ephemeral OCP clusters resume from hibernation and may not be fully responsive when CI jobs start. The existing 5-minute login timeout is insufficient for clusters that take longer to wake up, and there is no validation that the API server is functional after login succeeds.

- Increase oc login timeout from 5 to 10 minutes
- Increase retry interval from 20s to 30s
- Add cluster health check (oc get nodes) after login to verify API server readiness before proceeding with SA creation

Ref: RHIDP-12736

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci · 2026-03-20T12:30:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gustavolira

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/redhat-developer/rhdh/ocp/helm/OWNERS~~ [gustavolira]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Protect credentials from trace logging, use robust if-timeout pattern instead of fragile $? checks, and show oc get nodes errors instead of suppressing all output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
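The "if-timeout" pattern named in this commit can be illustrated as follows. This is a sketch under assumptions rather than the script from the PR: `run_with_limit` is a hypothetical name, and the exit status 124 for an expired limit is the behavior of GNU coreutils `timeout`.

```shell
#!/usr/bin/env bash

# Fragile form: any command placed between the call and the check
# silently replaces the status being tested.
#   timeout 10m some_command
#   echo "finished"            # $? now belongs to echo
#   if [ $? -ne 0 ]; then ...  # tests echo, not some_command

# More robust form: branch directly on the command. stderr of the wrapped
# command is left alone so its error messages stay visible in the log.
run_with_limit() {
  local limit="$1"
  shift
  if timeout "$limit" "$@"; then
    return 0
  else
    local rc=$?   # status of the timeout call in the if condition
    if (( rc == 124 )); then
      echo "ERROR: no success within ${limit}" >&2
    else
      echo "ERROR: command exited with status ${rc}" >&2
    fi
    return "$rc"
  fi
}
```

For the health check in this PR, a call could look like `run_with_limit 60s oc get nodes`; the 60-second limit is only an example value.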

zdrapela · 2026-03-20T12:51:04Z

/pj-rehearse ?

openshift-ci-robot · 2026-03-20T12:51:08Z

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

zdrapela · 2026-03-20T12:52:15Z

/pj-rehearse

openshift-ci-robot · 2026-03-20T12:52:18Z

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

zdrapela · 2026-03-20T12:52:48Z

/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly

openshift-ci-robot · 2026-03-20T12:52:52Z

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-03-20T12:53:56Z

@zdrapela: job(s): ? either don't exist or were not found to be affected, and cannot be rehearsed

zdrapela · 2026-03-20T13:07:44Z

...ry/redhat-developer/rhdh/ocp/helm/nightly/redhat-developer-rhdh-ocp-helm-nightly-commands.sh


-timeout --foreground 5m bash <<-"EOF"
+# Disable tracing to protect credentials from leaking into CI logs
+set +x 2>/dev/null


We don't use set anywhere in the CI scripts, so tracing is already off by default, which is the state set +x gives you. If you want to configure it explicitly, wouldn't it make more sense to do that at the beginning of the commands.sh bash script? This is not the only sensitive data here.
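One way to act on this comment is to turn tracing off once at the top of the script and add a cheap guard before any credential handling. The sketch below is illustrative only; `require_xtrace_off` is a hypothetical helper, not something that exists in the step registry.

```shell
#!/usr/bin/env bash

# Explicitly disable command tracing at the very top, so every later
# command in the script is covered, not just the login block.
set +x

# Hypothetical guard: fail early if tracing was re-enabled somewhere,
# instead of printing secrets into the CI log. "$-" holds the current
# shell option flags and contains "x" while xtrace is active.
require_xtrace_off() {
  if [[ $- == *x* ]]; then
    echo "Refusing to handle credentials while 'set -x' is active" >&2
    return 1
  fi
}
```

Calling `require_xtrace_off || exit 1` just before the login step would keep the check close to the sensitive commands while leaving the single `set +x` at the top.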

zdrapela · 2026-03-20T13:12:30Z

...ry/redhat-developer/rhdh/ocp/helm/nightly/redhat-developer-rhdh-ocp-helm-nightly-commands.sh

 fi

+echo "========== Cluster Health Check =========="
+echo "Verifying cluster API server is fully responsive..."


This is more of a check that the nodes are ready, so why not echo what it actually is?
Maybe oc wait --for=condition=Ready nodes --all --timeout=XXXs would also be handy here?
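The suggestion could be sketched as below, with log messages that describe a node readiness check rather than an API server check. The function name `wait_for_nodes_ready` and the 300s default are assumptions made for this example; the review leaves the timeout value open.

```shell
#!/usr/bin/env bash

# Hypothetical helper: block until every node reports the Ready condition.
wait_for_nodes_ready() {
  local limit="${1:-300s}"   # assumed default, not taken from the PR
  echo "========== Node Readiness Check =========="
  echo "Waiting up to ${limit} for all nodes to report Ready..."
  if oc wait --for=condition=Ready nodes --all --timeout="$limit"; then
    echo "All nodes are Ready"
  else
    echo "ERROR: not all nodes became Ready within ${limit}" >&2
    # Show the current node status to help with debugging; ignore failures here.
    oc get nodes >&2 || true
    return 1
  fi
}
```

Because `oc wait` only returns once the condition holds or the timeout expires, it also covers the case where the API server answers but the nodes are still coming back from hibernation.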

openshift-ci · 2026-03-20T15:29:46Z

@gustavolira: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly | `11b5c34` | link | unknown | `/pj-rehearse periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly` |
| ci/rehearse/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly | `11b5c34` | link | unknown | `/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly` |
| ci/rehearse/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-21-helm-nightly | `11b5c34` | link | unknown | `/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-21-helm-nightly` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

zdrapela · 2026-03-20T20:46:35Z

/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly

zdrapela · 2026-03-20T20:47:14Z

/pj-rehearse periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly

openshift-ci bot requested review from josephca and subhashkhileri March 20, 2026 12:29

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 20, 2026

gustavolira mentioned this pull request Mar 20, 2026

revert: add cluster health check and retry logic for hibernating clusters #4417 redhat-developer/rhdh#4434

Merged

zdrapela reviewed Mar 20, 2026


3 participants
