rhdh: add cluster health check and increase login timeout for hibernating clusters #76597

Open
gustavolira wants to merge 2 commits into openshift:main from gustavolira:rhdh/cluster-health-check-retry
Conversation

@gustavolira
Member

Summary

  • Increase oc login timeout from 5 to 10 minutes to handle clusters resuming from hibernation
  • Increase retry interval from 20s to 30s to reduce log noise
  • Add cluster health check (oc get nodes) after login to verify API server is fully responsive before proceeding with service account creation
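The retry behaviour summarized above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name retry_until and the variables in the usage comment are hypothetical, and the 600s/30s values simply mirror the 10-minute timeout and 30-second interval listed in the summary.

```shell
# Hypothetical helper (not from the PR): retry a command at a fixed interval
# until it succeeds or the deadline has passed.
#   $1 = deadline in seconds, $2 = interval in seconds, rest = command to run
retry_until() {
  local deadline_s="$1" interval_s="$2" start
  shift 2
  start=$(date +%s)
  until "$@"; do
    if [ $(( $(date +%s) - start )) -ge "$deadline_s" ]; then
      echo "Timed out after ${deadline_s}s: $1" >&2
      return 1
    fi
    echo "Command failed, retrying in ${interval_s}s..." >&2
    sleep "$interval_s"
  done
}

# Intended use in a CI step (variable names are assumptions, shown as comments
# because they need a live cluster):
#   retry_until 600 30 oc login --token="$TOKEN" --server="$API_URL"
#   oc get nodes && echo "Cluster API server is ready"
```

Only the first word of the failing command is printed on timeout, so arguments such as tokens do not end up in the log.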

Context

The e2e-ocp-v4-20-helm-nightly job failed because the ephemeral OCP cluster was resuming from hibernation. The 5-minute login timeout was insufficient, and there was no validation that the API server was functional after login succeeded.

Most cluster pools (v4.16, v4.17, v4.19, v4.20, v4.21) have no runningCount, so clusters hibernate between jobs. Increasing runningCount was rejected due to AWS cost since nightly jobs only run 3x/week.

Build log: https://storage.googleapis.com/test-platform-results/logs/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-20-helm-nightly/2034140407427764224/build-log.txt

Ref: RHIDP-12736

Changes

Applied to all 3 OCP helm step scripts:

  • ocp/helm/nightly (nightly jobs)
  • ocp/helm (PR checks)
  • ocp/helm/upgrade/nightly (upgrade jobs)

Test plan

  • Verify nightly jobs pass on next scheduled run
  • Confirm health check logs appear in build output (Cluster API server is ready)

…ting clusters

Ephemeral OCP clusters resume from hibernation and may not be fully
responsive when CI jobs start. The existing 5-minute login timeout is
insufficient for clusters that take longer to wake up, and there is no
validation that the API server is functional after login succeeds.

- Increase oc login timeout from 5 to 10 minutes
- Increase retry interval from 20s to 30s
- Add cluster health check (oc get nodes) after login to verify
  API server readiness before proceeding with SA creation

Ref: RHIDP-12736

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci
Contributor

openshift-ci bot commented Mar 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gustavolira

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the "approved" label (Indicates a PR has been approved by an approver from all required OWNERS files.) Mar 20, 2026

Protect credentials from trace logging, use robust if-timeout pattern
instead of fragile $? checks, and show oc get nodes errors instead of
suppressing all output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
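
As a rough illustration of the "if-timeout pattern" this commit message refers to, the sketch below tests the command directly in the if condition instead of reading $? afterwards. The function name run_with_deadline is hypothetical, and the exit status 124 on timeout is the GNU coreutils timeout convention; other implementations may differ.

```shell
# Fragile variant (shown as comments): any command placed between the call
# and the check overwrites $?.
#   timeout 5m some_command
#   echo "finished"          # $? is now the exit status of echo
#   if [ $? -ne 0 ]; then ...

# More robust sketch: the command itself is the if condition.
#   $1 = duration accepted by timeout(1), rest = external command to run
run_with_deadline() {
  local limit="$1" status
  shift
  if timeout "$limit" "$@"; then
    echo "ok"
  else
    status=$?   # in the else branch, $? is still the status of the condition
    if [ "$status" -eq 124 ]; then
      echo "timed out"
    else
      echo "failed ($status)"
    fi
    return "$status"
  fi
}
```

Because timeout executes an external program, shell functions cannot be passed to it directly; they would need to be wrapped in bash -c or a heredoc, as the step scripts do.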
@zdrapela
Contributor

/pj-rehearse ?

@openshift-ci-robot
Contributor

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@zdrapela
Contributor

/pj-rehearse

@openshift-ci-robot
Contributor

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@zdrapela
Contributor

/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly

@openshift-ci-robot
Contributor

@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot
Contributor

@zdrapela: job(s): ? either don't exist or were not found to be affected, and cannot be rehearsed


timeout --foreground 5m bash <<-"EOF"
# Disable tracing to protect credentials from leaking into CI logs
set +x 2>/dev/null

@zdrapela
Contributor

zdrapela commented Mar 20, 2026

We don't use set anywhere in the CI scripts, so set +x is already in effect by default. If you want to configure it explicitly, wouldn't it make more sense to do it at the beginning of the commands.sh bash script? This is not the only sensitive data here.
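
One way to make tracing explicit for a whole script, along the lines suggested above, is to record the xtrace state once at the top and restore it at the end. The sketch below is only an assumption about how that could look; the helper names xtrace_off and xtrace_restore and the variable XTRACE_WAS_ON are hypothetical and do not appear in the CI scripts.

```shell
# Hypothetical helpers: remember whether xtrace was enabled, disable it for
# the sensitive part of the script, and restore the previous state afterwards.
xtrace_off() {
  case "$-" in
    *x*) XTRACE_WAS_ON=1 ;;
    *)   XTRACE_WAS_ON=0 ;;
  esac
  set +x
}

xtrace_restore() {
  if [ "${XTRACE_WAS_ON:-0}" -eq 1 ]; then
    set -x
  fi
}

# Intended layout of a step script (comments only):
#   xtrace_off          # first line of commands.sh
#   ...commands that handle tokens and passwords...
#   xtrace_restore      # optional, once no secrets are in use
```

Calling xtrace_off takes no arguments, so even if tracing is on when it runs, nothing sensitive is echoed by the call itself.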

fi

echo "========== Cluster Health Check =========="
echo "Verifying cluster API server is fully responsive..."

This is more of a check that the nodes are ready, so why not echo what it actually is?
Maybe oc wait --for=condition=Ready nodes --all --timeout=XXXs would also be handy here?
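
For comparison with the oc wait suggestion, a node readiness check could also be written as a filter over oc get nodes --no-headers output. This is a sketch under assumptions: the function name all_nodes_ready is hypothetical, and it relies on the usual column layout in which STATUS is the second column.

```shell
# Hypothetical helper: reads `oc get nodes --no-headers` output on stdin.
# Succeeds only if at least one node is listed and the first comma-separated
# value in every node's STATUS column is exactly "Ready" (so
# "Ready,SchedulingDisabled" passes and "NotReady" does not).
all_nodes_ready() {
  awk '
    { seen = 1; split($2, s, ","); if (s[1] != "Ready") bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Intended use against a live cluster (comments only):
#   oc get nodes --no-headers | all_nodes_ready && echo "All nodes are Ready"
# or, letting the CLI do the waiting, with a timeout value of your choosing:
#   oc wait --for=condition=Ready nodes --all --timeout=XXXs
```

The oc wait form is shorter and blocks until the condition holds, while the filter form only inspects a single snapshot and would need its own retry loop.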

@openshift-ci
Contributor

openshift-ci bot commented Mar 20, 2026

@gustavolira: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/rehearse/periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly | 11b5c34 | link | unknown | /pj-rehearse periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly
ci/rehearse/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly | 11b5c34 | link | unknown | /pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly
ci/rehearse/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-21-helm-nightly | 11b5c34 | link | unknown | /pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-21-helm-nightly

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@zdrapela
Contributor

/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly

@zdrapela
Contributor

/pj-rehearse periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly
