rhdh: add cluster health check and increase login timeout for hibernating clusters #76597
gustavolira wants to merge 2 commits into openshift:main
…ting clusters

Ephemeral OCP clusters resume from hibernation and may not be fully responsive when CI jobs start. The existing 5-minute login timeout is insufficient for clusters that take longer to wake up, and there is no validation that the API server is functional after login succeeds.

- Increase oc login timeout from 5 to 10 minutes
- Increase retry interval from 20s to 30s
- Add cluster health check (oc get nodes) after login to verify API server readiness before proceeding with SA creation

Ref: RHIDP-12736
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
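The retry behaviour described in this commit message can be sketched as a small helper. This is only an illustration under stated assumptions, not the code from the PR: `retry_until_deadline`, `CLUSTER_API_URL`, and `CLUSTER_TOKEN` are hypothetical names, and the 600s/30s values simply mirror the 10-minute timeout and 30-second interval listed above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): re-run a command every
# interval_secs seconds until it succeeds or deadline_secs is exhausted.
retry_until_deadline() {
  local deadline_secs="$1" interval_secs="$2"
  shift 2
  local start=$SECONDS
  until "$@"; do
    # Stop if another sleep would overshoot the deadline.
    if (( SECONDS - start + interval_secs > deadline_secs )); then
      # Print only the command name, never its arguments (may hold secrets).
      echo "Gave up after $(( SECONDS - start ))s waiting for: $1" >&2
      return 1
    fi
    echo "Not ready yet, retrying in ${interval_secs}s..."
    sleep "$interval_secs"
  done
}

# Example usage (assumed variable names; keep xtrace off around credentials):
#   retry_until_deadline 600 30 oc login "$CLUSTER_API_URL" --token="$CLUSTER_TOKEN"
#   oc get nodes   # confirm the API server answers real requests after login
```

The helper logs only the command name on failure so that login arguments do not end up in CI output.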
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gustavolira
Protect credentials from trace logging, use robust if-timeout pattern instead of fragile $? checks, and show oc get nodes errors instead of suppressing all output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
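The difference between a separate $? check and the if-timeout pattern can be illustrated with a short sketch. It assumes GNU coreutils `timeout` (which exits with 124 when the limit is hit); `run_with_deadline` is a hypothetical name and not necessarily how the PR structures its scripts.

```shell
#!/usr/bin/env bash

# Fragile form: $? is overwritten by any command that runs in between.
#   timeout --foreground 10m some_command
#   echo "done"                 # $? now reflects echo, not timeout
#   if [ $? -ne 0 ]; then ...   # never fires
#
# More robust form: let `if` test the command's exit status directly.
run_with_deadline() {
  local limit="$1"
  shift
  if timeout --foreground "$limit" "$@"; then
    echo "Completed within ${limit}"
  else
    local status=$?   # still the exit status of the timeout command here
    if [ "$status" -eq 124 ]; then
      echo "Timed out after ${limit}" >&2
    else
      echo "Failed with exit code ${status}" >&2
    fi
    return "$status"
  fi
}

# Example usage:
#   run_with_deadline 10m oc get nodes
```

Because the condition and the command are one statement, later edits cannot accidentally insert a command between them and mask a failure.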
|
/pj-rehearse ? |
|
@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse |
|
@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly |
|
@zdrapela: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@zdrapela: job(s): ? either don't exist or were not found to be affected, and cannot be rehearsed |
|
timeout --foreground 5m bash <<-"EOF"
# Disable tracing to protect credentials from leaking into CI logs
set +x 2>/dev/null
We don't use set anywhere in the CI scripts, so tracing is already off (set +x is the default). If you want to configure it explicitly, wouldn't it make more sense to do it at the beginning of the commands.sh bash script? This is not the only sensitive data here.
fi

echo "========== Cluster Health Check =========="
echo "Verifying cluster API server is fully responsive..."
This is really a check that the nodes are ready, so why not echo what it actually does?
Also, oc wait --for=condition=Ready nodes --all --timeout=XXXs might be handy here.
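The suggestion above could look roughly like the following. This is a sketch of the reviewer's idea rather than code from the PR: `wait_for_nodes_ready` and its 300-second default are assumptions, and the right timeout for a cluster waking from hibernation is left open in the review.

```shell
#!/usr/bin/env bash

# Hypothetical helper: block until every node reports the Ready condition,
# and print the node list if the wait fails so the CI log shows why.
wait_for_nodes_ready() {
  local timeout_secs="${1:-300}"   # assumed default, not taken from the PR
  echo "Waiting up to ${timeout_secs}s for all nodes to become Ready..."
  if oc wait --for=condition=Ready nodes --all --timeout="${timeout_secs}s"; then
    echo "All nodes are Ready"
  else
    echo "Nodes were not Ready within ${timeout_secs}s; current state:" >&2
    oc get nodes >&2 || true
    return 1
  fi
}

# Example usage:
#   wait_for_nodes_ready 600 || exit 1
```

Naming the message after what is actually checked (node readiness) keeps the log output consistent with the command being run.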
|
@gustavolira: The following tests failed, say
|
/pj-rehearse periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-19-helm-nightly |
|
/pj-rehearse periodic-ci-redhat-developer-rhdh-main-e2e-ocp-v4-20-helm-nightly |
Summary
- Increase oc login timeout from 5 to 10 minutes to handle clusters resuming from hibernation
- Add cluster health check (oc get nodes) after login to verify API server is fully responsive before proceeding with service account creation

Context
The e2e-ocp-v4-20-helm-nightly job failed because the ephemeral OCP cluster was resuming from hibernation. The 5-minute login timeout was insufficient, and there was no validation that the API server was functional after login succeeded.

Most cluster pools (v4.16, v4.17, v4.19, v4.20, v4.21) have no runningCount, so clusters hibernate between jobs. Increasing runningCount was rejected due to AWS cost since nightly jobs only run 3x/week.

Build log: https://storage.googleapis.com/test-platform-results/logs/periodic-ci-redhat-developer-rhdh-release-1.9-e2e-ocp-v4-20-helm-nightly/2034140407427764224/build-log.txt

Ref: RHIDP-12736

Changes
Applied to all 3 OCP helm step scripts:
- ocp/helm/nightly (nightly jobs)
- ocp/helm (PR checks)
- ocp/helm/upgrade/nightly (upgrade jobs)

Test plan
Cluster API server is ready)