OCPBUGS-20056: EndpointAccessibleController: Retry the whole list endpoints and send request loop 3 times by tchap · Pull Request #855 · openshift/cluster-authentication-operator

tchap · 2026-03-24T10:36:38Z

The reason for this change is to get rid of intermittent Available=false, possibly even during upgrades.
This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.

Mostly trying to improve CI, though. A failing job example.

openshift-ci · 2026-03-24T10:36:42Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-03-24T10:36:45Z

@tchap: This pull request references Jira Issue OCPBUGS-20056, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The reason for this change is to get rid of intermittent Available=false, even during upgrades.
This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-03-24T10:36:46Z

Walkthrough

Controller sync now retries full endpoint-list fetch plus parallel HTTP reachability checks up to configurable attempts with per-request timeouts, retry intervals, and an injectable HTTP client; resync jitter base made configurable. Tests use in-process httptest servers and add retry, stale-endpoint, and request-timeout coverage.

Changes

Cohort / File(s)	Summary
Controller Implementation `pkg/libs/endpointaccessible/endpoint_accessible_controller.go`	Replaced fixed resync jitter with `resyncInterval`. Added configurable fields: `httpClient`, `requestTimeout`, `retryInterval`, `attemptCount`. `sync` now lazily builds/uses `c.httpClient`, retries full fetch+reachability up to `attemptCount` re-fetching endpoints per attempt, applies per-request context timeouts, sleeps `retryInterval` between attempts (skipping backoff after request timeouts), treats endpoint-list errors as `ResourceNotFound` degraded status without retrying, short-circuits on at least one successful endpoint, and removed global `http.Client` timeout from TLS client construction.
Controller Tests `pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go`	Renamed primary test, switched external URLs to `httptest` servers, added `newSyncContext(t)` helper, and configured controller with `httpClient`, `attemptCount`, and per-test timeouts. Added tests for: retry on transient 500→200 with `retryInterval` behavior; re-invocation of `endpointListFn` across retries when endpoint list changes; per-request timeout enforcement with retries that skip backoff after deadline timeouts. Added test fakes: `hangingTransport`, `failFastTransport`, and supporting synctest utilities.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

tchap · 2026-03-24T11:46:23Z

/jira refresh

openshift-ci-robot · 2026-03-24T11:46:31Z

@tchap: This pull request references Jira Issue OCPBUGS-20056, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

tchap · 2026-03-24T13:44:49Z

@coderabbitai review

coderabbitai · 2026-03-24T13:44:58Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 142-153: The code currently returns the raw error when
c.endpointListFn() fails for non-NotFound cases, which bypasses the whole-cycle
retry/status update path; change the non-NotFound branch to mirror the NotFound
handling: build an OperatorStatus for c.availableConditionName with
ConditionFalse, set a Reason like "ListError" and the error text via
err.Error(), then call c.operatorClient.ApplyOperatorStatus(ctx,
c.controllerInstanceName, status) and return that result instead of returning
err directly so transient lister/cache failures update operator status and enter
the same retry path.
- Around line 112-119: The production path uses buildTLSClient() which
hard-codes Timeout: 5 * time.Second and thus overrides the computed
requestTimeout (default 10s) so client-level timeout wins and breaks
configurable retry behavior; fix by either removing the fixed Timeout in
buildTLSClient() to rely on context deadlines or change the call site (where
client is nil) to pass the computed requestTimeout into
buildTLSClient(requestTimeout) and apply that value to the client's Timeout;
update the buildTLSClient function signature and any callers to accept and use
requestTimeout (or remove the Timeout field entirely) so the requestTimeout
variable takes effect for production calls that create the http client.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aabf9163-8091-402f-895c-249924e7eb8c

📥 Commits

Reviewing files that changed from the base of the PR and between 78d19ea and 795cde0.

📒 Files selected for processing (2)

pkg/libs/endpointaccessible/endpoint_accessible_controller.go
pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

openshift-ci · 2026-03-24T23:23:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
Once this PR has been reviewed and has the lgtm label, please assign ibihim for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 145-148: The select that waits between retries currently ignores
the ctx.Done() branch and continues, so when ctx is canceled the controller
still runs another endpointListFn() and probe; update the select in the retry
loop inside the sync logic to exit the sync when ctx.Done() is selected (e.g.,
return from the enclosing function or break out of the retry loop) so that no
further endpointListFn() calls or probe passes occur after cancellation and you
don't immediately report EndpointUnavailable during shutdown.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7514ce8-5dab-4adb-a476-6002444e71c1

📥 Commits

Reviewing files that changed from the base of the PR and between 795cde0 and 37b3d01.

📒 Files selected for processing (2)

pkg/libs/endpointaccessible/endpoint_accessible_controller.go
pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

openshift-ci-robot · 2026-03-25T07:04:35Z

@tchap: This pull request references Jira Issue OCPBUGS-20056, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

Details

In response to this:

The reason for this change is to get rid of intermittent Available=false, possibly even during upgrades.
This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

tchap · 2026-03-25T07:25:08Z

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

+
+	defaultRequestTimeout = 10 * time.Second
+	defaultRetryInterval  = 5 * time.Second
+	defaultAttemptCount   = 3


Not 100% sure about these values. It effectively means there is an inertial of 40 seconds before an actual Available=false issue surfaces. But there is always a trade-off.

For a health check endpoint 10s request timeout feels a bit too long; the previous 5s is already generous enough in my view.

Regarding the retry interval, I don't have enough knowledge to reason with it -- the goal is to wait for the endpoints to be updated with fresh pod IPs after a rolling upgrade, and I do not know what constitutes enough time for this to happen. If you've tested this and are happy with it, that's good enough for me 👍

I am changing this to 5s for timeout and 2s for retry, also skipping the backoff when there is a request timeout, so the wait interval between attempts is not prolonged for no reason.

tchap · 2026-03-25T14:35:15Z

I can see now that there is also openshift/library-go#2057 now, which handles Available inertia globally. We need to actually decide which way to go, I guess...

I think that this PR makes sense on its own, though.

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

liouk · 2026-03-25T14:43:57Z

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

-			if err != nil {
-				errCh <- humanizeError(err)
-				return
+	for i := range attempts {


Would it make more sense to take advantage of the wait module and use something like wait.ExponentialBackoffWithContext for the retry-with-backoff loop?

It might have been possible before, but I changed the code not to back off when there is a timeout, so I guess I will leave this as it is.

Actually I can use the same mechanism as in openshift/library-go#2148 , once the PR is merged and the helper is available in library-go.

liouk · 2026-03-25T14:48:12Z

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

+
+	defaultRequestTimeout = 10 * time.Second
+	defaultRetryInterval  = 5 * time.Second
+	defaultAttemptCount   = 3


For a health check endpoint 10s request timeout feels a bit too long; the previous 5s is already generous enough in my view.

Regarding the retry interval, I don't have enough knowledge to reason with it -- the goal is to wait for the endpoints to be updated with fresh pod IPs after a rolling upgrade, and I do not know what constitutes enough time for this to happen. If you've tested this and are happy with it, that's good enough for me 👍

pkg/libs/endpointaccessible/endpoint_accessible_controller.go

pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

The reason for this change is to get rid of intermittent Available=false, even during upgrades. This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.

Also set timeout to 5s and backoff to 2s.

tchap · 2026-04-01T12:18:54Z

/hold

to squash before merging.

coderabbitai

♻️ Duplicate comments (1)

pkg/libs/endpointaccessible/endpoint_accessible_controller.go (1)

156-168: ⚠️ Potential issue | 🟠 Major

Non-NotFound list errors still bypass the whole-cycle retry

Line 168 returns immediately for non-NotFound endpointListFn() errors, so transient lister/cache failures skip the new retry loop entirely.

Suggested fix

 		endpoints, err = c.endpointListFn()
 		if err != nil {
 			if apierrors.IsNotFound(err) {
 				status := applyoperatorv1.OperatorStatus().
 					WithConditions(applyoperatorv1.OperatorCondition().
 						WithType(c.availableConditionName).
 						WithStatus(operatorv1.ConditionFalse).
 						WithReason("ResourceNotFound").
 						WithMessage(err.Error()))
 				return c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status)
 			}
-			// Do not retry on endpoint list error since these are not transient.
-			return err
+			// Retry non-NotFound list errors as part of the full cycle.
+			if i == attempts-1 {
+				return err
+			}
+			continue
 		}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go` around lines
156 - 168, The code returns immediately on non-NotFound errors from
endpointListFn(), which prevents the outer whole-cycle retry from handling
transient lister/cache failures; update the branch handling errors from
c.endpointListFn() so that for non-NotFound errors you also set an
OperatorStatus (similar to the ResourceNotFound branch) using
c.availableConditionName and include the error message/reason (e.g.,
"EndpointListError" or "ListFailed"), call
c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status) to
record the failure, and then return the original error so the outer retry loop
can observe and retry; locate the logic around endpointListFn(),
availableConditionName, controllerInstanceName and
operatorClient.ApplyOperatorStatus to make the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 156-168: The code returns immediately on non-NotFound errors from
endpointListFn(), which prevents the outer whole-cycle retry from handling
transient lister/cache failures; update the branch handling errors from
c.endpointListFn() so that for non-NotFound errors you also set an
OperatorStatus (similar to the ResourceNotFound branch) using
c.availableConditionName and include the error message/reason (e.g.,
"EndpointListError" or "ListFailed"), call
c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status) to
record the failure, and then return the original error so the outer retry loop
can observe and retry; locate the logic around endpointListFn(),
availableConditionName, controllerInstanceName and
operatorClient.ApplyOperatorStatus to make the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 298a9e91-5242-49c4-8f66-9b286fd467a2

📥 Commits

Reviewing files that changed from the base of the PR and between ad9a720 and 306ee64.

📒 Files selected for processing (2)

pkg/libs/endpointaccessible/endpoint_accessible_controller.go
pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go

openshift-ci · 2026-04-01T16:16:58Z

@tchap: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-operator-parallel-ote	`306ee64`	link	false	`/test e2e-aws-operator-parallel-ote`
ci/prow/e2e-aws-operator-encryption-rotation-serial-ote-1of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-rotation-serial-ote-1of2`
ci/prow/e2e-aws-operator-encryption-perf-serial-ote-1of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-perf-serial-ote-1of2`
ci/prow/e2e-aws-operator-encryption-perf-serial-ote-2of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-perf-serial-ote-2of2`
ci/prow/e2e-aws-operator-encryption-kms-serial-ote-1of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-kms-serial-ote-1of2`
ci/prow/e2e-aws-operator-encryption-serial-ote-1of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-serial-ote-1of2`
ci/prow/e2e-aws-operator-encryption-rotation-serial-ote-2of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-rotation-serial-ote-2of2`
ci/prow/e2e-aws-operator-encryption-serial-ote-2of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-serial-ote-2of2`
ci/prow/e2e-aws-operator-serial-ote	`306ee64`	link	false	`/test e2e-aws-operator-serial-ote`
ci/prow/e2e-aws-operator-encryption-kms-serial-ote-2of2	`306ee64`	link	false	`/test e2e-aws-operator-encryption-kms-serial-ote-2of2`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2026

tchap force-pushed the EndpointUnavailable-retry branch 3 times, most recently from 089d5bc to a1f718e Compare March 24, 2026 11:45

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 24, 2026

tchap force-pushed the EndpointUnavailable-retry branch 12 times, most recently from ed6281c to 795cde0 Compare March 24, 2026 13:22

tchap marked this pull request as ready for review March 24, 2026 13:24

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2026

openshift-ci bot requested review from ibihim and liouk March 24, 2026 13:26

coderabbitai bot reviewed Mar 24, 2026

View reviewed changes

pkg/libs/endpointaccessible/endpoint_accessible_controller.go Show resolved Hide resolved

pkg/libs/endpointaccessible/endpoint_accessible_controller.go Show resolved Hide resolved

tchap force-pushed the EndpointUnavailable-retry branch from 795cde0 to 37b3d01 Compare March 24, 2026 23:22

coderabbitai bot reviewed Mar 24, 2026

View reviewed changes

pkg/libs/endpointaccessible/endpoint_accessible_controller.go Show resolved Hide resolved

tchap force-pushed the EndpointUnavailable-retry branch from 37b3d01 to ad9a720 Compare March 25, 2026 06:55

tchap commented Mar 25, 2026

View reviewed changes

liouk suggested changes Mar 25, 2026

View reviewed changes

tchap added 5 commits April 1, 2026 13:25

EndpointAccessibleController: Try the check 3 times

c01b81e

The reason for this change is to get rid of intermittent Available=false, even during upgrades. This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.

Don't back off when there was a request timeout

a90465b

Also set timeout to 5s and backoff to 2s.

Fix field docs

fd9b2d6

Do not retry on endpointListFn error

d531184

Use http.DefaultClient in tests.

ff37cf7

tchap force-pushed the EndpointUnavailable-retry branch from ad9a720 to ff37cf7 Compare April 1, 2026 12:16

Fix import block

306ee64

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2026

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

Conversation

tchap commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Mar 24, 2026

Uh oh!

openshift-ci-robot commented Mar 24, 2026

Uh oh!

coderabbitai bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

tchap commented Mar 24, 2026

Uh oh!

openshift-ci-robot commented Mar 24, 2026

Uh oh!

tchap commented Mar 24, 2026

Uh oh!

coderabbitai bot commented Mar 24, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

openshift-ci bot commented Mar 24, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

openshift-ci-robot commented Mar 25, 2026

Uh oh!

tchap Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

liouk Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

tchap Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

tchap commented Mar 25, 2026

Uh oh!

Uh oh!

liouk Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

tchap Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

tchap Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

liouk Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tchap commented Apr 1, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci bot commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tchap commented Mar 24, 2026 •

edited

Loading

coderabbitai bot commented Mar 24, 2026 •

edited

Loading