OCPBUGS-20056: EndpointAccessibleController: Retry the whole list endpoints and send request loop 3 times#855
Conversation
|
Skipping CI for Draft Pull Request. |
|
@tchap: This pull request references Jira Issue OCPBUGS-20056, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughController sync now retries full endpoint-list fetch plus parallel HTTP reachability checks up to configurable attempts with per-request timeouts, retry intervals, and an injectable HTTP client; resync jitter base made configurable. Tests use in-process httptest servers and add retry, stale-endpoint, and request-timeout coverage. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
089d5bc to
a1f718e
Compare
|
/jira refresh |
|
@tchap: This pull request references Jira Issue OCPBUGS-20056, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
ed6281c to
795cde0
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 142-153: The code currently returns the raw error when
c.endpointListFn() fails for non-NotFound cases, which bypasses the whole-cycle
retry/status update path; change the non-NotFound branch to mirror the NotFound
handling: build an OperatorStatus for c.availableConditionName with
ConditionFalse, set a Reason like "ListError" and the error text via
err.Error(), then call c.operatorClient.ApplyOperatorStatus(ctx,
c.controllerInstanceName, status) and return that result instead of returning
err directly so transient lister/cache failures update operator status and enter
the same retry path.
- Around line 112-119: The production path uses buildTLSClient() which
hard-codes Timeout: 5 * time.Second and thus overrides the computed
requestTimeout (default 10s) so client-level timeout wins and breaks
configurable retry behavior; fix by either removing the fixed Timeout in
buildTLSClient() to rely on context deadlines or change the call site (where
client is nil) to pass the computed requestTimeout into
buildTLSClient(requestTimeout) and apply that value to the client's Timeout;
update the buildTLSClient function signature and any callers to accept and use
requestTimeout (or remove the Timeout field entirely) so the requestTimeout
variable takes effect for production calls that create the http client.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: aabf9163-8091-402f-895c-249924e7eb8c
📒 Files selected for processing (2)
pkg/libs/endpointaccessible/endpoint_accessible_controller.gopkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
795cde0 to
37b3d01
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tchap The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 145-148: The select that waits between retries currently ignores
the ctx.Done() branch and continues, so when ctx is canceled the controller
still runs another endpointListFn() and probe; update the select in the retry
loop inside the sync logic to exit the sync when ctx.Done() is selected (e.g.,
return from the enclosing function or break out of the retry loop) so that no
further endpointListFn() calls or probe passes occur after cancellation and you
don't immediately report EndpointUnavailable during shutdown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a7514ce8-5dab-4adb-a476-6002444e71c1
📒 Files selected for processing (2)
pkg/libs/endpointaccessible/endpoint_accessible_controller.gopkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
37b3d01 to
ad9a720
Compare
|
@tchap: This pull request references Jira Issue OCPBUGS-20056, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
|
||
| defaultRequestTimeout = 10 * time.Second | ||
| defaultRetryInterval = 5 * time.Second | ||
| defaultAttemptCount = 3 |
There was a problem hiding this comment.
Not 100% sure about these values. It effectively means there is an inertial of 40 seconds before an actual Available=false issue surfaces. But there is always a trade-off.
There was a problem hiding this comment.
For a health check endpoint 10s request timeout feels a bit too long; the previous 5s is already generous enough in my view.
Regarding the retry interval, I don't have enough knowledge to reason with it -- the goal is to wait for the endpoints to be updated with fresh pod IPs after a rolling upgrade, and I do not know what constitutes enough time for this to happen. If you've tested this and are happy with it, that's good enough for me 👍
There was a problem hiding this comment.
I am changing this to 5s for timeout and 2s for retry, also skipping the backoff when there is a request timeout, so the wait interval between attempts is not prolonged for no reason.
|
I can see now that there is also openshift/library-go#2057 now, which handles I think that this PR makes sense on its own, though. |
| if err != nil { | ||
| errCh <- humanizeError(err) | ||
| return | ||
| for i := range attempts { |
There was a problem hiding this comment.
Would it make more sense to take advantage of the wait module and use something like wait.ExponentialBackoffWithContext for the retry-with-backoff loop?
There was a problem hiding this comment.
It might have been possible before, but I changed the code not to back off when there is a timeout, so I guess I will leave this as it is.
There was a problem hiding this comment.
Actually I can use the same mechanism as in openshift/library-go#2148 , once the PR is merged and the helper is available in library-go.
|
|
||
| defaultRequestTimeout = 10 * time.Second | ||
| defaultRetryInterval = 5 * time.Second | ||
| defaultAttemptCount = 3 |
There was a problem hiding this comment.
For a health check endpoint 10s request timeout feels a bit too long; the previous 5s is already generous enough in my view.
Regarding the retry interval, I don't have enough knowledge to reason with it -- the goal is to wait for the endpoints to be updated with fresh pod IPs after a rolling upgrade, and I do not know what constitutes enough time for this to happen. If you've tested this and are happy with it, that's good enough for me 👍
pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
Outdated
Show resolved
Hide resolved
The reason for this change is to get rid of intermittent Available=false, even during upgrades. This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.
Also set timeout to 5s and backoff to 2s.
ad9a720 to
ff37cf7
Compare
|
/hold to squash before merging. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
pkg/libs/endpointaccessible/endpoint_accessible_controller.go (1)
156-168:⚠️ Potential issue | 🟠 MajorNon-
NotFoundlist errors still bypass the whole-cycle retryLine 168 returns immediately for non-
NotFoundendpointListFn()errors, so transient lister/cache failures skip the new retry loop entirely.Suggested fix
endpoints, err = c.endpointListFn() if err != nil { if apierrors.IsNotFound(err) { status := applyoperatorv1.OperatorStatus(). WithConditions(applyoperatorv1.OperatorCondition(). WithType(c.availableConditionName). WithStatus(operatorv1.ConditionFalse). WithReason("ResourceNotFound"). WithMessage(err.Error())) return c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status) } - // Do not retry on endpoint list error since these are not transient. - return err + // Retry non-NotFound list errors as part of the full cycle. + if i == attempts-1 { + return err + } + continue }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go` around lines 156 - 168, The code returns immediately on non-NotFound errors from endpointListFn(), which prevents the outer whole-cycle retry from handling transient lister/cache failures; update the branch handling errors from c.endpointListFn() so that for non-NotFound errors you also set an OperatorStatus (similar to the ResourceNotFound branch) using c.availableConditionName and include the error message/reason (e.g., "EndpointListError" or "ListFailed"), call c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status) to record the failure, and then return the original error so the outer retry loop can observe and retry; locate the logic around endpointListFn(), availableConditionName, controllerInstanceName and operatorClient.ApplyOperatorStatus to make the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/libs/endpointaccessible/endpoint_accessible_controller.go`:
- Around line 156-168: The code returns immediately on non-NotFound errors from
endpointListFn(), which prevents the outer whole-cycle retry from handling
transient lister/cache failures; update the branch handling errors from
c.endpointListFn() so that for non-NotFound errors you also set an
OperatorStatus (similar to the ResourceNotFound branch) using
c.availableConditionName and include the error message/reason (e.g.,
"EndpointListError" or "ListFailed"), call
c.operatorClient.ApplyOperatorStatus(ctx, c.controllerInstanceName, status) to
record the failure, and then return the original error so the outer retry loop
can observe and retry; locate the logic around endpointListFn(),
availableConditionName, controllerInstanceName and
operatorClient.ApplyOperatorStatus to make the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 298a9e91-5242-49c4-8f66-9b286fd467a2
📒 Files selected for processing (2)
pkg/libs/endpointaccessible/endpoint_accessible_controller.gopkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/libs/endpointaccessible/endpoint_accessible_controller_test.go
|
@tchap: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The reason for this change is to get rid of intermittent
Available=false, possibly even during upgrades.This is achieved by retrying the whole fetch endpoints and send requests loop 3 times, including a short sleep between each attempt.
Mostly trying to improve CI, though. A failing job example.