More test failures with latest release#653
annismckenzie wants to merge 2 commits into reconcilerio:main from
Conversation
Signed-off-by: Daniel Lohse <info@asapdesign.de>
…status Signed-off-by: Daniel Lohse <info@asapdesign.de>
```go
result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
```
This should be okay to add.
Added a use case below and I really think this is worth adding.
Codecov Report ❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main     #653      +/-   ##
==========================================
+ Coverage   57.28%   60.08%   +2.79%
==========================================
  Files          39       39
  Lines        4507     4587      +80
==========================================
+ Hits         2582     2756     +174
+ Misses       1811     1723      -88
+ Partials      114      108       -6
```
@annismckenzie thanks for raising this. I'm willing to be convinced otherwise, but my inclination is that the current behavior is correct. The newly failing tests in your project may be enforcing undesirable behavior. If you're willing/able to share an example, we can work through it.

For the workqueue managing requests on the reconciler, returning an error or `Requeue=true` are equivalent: both will cause that request to be reprocessed "immediately" (after an increasing backoff). The intent behind quiet errors is to treat them like an error from the lifecycle's perspective, but with less noise in the logs. `RequeueAfter` introduces an independent delay, which is ignored if an error is returned. So returning a result with `RequeueAfter` for a quiet error changes the runtime semantics.

After a status update (or any mutation of a resource), the copy of that resource in the informer cache is stale. Reprocessing the same resource will cause the same mutation to be attempted and fail server side, wasting resources and creating noise in the logs. Waiting for the informer to receive the updated resource will immediately queue that resource for processing. So even if the workqueue had a scheduled requeue for that resource, it would be preempted by the observed update.

By propagating `RequeueAfter` when we made a mutation, no meaningful work can result. It will either reprocess the stale resource, or be preempted in the queue by the refreshed resource.
I've been thinking about this for some time, even while trying to fix the test failures, and I agree with your assessment. Can we keep this open for a few days so I can take a look at the test suite again and rework the assertions? Then I'll close it.
Fixed our test suite – closing.
@scothis: sorry, need to ask one more question. Can you take a look at the following sub reconciler?

```go
const waitDuration = time.Minute * 3

// WaitForUnhealthyHealthCheckSubReconciler is a sub reconciler that waits for the health check created by the
// previous sub reconciler to become unhealthy. To do this, it first waits for the appropriate amount of time
// before making sure via the Route 53 API. It will abort and requeue in 1 minute if these prerequisites
// haven't been met yet.
func WaitForUnhealthyHealthCheckSubReconciler(clusterResourceStasher ClusterResourceStash) reconcilers.SubReconciler[*operatorsv1alpha1.DownstreamRecord] {
	return &reconcilers.SyncReconciler[*operatorsv1alpha1.DownstreamRecord]{
		SyncWithResult: func(ctx context.Context, resource *operatorsv1alpha1.DownstreamRecord) (reconcilers.Result, error) {
			cluster := clusterResourceStasher.RetrieveOrDie(ctx)
			cm := resource.Status.ConditionManager(ctx)
			conditionHealthCheckCreated := cm.GetCondition(operatorsv1alpha1.ConditionHealthCheckCreated)
			waitTime := conditionHealthCheckCreated.LastTransitionTime.Add(waitDuration)
			now := rtime.RetrieveNow(ctx).UTC()
			if apis.ConditionIsTrue(conditionHealthCheckCreated) && now.After(waitTime) { // we waited long enough, i.e. the current time has reached the wait time
				// in order to make sure that the health check is unhealthy we have to call the Route 53 API
				…
			}
			log.FromContext(ctx).Info("waiting for health check aggregate state to be unhealthy",
				"check", fmt.Sprintf("'%s' < '%s' < '%s' (created/now/waiting until)", conditionHealthCheckCreated.LastTransitionTime, now, waitTime))
			return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute
		},
	}
}
```

I added a separate unit test for this but it's failing. I hope the code is readable; I removed everything that wasn't relevant. Here's a quick rundown of what I'm trying to achieve:
```go
return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute
```

I did make sure to check that no status updates or the like are triggered in the unit test, which would cause another requeue – for reference, here's the test:

```go
"requeues when the timeout hasn't been reached yet": {
	Request: testRequest,
	GivenObjects: []client.Object{
		…
	},
	Metadata: map[string]interface{}{
		…
	},
	StatusSubResourceTypes: []client.Object{&operatorsv1alpha1.DownstreamRecord{}},
	ExpectTracks: []rtesting.TrackRequest{
		…
	},
	ExpectedResult: reconcile.Result{
		RequeueAfter: time.Minute,
	},
},
```

As it is right now, this would requeue immediately and then be throttled eventually, while incurring a lot of useless loops. I thought about using:

```go
result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
		return result, nil
	}
	return Result{Requeue: true}, nil
}
return result, err
```
@annismckenzie that's a good example that breaks the current model. I assume there is no effective way to watch the remote API for changes like you would a k8s resource. Minimally, wrapping the check and subsequent reconcilers in an …
No, it's AWS Health Checks via Route 53. No need to rush, I'm not blocked. This all started on Friday when I saw the MR opened by Renovate to upgrade the library and I thought "easy Friday afternoon thing to do, hold my 🍺". 😄
After upgrading to the latest release, more test failures surfaced in our test suites. I tracked those down to changes between v0.23.0 and v0.24.0.
I'll add some comments in the code as well.