
More test failures with latest release#653

Open
annismckenzie wants to merge 2 commits into reconcilerio:main from alfatraining:0-expected-result-test-failures-with-latest-release

Conversation

@annismckenzie
Contributor

After upgrading to the latest release, more test failures surfaced in our test suites. I tracked those down to changes between v0.23.0 and v0.24.0.

I'll add some comments in the code as well.

Signed-off-by: Daniel Lohse <info@asapdesign.de>
…status

Signed-off-by: Daniel Lohse <info@asapdesign.de>
result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
Contributor Author

This should be okay to add.

Contributor Author

Added a use case below and I really think this is worth adding.

@codecov

codecov bot commented Sep 27, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.08%. Comparing base (ec87995) to head (662d869).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
reconcilers/resource.go 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #653      +/-   ##
==========================================
+ Coverage   57.28%   60.08%   +2.79%     
==========================================
  Files          39       39              
  Lines        4507     4587      +80     
==========================================
+ Hits         2582     2756     +174     
+ Misses       1811     1723      -88     
+ Partials      114      108       -6     


@scothis
Member

scothis commented Sep 27, 2025

@annismckenzie thanks for raising this.

I'm willing to be convinced otherwise, but my inclination is that the current behavior is correct. The newly failing tests in your project may be enforcing undesirable behavior. If you're willing/able to share an example we can work through it.

For the workqueue managing requests on the reconciler, returning an error or Requeue=true are equivalent. Both will cause that request to be reprocessed "immediately" (after an increasing backoff). The intent behind quiet errors is to treat them like an error from the lifecycle's perspective, but with less noise in the logs.

RequeueAfter introduces an independent delay, which is ignored if an error is returned. So returning a result with RequeueAfter for a quiet error changes the runtime semantics.

After a status update (or any mutation of a resource) the copy of that resource in the informer cache is stale. Reprocessing the same resource will cause the same mutation to be attempted and fail server side, wasting resources and creating noise in the logs. Waiting for the informer to receive the updated resource will immediately queue that resource for processing. So even if the workqueue had a scheduled requeue for that resource, it would be preempted by the observed update. By propagating RequeueAfter after we made a mutation, no meaningful work can result: it will either reprocess the stale resource, or be preempted in the queue by the refreshed resource.

@annismckenzie
Contributor Author

I've been thinking about this for some time, even while trying to fix the test failures, and I agree with your assessment. Can we keep this open for a few days so I can take a look at the test suite again and rework the assertions? Then I'll close it.

@annismckenzie
Contributor Author

Fixed our test suite – closing.

@annismckenzie
Contributor Author

@scothis: sorry, need to ask one more question. Can you take a look at the following sub reconciler?

const waitDuration = time.Minute * 3

// WaitForUnhealthyHealthCheckSubReconciler is a sub reconciler that waits for the health check created by the previous sub reconciler to become unhealthy.
// To do this, it first waits for the appropriate amount of time before making sure via the Route 53 API. It will abort and requeue in 1 minute if these
// prerequisites haven't been met yet.
func WaitForUnhealthyHealthCheckSubReconciler(clusterResourceStasher ClusterResourceStash) reconcilers.SubReconciler[*operatorsv1alpha1.DownstreamRecord] {
  return &reconcilers.SyncReconciler[*operatorsv1alpha1.DownstreamRecord]{
    SyncWithResult: func(ctx context.Context, resource *operatorsv1alpha1.DownstreamRecord) (reconcilers.Result, error) {
      cluster := clusterResourceStasher.RetrieveOrDie(ctx)
      cm := resource.Status.ConditionManager(ctx)

      conditionHealthCheckCreated := cm.GetCondition(operatorsv1alpha1.ConditionHealthCheckCreated)
      waitTime := conditionHealthCheckCreated.LastTransitionTime.Add(waitDuration)
      now := rtime.RetrieveNow(ctx).UTC()
      if apis.ConditionIsTrue(conditionHealthCheckCreated) && now.After(waitTime) { // we waited long enough, i.e. the current time has reached the wait time
        // in order to make sure that the health check is unhealthy we have to call the Route 53 API
        …
      }

      log.FromContext(ctx).Info("waiting for health check aggregate state to be unhealthy",
        "check", fmt.Sprintf("'%s' < '%s' < '%s' (created/now/waiting until)", conditionHealthCheckCreated.LastTransitionTime, now, waitTime))
      return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute
    },
  }
}

I added a separate unit test for this but it's failing:

=== RUN   TestDownstreamRecordReconciler
=== RUN   TestDownstreamRecordReconciler/requeues_when_the_timeout_hasn't_been_reached_yet
    /Users/annismckenzie/git/go/src/git.alfa.sx/alfatraining/ops/services/cluster-operator-new/internal/controller/operators/downstreamrecord.go:272: DownstreamRecordResourceReconciler/WithFinalizer/4/: "level"=0 "msg"="waiting for health check aggregate state to be unhealthy" "resourceType"="operators.alfaview.com/v1alpha1, Kind=DownstreamRecord" "check"="'2025-09-29 13:53:34 +0000 UTC' < '2025-09-29 13:54:34.427286 +0000 UTC' < '2025-09-29 13:56:34 +0000 UTC' (created/now/waiting until)"
    /Users/annismckenzie/git/go/src/git.alfa.sx/alfatraining/ops/services/cluster-operator-new/internal/controller/operators/downstreamrecord_controller_test.go:337: ExpectedResult differs (-expected, +actual):   reconcile.Result{
        - 	Requeue:      false,
        + 	Requeue:      true,
        - 	RequeueAfter: s"1m0s",
        + 	RequeueAfter: s"0s",
          }

I hope the code is readable; I removed everything that wasn't relevant. Here's a quick rundown of what I'm trying to achieve:

  1. We're creating a health check in a previous sub reconciler. In order to continue we need to query an API for its status but that only makes sense after at least 3 minutes have passed.
  2. The creation time of that health check is tracked in a condition.
  3. If the 3 minutes haven't been reached yet, we just return with:
return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute

I did make sure to check that no status updates or the like are triggered in the unit test which would cause another requeue – for reference, here's the test:

"requeues when the timeout hasn't been reached yet": {
	Request: testRequest,
	GivenObjects: []client.Object{
		…
	},
	Metadata: map[string]interface{}{
		…
	},
	StatusSubResourceTypes: []client.Object{&operatorsv1alpha1.DownstreamRecord{}},
	ExpectTracks: []rtesting.TrackRequest{
		…
	},
	ExpectedResult: reconcile.Result{
		RequeueAfter: time.Minute,
	},
},

As it is right now, this would requeue immediately and then be throttled eventually, while incurring a lot of useless loops. I thought about using Advice but that won't help here – I mean, I could use time.Sleep(time.Second*60) in the Around case but I really don't want to do that. I need to somehow stop the rest of the subreconcilers in the Sequence which is why I'm using ErrHaltSubReconcilers. That also wraps ErrQuiet. This would be fixed by the clause I added in resource.go:

result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
		return result, nil
	}
	return Result{Requeue: true}, nil
}
return result, err

@annismckenzie reopened this Sep 29, 2025
@scothis
Member

scothis commented Sep 29, 2025

@annismckenzie that's a good example that breaks the current model. I assume there is no effective way to watch the remote API for changes like you would a k8s resource.

Minimally, wrapping the check and subsequent reconcilers in an IfThen should unblock you. Let me think a bit if there is a more elegant way to approach this.

@annismckenzie
Contributor Author

No, it's AWS Health Checks via Route 53. No need to rush, I'm not blocked. This all started on Friday when I saw the MR opened by Renovate to upgrade the library and I thought "easy Friday afternoon thing to do, hold my 🍺". 😄
