
More test failures with latest release#653

Open
annismckenzie wants to merge 2 commits into reconcilerio:main from alfatraining:0-expected-result-test-failures-with-latest-release

Conversation

@annismckenzie
Contributor

After upgrading to the latest release, more test failures surfaced in our test suites. I tracked those down to changes between v0.23.0 and v0.24.0.

I'll add some comments in the code as well.

Signed-off-by: Daniel Lohse <info@asapdesign.de>
…status

Signed-off-by: Daniel Lohse <info@asapdesign.de>
result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
Contributor Author

This should be okay to add.

Contributor Author

Added a use case below and I really think this is worth adding.

@codecov

codecov bot commented Sep 27, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 60.08%. Comparing base (ec87995) to head (662d869).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
reconcilers/resource.go 83.33% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #653      +/-   ##
==========================================
+ Coverage   57.28%   60.08%   +2.79%     
==========================================
  Files          39       39              
  Lines        4507     4587      +80     
==========================================
+ Hits         2582     2756     +174     
+ Misses       1811     1723      -88     
+ Partials      114      108       -6     


@scothis
Member

scothis commented Sep 27, 2025

@annismckenzie thanks for raising this.

I'm willing to be convinced otherwise, but my inclination is that the current behavior is correct. The newly failing tests in your project may be enforcing undesirable behavior. If you're willing/able to share an example we can work through it.

For the workqueue managing requests on the reconciler, returning an error or Requeue=true are equivalent. Both will cause that request to be reprocessed "immediately" (after an increasing backoff). The intent behind quiet errors is to treat them like an error from the lifecycle's perspective, but with less noise in the logs.

RequeueAfter introduces an independent delay, which is ignored if an error is returned. So returning a result with RequeueAfter for a quiet error changes the runtime semantics.

After a status update (or any mutation of a resource) the copy of that resource in the informer cache is stale. Reprocessing the same resource will cause the same mutation to be attempted and fail server side, wasting resources and creating noise in the logs. Waiting for the informer to receive the updated resource will immediately queue that resource for processing. So even if the workqueue had a scheduled requeue for that resource, it would be preempted by the observed update. By propagating RequeueAfter after we made a mutation, no meaningful work can result: it will either reprocess the stale resource, or be preempted in the queue by the refreshed resource.

@annismckenzie
Contributor Author

I've been thinking about this for some time, even while trying to fix the test failures, and I agree with your assessment. Can we keep this open for a few days so I can take a look at the test suite again and rework the assertions? Then I'll close it.

@annismckenzie
Contributor Author

Fixed our test suite – closing.

@annismckenzie
Contributor Author

@scothis: sorry, need to ask one more question. Can you take a look at the following sub reconciler?

const waitDuration = time.Minute * 3

// WaitForUnhealthyHealthCheckSubReconciler is a sub reconciler that waits for the health check created by the previous sub reconciler to become unhealthy.
// To do this, it first waits for the appropriate amount of time before making sure via the Route 53 API. It will abort and requeue in 1 minute if these
// prerequisites haven't been met yet.
func WaitForUnhealthyHealthCheckSubReconciler(clusterResourceStasher ClusterResourceStash) reconcilers.SubReconciler[*operatorsv1alpha1.DownstreamRecord] {
  return &reconcilers.SyncReconciler[*operatorsv1alpha1.DownstreamRecord]{
    SyncWithResult: func(ctx context.Context, resource *operatorsv1alpha1.DownstreamRecord) (reconcilers.Result, error) {
      cluster := clusterResourceStasher.RetrieveOrDie(ctx)
      cm := resource.Status.ConditionManager(ctx)

      conditionHealthCheckCreated := cm.GetCondition(operatorsv1alpha1.ConditionHealthCheckCreated)
      waitTime := conditionHealthCheckCreated.LastTransitionTime.Add(waitDuration)
      now := rtime.RetrieveNow(ctx).UTC()
      if apis.ConditionIsTrue(conditionHealthCheckCreated) && now.After(waitTime) { // we waited long enough, i.e. the current time has reached the wait time
        // in order to make sure that the health check is unhealthy we have to call the Route 53 API
        …
      }

      log.FromContext(ctx).Info("waiting for health check aggregate state to be unhealthy",
        "check", fmt.Sprintf("'%s' < '%s' < '%s' (created/now/waiting until)", conditionHealthCheckCreated.LastTransitionTime, now, waitTime))
      return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute
    },
  }
}

I added a separate unit test for this but it's failing:

=== RUN   TestDownstreamRecordReconciler
=== RUN   TestDownstreamRecordReconciler/requeues_when_the_timeout_hasn't_been_reached_yet
    /Users/annismckenzie/git/go/src/git.alfa.sx/alfatraining/ops/services/cluster-operator-new/internal/controller/operators/downstreamrecord.go:272: DownstreamRecordResourceReconciler/WithFinalizer/4/: "level"=0 "msg"="waiting for health check aggregate state to be unhealthy" "resourceType"="operators.alfaview.com/v1alpha1, Kind=DownstreamRecord" "check"="'2025-09-29 13:53:34 +0000 UTC' < '2025-09-29 13:54:34.427286 +0000 UTC' < '2025-09-29 13:56:34 +0000 UTC' (created/now/waiting until)"
    /Users/annismckenzie/git/go/src/git.alfa.sx/alfatraining/ops/services/cluster-operator-new/internal/controller/operators/downstreamrecord_controller_test.go:337: ExpectedResult differs (-expected, +actual):   reconcile.Result{
        - 	Requeue:      false,
        + 	Requeue:      true,
        - 	RequeueAfter: s"1m0s",
        + 	RequeueAfter: s"0s",
          }

I hope the code is readable; I removed everything that wasn't relevant. Here's a quick rundown of what I'm trying to achieve:

  1. We're creating a health check in a previous sub reconciler. In order to continue we need to query an API for its status but that only makes sense after at least 3 minutes have passed.
  2. The creation time of that health check is tracked in a condition.
  3. If the 3 minutes haven't been reached yet, we just return with:
return reconcilers.Result{RequeueAfter: time.Minute}, reconcilers.ErrHaltSubReconcilers // stop chain of sub reconcilers and requeue in a minute

I did make sure to check that no status updates or the like are triggered in the unit test which would cause another requeue – for reference, here's the test:

"requeues when the timeout hasn't been reached yet": {
	Request: testRequest,
	GivenObjects: []client.Object{
		…
	},
	Metadata: map[string]interface{}{
		…
	},
	StatusSubResourceTypes: []client.Object{&operatorsv1alpha1.DownstreamRecord{}},
	ExpectTracks: []rtesting.TrackRequest{
		…
	},
	ExpectedResult: reconcile.Result{
		RequeueAfter: time.Minute,
	},
},

As it is right now, this would requeue immediately and then be throttled eventually, while incurring a lot of useless loops. I thought about using Advice but that won't help here – I mean, I could use time.Sleep(time.Second*60) in the Around case but I really don't want to do that. I need to somehow stop the rest of the subreconcilers in the Sequence which is why I'm using ErrHaltSubReconcilers. That also wraps ErrQuiet. This would be fixed by the clause I added in resource.go:

result, err := r.AfterReconcile(ctx, req, AggregateResults(beforeResult, reconcileResult), err)
if errors.Is(err, ErrQuiet) {
	// suppress error, while forcing a requeue
	if result.RequeueAfter > 0 { // honor requeue after returned by reconciler
		return result, nil
	}
	return Result{Requeue: true}, nil
}
return result, err

@annismckenzie reopened this Sep 29, 2025
@scothis
Member

scothis commented Sep 29, 2025

@annismckenzie that's a good example that breaks the current model. I assume there is no effective way to watch the remote API for changes like you would a k8s resource.

Minimally, wrapping the check and subsequent reconcilers in an IfThen should unblock you. Let me think a bit if there is a more elegant way to approach this.

@annismckenzie
Contributor Author

No, it's AWS Health Checks via Route 53. No need to rush, I'm not blocked. This all started on Friday when I saw the MR opened by Renovate to upgrade the library and I thought "easy Friday afternoon thing to do, hold my 🍺". 😄
