Skip to content

Conversation

@mociarain
Copy link
Collaborator

I observed several example where our E2E tests were failing due to this test. After a little pairing we came to the conclusion that a likely cause is the Update made to the cluster CR from the reconciler failing. We believe this fails because:

  • Reconciler starts =>the cluster CR is fetched
  • Subnets are fetched, parsed and decisions are made
  • At the end we attempt to Update be the CR has been updated and we get the classic Object Modified error (not confirmed)

This PR moves to a Patch rather than an update which should avoid this problem.

On a wider note the only reason I can see that we add the annotation to the CR is so it can be picked up by the monitor and written as a metric... This seems brittle and I'd be in favour of just writing to the in-cluster prometheus from the ARO-Operator directly assuming https://github.com/openshift-online/architecture/pull/57 is accepted @alcasim in particular.

mociarain and others added 6 commits January 29, 2026 09:35
The original code used r.instance which was fetched once at reconciliation
start. If another process modified the Cluster resource during the
reconciliation loop, the patch would fail with a conflict error (HTTP 409)
due to stale resourceVersion.

This change wraps the annotation update in retry.RetryOnConflict, which is
the standard Kubernetes pattern for optimistic concurrency control. Each
retry iteration fetches a fresh copy of the Cluster resource, ensuring we
have the latest resourceVersion before patching.

Trade-offs:
- Extra API call: Each retry does a Get before Patch, adding latency but
  ensuring correctness
- Retry limits: retry.DefaultRetry uses exponential backoff with ~5 retries;
  in extreme contention scenarios this could still fail
- No r.instance sync: The in-memory r.instance is not updated, which is fine
  since it's only used for reading during reconciliation and the annotation
  update is terminal

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants