Skip to content

fix: deflake //rs/tests/idx:ii_delegation_test#9305

Open
basvandijk wants to merge 10 commits intomasterfrom
ai/deflake-ii_delegation_test-2026-03-10
Open

fix: deflake //rs/tests/idx:ii_delegation_test#9305
basvandijk wants to merge 10 commits intomasterfrom
ai/deflake-ii_delegation_test-2026-03-10

Conversation

@basvandijk
Copy link
Collaborator

@basvandijk basvandijk commented Mar 10, 2026

//rs/tests/idx:ii_delegation_test flaked twice in the last week in the same way:

$ bazel run //ci/githubstats:query -- last --flaky //rs/tests/idx:ii_delegation_test --columns=last_started_at,errors
...
╒════╤═════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │ errors per attempt                                                                                                 │
╞════╪═════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Fri 2026-03-06 16:47:11 │ 1: test: Minimum expected counter value 1 on counter canister was not observed after 10 r...                       │
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
...
├────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ Wed 2026-03-04 16:43:08 │ 1: test: Minimum expected counter value 1 on counter canister was not observed after 10 r...                       │
╘════╧═════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛

Root Cause

The test was flaky because the update call to increment the counter canister used update() which fires the HTTP request but does NOT wait for the update to be processed. The test then immediately tried to read the counter value, but since the update hadn't been executed yet, the counter was still 0. The retry loop (10 retries, 1 second each) sometimes wasn't enough time for the update to be processed on the application subnet.

Fix

Use update_and_wait() instead of update() to properly wait for the update to be processed before checking the counter value. This also lets us directly verify the counter value from the delegated query response, removing the need for the separate retry-based assertion (assert_canister_counter_with_retries).

Note that the test already used update_and_wait() for the universal canister call later in the test (line 164), making this fix consistent with the existing pattern.

We also re-implement the retry-loop in update_and_wait() with ic_system_test_driver::retry_with_msg_async! such that it has proper logging and uses a backoff duration in case it fails.


This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

The test was flaky because the update call to increment the counter canister used
`update()` which fires the request but does NOT wait for it to be processed.
The test then immediately tried to read the counter value, but since the update
hadn't been executed yet, the counter was still 0. The retry loop (10 retries,
1 second each) sometimes wasn't enough for the update to be processed.

Fix: Use `update_and_wait()` instead of `update()` to wait for the update to
be processed before asserting the counter value. This also lets us directly
verify the counter value from the delegated query response, removing the need
for the separate retry-based assertion.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses flakiness in the ii_delegation_test system test by ensuring delegated update calls actually complete before the test asserts on the resulting counter state.

Changes:

  • Replaced the fire-and-forget delegated update() with update_and_wait() for the counter increment.
  • Removed the retry-based counter assertion helper and directly asserted the delegated query response value.
  • Cleaned up unused imports related to the removed retry logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address review comment: update_and_wait() polls up to polling_timeout,
which was set to 10s via UPDATE_POLLING_TIMEOUT. If the app-subnet update
occasionally takes >10s to execute, the test can still intermittently fail
with a polling timeout. Increase to 60s to reliably wait long enough.
@basvandijk basvandijk requested a review from Copilot March 10, 2026 23:36
@basvandijk basvandijk marked this pull request as ready for review March 10, 2026 23:38
@basvandijk basvandijk requested a review from a team as a code owner March 10, 2026 23:38
@github-actions github-actions bot added the @idx label Mar 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

@cgundy cgundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Contributor

@pierugo-dfinity pierugo-dfinity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving single line Consensus-owned change

Instead of only polling for the reply leaf, first check
request_status/<id>/status to detect terminal states early.
Returns immediately with a descriptive error on 'rejected' (including
reject code and message) or 'done', and only reads the reply leaf
when the status is 'replied'.
The 'rejected' and 'done' states are terminal — retrying them via the
consistent with other terminal error handling in update_and_wait.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

for terminal states (rejected, done, malformed reply). Uses a nested
Result pattern (anyhow::Result<Result<Vec<u8>, String>>) so that:
- Ok(Err(...)) stops retrying immediately for terminal failures
- Ok(Ok(...)) returns success

Error messages now include canister_id, method_name, and request_id
for better debuggability. This makes the function composable so
callers can handle errors instead of unwinding on panic.
@basvandijk basvandijk requested a review from Copilot March 11, 2026 11:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…sync closure

Inside the retry_with_msg_async! polling closure in update_and_wait,
several .unwrap()/.expect() calls could panic and bypass the retry
logic entirely. Convert them to ? with anyhow::Context so failures
are logged and retried instead of crashing the test process.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ling

The read_state status 'pruned' is a terminal state in the IC spec, but
it was falling into the default retry arm, causing pointless retries
until polling_timeout. Treat it as terminal (like 'done') and return an
immediate error.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ry loops

Change send_http_request() to return anyhow::Result<Response> instead of
panicking on transport failures. This ensures that transient errors inside
retry_with_msg_async! are retried cleanly rather than bypassing the retry
loop via a panic.

Update the signatures of update() and query() accordingly, and adjust all
call sites to handle the new Result return type.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +191 to +194
let _ = self
.send_http_request("call", canister_id, body)
.await
.context("failed to send update call")?;
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update() ignores the HTTP response status from the call endpoint. If the replica returns a non-success status (e.g., 4xx/5xx), the code will still return a request_id and then update_and_wait may poll until timeout with a misleading error. Consider failing fast by checking the HTTP status (e.g., via error_for_status or an explicit status check) before returning Ok(request_id).

Suggested change
let _ = self
.send_http_request("call", canister_id, body)
.await
.context("failed to send update call")?;
let response = self
.send_http_request("call", canister_id, body)
.await
.context("failed to send update call")?;
if !response.status().is_success() {
bail!("update call failed with HTTP status: {}", response.status());
}

Copilot uses AI. Check for mistakes.
Comment on lines 187 to +191
sender_pubkey: Some(Blob(self.pubkey.clone().into_vec())),
sender_sig: Some(Blob(signature.signature.unwrap())),
};
let body = serde_cbor::ser::to_vec(&envelope).unwrap();
let _ = self.send_http_request("call", canister_id, body).await;
request_id
let _ = self
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that update() returns anyhow::Result, serde_cbor::ser::to_vec(&envelope).unwrap() can still panic and bypass the structured error handling you added. Prefer propagating the serialization failure as an error with context instead of unwrapping.

Copilot uses AI. Check for mistakes.
@@ -213,9 +225,15 @@
sender_sig: Some(Blob(signature.signature.unwrap())),
};
let body = serde_cbor::ser::to_vec(&envelope).unwrap();
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

query() returns anyhow::Result, but serde_cbor::ser::to_vec(&envelope).unwrap() can still panic and bypass the new error handling path. Prefer returning a contextual error instead of unwrapping.

Suggested change
let body = serde_cbor::ser::to_vec(&envelope).unwrap();
let body = serde_cbor::ser::to_vec(&envelope)
.context("failed to serialize query envelope")?;

Copilot uses AI. Check for mistakes.
.send()
.await
.unwrap()
.context(format!("failed to send {method} request to {canister_id}"))
Copy link

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send_http_request builds the context string eagerly via context(format!(...)), which allocates even on the success path and is called frequently during update_and_wait polling. Prefer with_context(|| format!(...)) (or equivalent) so the formatting only happens when there is an error.

Suggested change
.context(format!("failed to send {method} request to {canister_id}"))
.with_context(|| format!("failed to send {method} request to {canister_id}"))

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants