fix: deflake //rs/tests/idx:ii_delegation_test #9305

basvandijk wants to merge 10 commits into master from
Conversation
The test was flaky because the update call to increment the counter canister used `update()` which fires the request but does NOT wait for it to be processed. The test then immediately tried to read the counter value, but since the update hadn't been executed yet, the counter was still 0. The retry loop (10 retries, 1 second each) sometimes wasn't enough for the update to be processed. Fix: Use `update_and_wait()` instead of `update()` to wait for the update to be processed before asserting the counter value. This also lets us directly verify the counter value from the delegated query response, removing the need for the separate retry-based assertion.
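The race behind the flake can be sketched in isolation with std-only stand-ins (`fire_and_forget_increment` is hypothetical; it only models the difference between the agent's `update()`, which returns before the state change lands, and `update_and_wait()`, which blocks until it does):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for the counter canister's state.
fn fire_and_forget_increment(counter: Arc<AtomicU64>) -> thread::JoinHandle<()> {
    // Returns immediately; the increment lands "later", like `update()`.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulated execution delay
        counter.fetch_add(1, Ordering::SeqCst);
    })
}

fn main() {
    let counter = Arc::new(AtomicU64::new(0));

    // Flaky pattern: read immediately after firing the update.
    let handle = fire_and_forget_increment(counter.clone());
    let racy_read = counter.load(Ordering::SeqCst); // may still be 0

    // Fixed pattern: wait for the update to complete, like `update_and_wait()`.
    handle.join().unwrap();
    let safe_read = counter.load(Ordering::SeqCst);

    println!("racy read: {racy_read}, read after waiting: {safe_read}");
    assert_eq!(safe_read, 1);
}
```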
Pull request overview
This PR addresses flakiness in the ii_delegation_test system test by ensuring delegated update calls actually complete before the test asserts on the resulting counter state.
Changes:
- Replaced the fire-and-forget delegated `update()` with `update_and_wait()` for the counter increment.
- Removed the retry-based counter assertion helper and directly asserted the delegated query response value.
- Cleaned up unused imports related to the removed retry logic.
Address review comment: update_and_wait() polls up to polling_timeout, which was set to 10s via UPDATE_POLLING_TIMEOUT. If the app-subnet update occasionally takes >10s to execute, the test can still intermittently fail with a polling timeout. Increase to 60s to reliably wait long enough.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
… with retry_with_msg_async!
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
pierugo-dfinity left a comment
Approving single-line Consensus-owned change
Instead of only polling for the reply leaf, first check request_status/<id>/status to detect terminal states early. Returns immediately with a descriptive error on 'rejected' (including reject code and message) or 'done', and only reads the reply leaf when the status is 'replied'.
The 'rejected' and 'done' states are terminal; retrying them is pointless, so returning an immediate error is consistent with the other terminal error handling in update_and_wait.
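The terminal-state decision can be sketched as a minimal classifier, assuming the status strings of the IC HTTP interface (`replied`, `rejected`, `done`, with `received`/`processing` as non-terminal); the names `Next` and `next_step` are hypothetical, and the real logic lives inside `update_and_wait`:

```rust
/// Hypothetical sketch of the polling decision in `update_and_wait`:
/// statuses come from `request_status/<id>/status` in the state tree.
#[derive(Debug, PartialEq)]
enum Next {
    ReadReply,    // "replied": safe to read the reply leaf
    Fail(String), // terminal: stop retrying and report immediately
    Retry,        // non-terminal: poll again after backoff
}

fn next_step(status: &str, reject_detail: &str) -> Next {
    match status {
        "replied" => Next::ReadReply,
        "rejected" => Next::Fail(format!("call rejected: {reject_detail}")),
        "done" => Next::Fail("call is 'done'; reply no longer available".to_string()),
        // "received", "processing", etc. are non-terminal.
        _ => Next::Retry,
    }
}

fn main() {
    assert_eq!(next_step("replied", ""), Next::ReadReply);
    assert_eq!(next_step("processing", ""), Next::Retry);
    assert!(matches!(next_step("rejected", "code 4"), Next::Fail(_)));
    assert!(matches!(next_step("done", ""), Next::Fail(_)));
}
```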
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
for terminal states (rejected, done, malformed reply). Uses a nested Result pattern (`anyhow::Result<Result<Vec<u8>, String>>`) so that:
- `Ok(Err(...))` stops retrying immediately for terminal failures
- `Ok(Ok(...))` returns success

Error messages now include canister_id, method_name, and request_id for better debuggability. This makes the function composable so callers can handle errors instead of unwinding on panic.
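The nested-Result contract can be sketched with a plain `String` standing in for the `anyhow` outer error; the helper `retry_until_terminal` is hypothetical, a simplified stand-in for `retry_with_msg_async!`:

```rust
use std::thread;
use std::time::Duration;

// Outer Err(_): transient failure, retry. Ok(Err(_)): terminal, stop.
// Ok(Ok(bytes)): success. (The real code uses anyhow for the outer layer.)
type PollResult = Result<Result<Vec<u8>, String>, String>;

fn retry_until_terminal(
    mut poll: impl FnMut() -> PollResult,
    max_attempts: u32,
) -> Result<Vec<u8>, String> {
    for attempt in 1..=max_attempts {
        match poll() {
            Ok(Ok(bytes)) => return Ok(bytes),         // success
            Ok(Err(terminal)) => return Err(terminal), // stop immediately
            Err(transient) => {
                eprintln!("attempt {attempt} failed transiently: {transient}");
                thread::sleep(Duration::from_millis(10)); // stand-in for backoff
            }
        }
    }
    Err("polling timed out".to_string())
}

fn main() {
    let mut calls = 0;
    let result = retry_until_terminal(
        || {
            calls += 1;
            if calls < 3 {
                Err("transport error".to_string()) // retried
            } else {
                Ok(Ok(b"reply".to_vec()))
            }
        },
        10,
    );
    assert_eq!(result, Ok(b"reply".to_vec()));
    assert_eq!(calls, 3);
}
```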
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
…sync closure

Inside the `retry_with_msg_async!` polling closure in `update_and_wait`, several `.unwrap()`/`.expect()` calls could panic and bypass the retry logic entirely. Convert them to `?` with `anyhow::Context` so failures are logged and retried instead of crashing the test process.
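A std-only sketch of the same conversion, using `map_err` + `String` in place of `anyhow::Context` (the `parse_counter` helper is hypothetical): each fallible step returns a contextual error through `?` instead of panicking, so a surrounding retry loop can log it and try again.

```rust
// Instead of `.unwrap()` inside the polling closure (which panics and
// bypasses the retry loop), attach context and propagate with `?`.
fn parse_counter(raw: &[u8]) -> Result<u64, String> {
    let text = std::str::from_utf8(raw)
        .map_err(|e| format!("reply was not valid UTF-8: {e}"))?;
    text.trim()
        .parse::<u64>()
        .map_err(|e| format!("reply {text:?} was not a number: {e}"))
}

fn main() {
    assert_eq!(parse_counter(b"42"), Ok(42));
    assert!(parse_counter(b"\xff").is_err()); // an error, not a panic: retryable
    assert!(parse_counter(b"abc").is_err());
}
```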
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
…ling

The read_state status 'pruned' is a terminal state in the IC spec, but it was falling into the default retry arm, causing pointless retries until `polling_timeout`. Treat it as terminal (like 'done') and return an immediate error.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
…mplex for this code
…ry loops

Change `send_http_request()` to return `anyhow::Result<Response>` instead of panicking on transport failures. This ensures that transient errors inside `retry_with_msg_async!` are retried cleanly rather than bypassing the retry loop via a panic. Update the signatures of `update()` and `query()` accordingly, and adjust all call sites to handle the new Result return type.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
```rust
let _ = self
    .send_http_request("call", canister_id, body)
    .await
    .context("failed to send update call")?;
```
update() ignores the HTTP response status from the call endpoint. If the replica returns a non-success status (e.g., 4xx/5xx), the code will still return a request_id and then update_and_wait may poll until timeout with a misleading error. Consider failing fast by checking the HTTP status (e.g., via error_for_status or an explicit status check) before returning Ok(request_id).
```diff
-let _ = self
-    .send_http_request("call", canister_id, body)
-    .await
-    .context("failed to send update call")?;
+let response = self
+    .send_http_request("call", canister_id, body)
+    .await
+    .context("failed to send update call")?;
+if !response.status().is_success() {
+    bail!("update call failed with HTTP status: {}", response.status());
+}
```
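The suggested fail-fast check can be sketched with plain types (`request_id_if_accepted` is a hypothetical helper and the status is modelled as a bare `u16`; the real code would inspect the reqwest `Response`):

```rust
// Check the HTTP status of the `call` response before handing back a
// request id to poll on, instead of letting update_and_wait poll a
// request the replica never accepted until it times out.
fn request_id_if_accepted(status: u16, request_id: [u8; 4]) -> Result<[u8; 4], String> {
    if (200..300).contains(&status) {
        Ok(request_id) // safe to poll request_status/<id> afterwards
    } else {
        Err(format!("update call failed with HTTP status: {status}"))
    }
}

fn main() {
    let id = [1, 2, 3, 4];
    assert_eq!(request_id_if_accepted(202, id), Ok(id));
    assert!(request_id_if_accepted(500, id).is_err());
}
```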
```rust
    sender_pubkey: Some(Blob(self.pubkey.clone().into_vec())),
    sender_sig: Some(Blob(signature.signature.unwrap())),
};
let body = serde_cbor::ser::to_vec(&envelope).unwrap();
let _ = self.send_http_request("call", canister_id, body).await;
request_id
```
Now that update() returns anyhow::Result, serde_cbor::ser::to_vec(&envelope).unwrap() can still panic and bypass the structured error handling you added. Prefer propagating the serialization failure as an error with context instead of unwrapping.
```diff
@@ -213,9 +225,15 @@
     sender_sig: Some(Blob(signature.signature.unwrap())),
 };
 let body = serde_cbor::ser::to_vec(&envelope).unwrap();
```
query() returns anyhow::Result, but serde_cbor::ser::to_vec(&envelope).unwrap() can still panic and bypass the new error handling path. Prefer returning a contextual error instead of unwrapping.
```diff
-let body = serde_cbor::ser::to_vec(&envelope).unwrap();
+let body = serde_cbor::ser::to_vec(&envelope)
+    .context("failed to serialize query envelope")?;
```
```diff
 .send()
 .await
-.unwrap()
+.context(format!("failed to send {method} request to {canister_id}"))
```
send_http_request builds the context string eagerly via context(format!(...)), which allocates even on the success path and is called frequently during update_and_wait polling. Prefer with_context(|| format!(...)) (or equivalent) so the formatting only happens when there is an error.
```diff
-.context(format!("failed to send {method} request to {canister_id}"))
+.with_context(|| format!("failed to send {method} request to {canister_id}"))
```
//rs/tests/idx:ii_delegation_test flaked twice in the last week in the same way:

Root Cause

The test was flaky because the update call to increment the counter canister used `update()`, which fires the HTTP request but does NOT wait for the update to be processed. The test then immediately tried to read the counter value, but since the update hadn't been executed yet, the counter was still 0. The retry loop (10 retries, 1 second each) sometimes wasn't enough time for the update to be processed on the application subnet.

Fix

Use `update_and_wait()` instead of `update()` to properly wait for the update to be processed before checking the counter value. This also lets us directly verify the counter value from the delegated query response, removing the need for the separate retry-based assertion (`assert_canister_counter_with_retries`).

Note that the test already used `update_and_wait()` for the universal canister call later in the test (line 164), making this fix consistent with the existing pattern.

We also re-implemented the retry loop in `update_and_wait()` with `ic_system_test_driver::retry_with_msg_async!` so that it has proper logging and uses a backoff duration between failed attempts.

This PR was created following the steps in `.claude/skills/fix-flaky-tests/SKILL.md`.