feat(monitor): add number_gte support with tiered backoff to allocation queries #912
Conversation
Coveralls: Pull Request Test Coverage Report for Build 21385200531
Add a tiered backoff strategy to handle Gateway routing to indexers with varying sync states. When queries fail due to "block too high" errors, the backoff progressively relaxes the number_gte constraint:

- Tier 0: exact block (no backoff)
- Tier 1: -100 blocks (~25 sec on Arbitrum)
- Tier 2: -1,000 blocks (~4 min on Arbitrum)
- Tier 3: -10,000 blocks (~40 min on Arbitrum)
- Tier 4+: use max(latest) from Gateway error

Includes parsing of Gateway error messages to extract available block numbers as a final fallback.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
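A minimal sketch of how the tier schedule above could map to code; the PR's actual TieredBlockBackoff internals are not quoted in this thread, so the names and shapes here are illustrative:

```rust
/// Block offset to subtract from the last known block at a given tier.
/// Illustrative only; see the commit message above for the real schedule.
fn tier_offset(tier: usize) -> Option<u64> {
    match tier {
        0 => Some(0),      // Tier 0: exact block (no backoff)
        1 => Some(100),    // Tier 1: ~25 sec of Arbitrum blocks
        2 => Some(1_000),  // Tier 2: ~4 min
        3 => Some(10_000), // Tier 3: ~40 min (removed in a later commit)
        _ => None,         // Tier 4+: use max(latest) from the Gateway error
    }
}

/// number_gte value for the next attempt; once the fixed tiers are
/// exhausted, fall back to the latest block parsed from the Gateway error.
fn next_min_block(last_block: u64, tier: usize, gateway_latest: Option<u64>) -> Option<u64> {
    match tier_offset(tier) {
        Some(offset) => Some(last_block.saturating_sub(offset)),
        None => gateway_latest,
    }
}
```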
Update indexer_allocations to use TieredBlockBackoff for handling Gateway routing to indexers with varying sync states:

- Add number_gte parameter to get_allocations_with_block_state
- Track block number from successful responses
- Use number_gte constraint on subsequent queries to ensure freshness
- Progress through backoff tiers on "block too high" errors
- Preserve existing allocation data when receiving empty results

The backoff logic is implemented directly in the allocations module for simplicity, avoiding unnecessary abstractions in the watcher crate.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
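A hedged sketch of the refresh flow this commit describes; GatewayResponse, fetch_allocations, and the error handling are stand-ins for the crate's real types, not its actual API:

```rust
struct GatewayResponse {
    block_number: u64,
    allocations: Vec<String>, // simplified: allocation ids only
}

// Stand-in for the real subgraph query; assumed signature.
async fn fetch_allocations(_number_gte: Option<u64>) -> Result<GatewayResponse, String> {
    unimplemented!("placeholder for the real Gateway query")
}

fn is_block_too_high(err: &str) -> bool {
    err.contains("missing block") // simplified; see the review discussion below
}

async fn refresh(
    last_good_block: &mut Option<u64>,
    cache: &mut Vec<String>,
) -> Result<(), String> {
    // Tier offsets: exact block, then -100, then -1,000 blocks.
    for offset in [0u64, 100, 1_000] {
        let number_gte = last_good_block.map(|b| b.saturating_sub(offset));
        match fetch_allocations(number_gte).await {
            Ok(resp) => {
                // Track the block number from the successful response.
                *last_good_block = Some(resp.block_number);
                // Preserve existing allocation data on empty results.
                if !resp.allocations.is_empty() {
                    *cache = resp.allocations;
                }
                return Ok(());
            }
            // Progress to the next backoff tier on "block too high" errors.
            Err(e) if is_block_too_high(&e) => continue,
            Err(e) => return Err(e),
        }
    }
    Err("exhausted backoff tiers".to_string())
}
```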
Apply the same number_gte with tiered backoff pattern to the closed allocations query in sender_account.rs. This ensures consistency with allocation queries and handles Gateway routing to indexers with varying sync states.

Changes:
- Add retry loop with TieredBlockBackoff for check_closed_allocations
- Extract query logic to query_closed_allocations helper
- Use number_gte constraint on initial query, hash for pagination
- Log backoff tier progression on block-too-high errors

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Co-Authored-By: Claude Opus 4.5 <[email protected]>
…query

Fixes #911

Tokio's time::interval fires immediately on the first tick, causing a redundant second query right after the initial query. This could overwrite valid data if the second query fails or returns stale results.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove 10,000 block tier (40 min too aggressive, 4 min sufficient)
- Add ARBITRUM_MIN_EXPECTED_BLOCK (400M) sanity check on initial query
- Retry up to 3 times if initial block is suspiciously low
- Change closed allocations retry to 2 attempts per tier before advancing

Co-Authored-By: Claude Opus 4.5 <[email protected]>
If you have time for a review, @suchapalaver
neithanmo left a comment
While I understand the intent to handle Gateway routing to out-of-sync indexers, this PR has issues that block merging:

- Hardcoded ARBITRUM_MIN_EXPECTED_BLOCK = 400_000_000 - breaks on Base, Sepolia, local dev, and any non-Arbitrum-mainnet chain.
- Brittle error parsing - string matching on "missing block" / "latest: \d+" is not a stable contract. Gateway error formats can change without notice.
- The "empty = stale" logic - the should_update check rejects empty results even from fresh indexers (this needs double-checking, though). If an indexer legitimately closes all allocations, the watcher will preserve stale data indefinitely.
On the broader approach: the current behavior without this PR is recoverable - queries fail, get logged, and are retried until the Gateway routes to a synced indexer.

We think this issue should be addressed on the Gateway side, as the problem is not severe enough to warrant a client-side workaround.
@MoonBoi9001 We are investigating the Gateway side to figure out the best approach.
Thanks for digging into this and putting together a detailed solution - the tiered backoff concept is solid thinking. Appreciate the effort to improve reliability here! 🙏
```rust
const MAX_RETRIES: usize = 3;

for attempt in 1..=MAX_RETRIES {
    match initial_result.block_number {
        Some(block) if block >= ARBITRUM_MIN_EXPECTED_BLOCK => break,
```
This constant is a magic number and is problematic because:

- It's Arbitrum-specific, but the code is meant to be generic.
- It will be wrong for other chains like Base, Base Sepolia, the local testnets we use for testing, and so on.
- One option is to make this value configurable, but that moves the burden of choosing a minimum block number threshold onto indexer operators (sketched below).
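For illustration, the configurable variant mentioned in the last bullet might look roughly like this; the struct and field name are hypothetical, not part of the crate:

```rust
// Hypothetical config shape; not code from this repository.
#[derive(Debug, Default)]
struct SubgraphConfig {
    /// Minimum block a response must be synced to before it is trusted.
    /// None disables the check, so Base, Sepolia, and local dev networks
    /// work without per-chain tuning.
    min_expected_block: Option<u64>,
}

fn passes_sanity_check(cfg: &SubgraphConfig, block: u64) -> bool {
    cfg.min_expected_block.map_or(true, |min| block >= min)
}
```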
```rust
/// Checks if an error message indicates a "block too high" / "missing block" error
/// from the Gateway.
pub fn is_block_too_high_error(error: &str) -> bool {
```
This is brittle because:

- The error message format is not a contract - the Gateway can change it at any time, and we do not control it.
- We would need an integration test against the latest Gateway version to ensure this keeps working.

If the Gateway changes "missing block" to "block unavailable" or "latest: 123" to "max_block: 123", this silently stops working and falls back to... what?
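To make the fragility concrete, here is a sketch of the kind of string matching under discussion (the PR's actual parsing code isn't quoted in this thread):

```rust
// Extracting the Gateway-reported latest block from free-form error text.
// Breaks silently if the Gateway ever rewords "latest:" in its messages.
fn parse_latest_block(error: &str) -> Option<u64> {
    let idx = error.find("latest: ")?;
    error[idx + "latest: ".len()..]
        .split(|c: char| !c.is_ascii_digit()) // keep only the leading digit run
        .next()?
        .parse()
        .ok()
}
```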
```rust
let mut time_interval = time::interval(interval);
time_interval.set_missed_tick_behavior(time::MissedTickBehavior::Skip);
// Consume the immediate first tick to avoid redundant query at startup
time_interval.tick().await;
```
This change affects ALL watchers using new_watcher, not just the allocations watcher this feature is trying to fix. While it prevents double-querying at startup, could it affect other watchers' expected behavior?
```rust
let (should_update, current_count) = {
    let current = tx.borrow();
    let should_update =
        !query_result.allocations.is_empty() || current.is_empty();
```
This is confusing: only update if the result is non-empty OR we have nothing yet. What happens if an indexer is up to date but has no open allocations at this particular time?
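One way to distinguish the two cases would be to trust empty results only when they provably come from a fresh indexer; this is a hypothetical refinement, not code from the PR:

```rust
// Hypothetical: accept an empty result when the serving indexer is at
// least as synced as the last response we accepted, instead of rejecting
// all empty results outright.
fn should_update(
    new_allocations_empty: bool,
    response_block: u64,
    last_accepted_block: Option<u64>,
) -> bool {
    if !new_allocations_empty {
        return true;
    }
    last_accepted_block.map_or(true, |last| response_block >= last)
}
```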
True, however there's an issue here: indexers aren't able to serve queries for their subgraphs when their allocation data is served back to them incorrectly, since their indexer-rs believes that any receipts it receives from the Gateway are invalid, so it will not accept them or respond to incoming queries. Indexers that run their own version of the network subgraph can bypass this, but for those that don't, when we look inside BigQuery we see that they have a really low success rate. This also affects their score as determined by both the ISA for query-time selection and the IISA for DIPS: the ISA will start to deprioritise them, affecting how much revenue they can generate from serving queries, since from its perspective it sees an indexer that cannot respond to the queries it sends.
Okay, I was not aware that the protocol also lives on Base and Base Sepolia and that those chains are being used for testing. It makes sense that ARBITRUM_MIN_EXPECTED_BLOCK should not be hardcoded; I had also forgotten that this would cause issues with Arbitrum Sepolia and local dev environments.
Appreciate the review and the other comments here. I'd be happy to address all review feedback and put this PR in draft for now until that's done, but given your earlier comment that "this issue must be addressed on the gateway side", do you think this repo is not the right place for such a PR?
Just seen the thread about this on Slack - didn't know this was being discussed already. Looks like I independently found the same bug that was being reported.
Summary
Adds a number_gte parameter to allocation queries to ensure responses come from indexers synced to at least the last successful block - meaning the block height of the last response we got from the Gateway.

Problem
The Graph Gateway routes queries to different indexers with varying sync states. Without block freshness constraints, a query might be served by a stale indexer that returns 0 allocations (because it hasn't indexed them yet), overwriting valid cached data.
Solution
Track the block number from successful responses and use number_gte to constrain subsequent queries. When queries fail because indexers are behind, progressively relax the constraint through the backoff tiers described in the commits above.

🤖 Generated with Claude Code
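For reference, the freshness constraint rides on The Graph's standard block filter; a minimal sketch of the query shape, with the entity fields simplified for illustration:

```rust
// The entity fields here are simplified, but block: { number_gte: ... }
// and the _meta block lookup are standard parts of The Graph's GraphQL API.
const ALLOCATIONS_QUERY: &str = r#"
    query Allocations($lastBlock: Int!) {
        allocations(block: { number_gte: $lastBlock }, first: 100) {
            id
        }
        _meta {
            block { number }  # block the response was served at
        }
    }
"#;
```

The _meta block number is what lets the watcher track the block height of each successful response and feed it back in as the next number_gte constraint.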