Conversation


@MoonBoi9001 MoonBoi9001 commented Jan 27, 2026

Summary

  • Adds a number_gte parameter to allocation queries to ensure responses come from indexers synced to at least the last successful block, i.e. the block height of the last response we received from the gateway.
  • Implements a tiered backoff strategy for when the Gateway returns "block too high" errors (i.e. when indexers have not synced to at least the block height of the previous response).
  • This PR effectively protects against overwriting valid allocation data with empty results from stale indexers serving old data.

Problem

The Graph Gateway routes queries to different indexers with varying sync states. Without block freshness constraints, a query might be served by a stale indexer that returns 0 allocations (because it hasn't indexed them yet), overwriting valid cached data.

Solution

Track the block number from successful responses and use number_gte to constrain subsequent queries. When queries fail because indexers are behind, progressively relax the constraint through tiers (a sketch follows the table):

| Tier | Relax the block height constraint by | Equivalent time on Arbitrum |
|------|--------------------------------------|-----------------------------|
| 0    | 0 blocks                             | immediate                   |
| 1    | 100 blocks                           | ~25 sec                     |
| 2    | 1,000 blocks                         | ~4 min                      |
| 3    | 10,000 blocks                        | ~40 min                     |
| 4+   | just use block height from error message | last resort             |
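
A minimal sketch of this tier schedule in Rust. Only the name `TieredBlockBackoff` comes from this PR's commits; the struct shape and method names here are assumptions:

```rust
/// Sketch of the tier schedule above; the real implementation may differ.
#[derive(Default)]
pub struct TieredBlockBackoff {
    tier: u32,
}

impl TieredBlockBackoff {
    /// Blocks to subtract from the last known-good block at each tier.
    /// Returns None at tier 4+, where the block height parsed from the
    /// Gateway error message is used instead (last resort).
    pub fn relaxation(&self) -> Option<u64> {
        match self.tier {
            0 => Some(0),
            1 => Some(100),    // ~25 sec on Arbitrum
            2 => Some(1_000),  // ~4 min
            3 => Some(10_000), // ~40 min
            _ => None,         // fall back to the block from the error message
        }
    }

    /// Move to the next tier after a "block too high" failure.
    pub fn advance(&mut self) {
        self.tier += 1;
    }
}
```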

🤖 Generated with Claude Code

@MoonBoi9001 MoonBoi9001 force-pushed the feat/number-gte-tiered-backoff branch from e41e52f to 043ba9d on January 27, 2026 04:41
@MoonBoi9001 MoonBoi9001 force-pushed the feat/number-gte-tiered-backoff branch 2 times, most recently from e1be806 to 9f38b12 on January 27, 2026 05:02

github-actions bot commented Jan 27, 2026

Pull Request Test Coverage Report for Build 21385200531

Details

  • 153 of 273 (56.04%) changed or added relevant lines in 4 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 68.088%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|--------------------------|---------------|---------------------|---|
| crates/tap-agent/src/agent/sender_account.rs | 29 | 51 | 56.86% |
| crates/monitor/src/allocations.rs | 0 | 98 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|-------------------------------|------------------|---|
| crates/monitor/src/allocations.rs | 2 | 0.0% |
| crates/watcher/src/lib.rs | 3 | 82.47% |

| Totals | Coverage Status |
|--------|-----------------|
| Change from base Build 21372283217 | -0.1% |
| Covered Lines | 10427 |
| Relevant Lines | 15314 |

💛 - Coveralls

@MoonBoi9001 MoonBoi9001 force-pushed the feat/number-gte-tiered-backoff branch from 9f38b12 to 5123b0a on January 27, 2026 15:06
MoonBoi9001 and others added 6 commits January 27, 2026 22:41
Add a tiered backoff strategy to handle Gateway routing to indexers with
varying sync states. When queries fail due to "block too high" errors,
the backoff progressively relaxes the number_gte constraint:

- Tier 0: exact block (no backoff)
- Tier 1: -100 blocks (~25 sec on Arbitrum)
- Tier 2: -1,000 blocks (~4 min on Arbitrum)
- Tier 3: -10,000 blocks (~40 min on Arbitrum)
- Tier 4+: use max(latest) from Gateway error

Includes parsing of Gateway error messages to extract available block
numbers as a final fallback.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
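
As a rough illustration of that fallback: the `is_block_too_high_error` signature appears in this PR's diff, but the matched substrings and the `latest: <N>` format are assumptions taken from the PR discussion, not a stable contract (as the review below points out):

```rust
use regex::Regex;

/// Heuristic check for a Gateway "block too high" error; the matched
/// substrings follow this PR's description and are not a stable contract.
pub fn is_block_too_high_error(error: &str) -> bool {
    error.contains("missing block") || error.contains("block too high")
}

/// Best-effort extraction of the indexer's latest available block from a
/// Gateway error message, assuming a `latest: <N>` substring.
pub fn parse_latest_block(error: &str) -> Option<u64> {
    let re = Regex::new(r"latest:\s*(\d+)").ok()?;
    re.captures(error)?.get(1)?.as_str().parse().ok()
}
```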
Update indexer_allocations to use TieredBlockBackoff for handling Gateway
routing to indexers with varying sync states:

- Add number_gte parameter to get_allocations_with_block_state
- Track block number from successful responses
- Use number_gte constraint on subsequent queries to ensure freshness
- Progress through backoff tiers on "block too high" errors
- Preserve existing allocation data when receiving empty results

The backoff logic is implemented directly in the allocations module for
simplicity, avoiding unnecessary abstractions in the watcher crate.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
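
For context, `block: { number_gte: ... }` is the subgraph API's built-in freshness argument. A hedged sketch of how such a query might be assembled; the field selection is illustrative and the real query in this repo may differ:

```rust
/// Sketch of an allocations query constrained to a minimum block via the
/// subgraph API's `block: { number_gte: ... }` argument.
fn allocations_query(number_gte: Option<u64>) -> String {
    let block_arg = number_gte
        .map(|n| format!("block: {{ number_gte: {n} }}, "))
        .unwrap_or_default();
    format!("{{ allocations({block_arg}first: 1000) {{ id allocatedTokens }} }}")
}
```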
Apply the same number_gte with tiered backoff pattern to the closed
allocations query in sender_account.rs. This ensures consistency with
allocation queries and handles Gateway routing to indexers with varying
sync states.

Changes:
- Add retry loop with TieredBlockBackoff for check_closed_allocations
- Extract query logic to query_closed_allocations helper
- Use number_gte constraint on initial query, hash for pagination
- Log backoff tier progression on block-too-high errors

Co-Authored-By: Claude Opus 4.5 <[email protected]>
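
A sketch of the retry loop this describes. The names `TieredBlockBackoff`, `query_closed_allocations`, and `is_block_too_high_error` come from this PR, but their signatures here, and the `Allocation` type, are assumptions:

```rust
/// Illustrative retry loop: relax the number_gte constraint one tier at a
/// time whenever the Gateway reports the requested block as too high.
async fn closed_allocations_with_backoff(
    last_good_block: u64,
) -> anyhow::Result<Vec<Allocation>> {
    let mut backoff = TieredBlockBackoff::default();
    loop {
        // Tier 4+ yields None, i.e. query without a number_gte constraint
        // (the real code substitutes the block parsed from the error).
        let min_block = backoff
            .relaxation()
            .map(|delta| last_good_block.saturating_sub(delta));
        match query_closed_allocations(min_block).await {
            Ok(allocations) => return Ok(allocations),
            Err(e) if is_block_too_high_error(&e.to_string()) => {
                tracing::warn!("block too high, advancing backoff tier");
                backoff.advance();
            }
            Err(e) => return Err(e),
        }
    }
}
```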
…query

Fixes #911

Tokio's time::interval fires immediately on the first tick, causing a
redundant second query right after the initial query. This could
overwrite valid data if the second query fails or returns stale results.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
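
The immediate first tick is documented tokio behavior; a minimal sketch of the fix, with the periodic work elided:

```rust
use tokio::time::{self, Duration, MissedTickBehavior};

/// tokio's `interval` fires immediately on the first `tick()`, so we
/// consume that tick before entering the periodic loop.
async fn watcher_loop(interval: Duration) {
    let mut ticker = time::interval(interval);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    // Consume the immediate first tick to avoid a redundant query right
    // after the initial query at startup.
    ticker.tick().await;
    loop {
        ticker.tick().await; // now fires only after each full interval
        // refresh_allocations().await; // hypothetical periodic work
    }
}
```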
- Remove 10,000 block tier (40 min too aggressive, 4 min sufficient)
- Add ARBITRUM_MIN_EXPECTED_BLOCK (400M) sanity check on initial query
- Retry up to 3 times if initial block is suspiciously low
- Change closed allocations retry to 2 attempts per tier before advancing

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@MoonBoi9001 MoonBoi9001 force-pushed the feat/number-gte-tiered-backoff branch from 4a94fe2 to 5dc4539 on January 28, 2026 02:41
@MoonBoi9001 MoonBoi9001 marked this pull request as ready for review January 28, 2026 02:56
@MoonBoi9001

If you have time for a review @suchapalaver
Much appreciated!


@neithanmo neithanmo left a comment

While I understand the intent to handle Gateway routing to out-of-sync indexers, this PR has issues that block merging:

  1. Hardcoded ARBITRUM_MIN_EXPECTED_BLOCK = 400_000_000 - Breaks on Base, Sepolia, local dev, and any non-Arbitrum-mainnet chain.
  2. Brittle error parsing - String matching on "missing block" / "latest: \d+" is not a stable contract. Gateway error formats can change without notice.
  3. The "empty = stale" check - the should_update check appears to reject empty results even from fresh indexers (this needs to be double-checked, though). If an indexer legitimately closes all allocations, the watcher will preserve stale data indefinitely.

On the broader approach: the current behavior without this PR is recoverable - queries fail, get logged, and are retried until the Gateway routes to a synced indexer.

We do think this issue must be addressed on the gateway side, because this problem is not severe enough to warrant a client-side workaround.

@MoonBoi9001 We are investigating the Gateway side to figure out the best approach.

Thanks for digging into this and putting together a detailed solution - the tiered backoff concept is solid thinking. Appreciate the effort to improve reliability here! 🙏

const MAX_RETRIES: usize = 3;
for attempt in 1..=MAX_RETRIES {
    match initial_result.block_number {
        Some(block) if block >= ARBITRUM_MIN_EXPECTED_BLOCK => break,

This constant is a magic number and is problematic because:

  • It's Arbitrum-specific, but the code is meant to be generic.
  • It will be wrong for other chains like Base, Base Sepolia, the local testnets we use for testing, and so on.
  • One option is to make this value configurable (sketched below), but that moves the burden of choosing a minimum block threshold onto indexer operators.
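
A hedged sketch of what making it configurable might look like; the type and field names are hypothetical, not from this PR:

```rust
/// Hypothetical config sketch illustrating the suggestion above.
#[derive(Debug, serde::Deserialize)]
pub struct NetworkSubgraphConfig {
    /// Chain-specific sanity floor for block numbers reported by the
    /// Gateway; `None` disables the check (e.g. on local dev chains).
    #[serde(default)]
    pub min_expected_block: Option<u64>,
}
```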


/// Checks if an error message indicates a "block too high" / "missing block" error
/// from the Gateway.
pub fn is_block_too_high_error(error: &str) -> bool {

This is brittle because:

  1. The error message format is not a contract - the Gateway can change it at any time, and we do not control it.
  2. We would need integration tests against the latest Gateway version to ensure this keeps working.

If the Gateway changes "missing block" to "block unavailable" or "latest: 123" to "max_block: 123", this silently stops working and falls back to... what?

let mut time_interval = time::interval(interval);
time_interval.set_missed_tick_behavior(time::MissedTickBehavior::Skip);
// Consume the immediate first tick to avoid redundant query at startup
time_interval.tick().await;

This change affects ALL watchers using new_watcher, not just the allocations watcher this feature is trying to fix. While it prevents double-querying at startup, could it affect other watchers' expected behavior?

let (should_update, current_count) = {
    let current = tx.borrow();
    let should_update =
        !query_result.allocations.is_empty() || current.is_empty();

This is confusing: only update if the result is non-empty OR we have nothing yet. What happens if an indexer is up to date but simply has no open allocations at this particular time?


MoonBoi9001 commented Jan 29, 2026

> The current behavior without this PR is recoverable - queries fail, get logged, and are retried until the Gateway routes to a synced indexer.

True, however there's an issue here: indexers aren't able to serve queries for their subgraphs when their allocation data is served back to them incorrectly, since their indexer-rs believes that any receipts it receives from the gateway are invalid, so it will not accept them or respond to incoming queries. Indexers that run their own copy of the network subgraph can bypass this, but those that don't appear, when we look inside BigQuery, to have a really low success rate. This also affects their score as determined by both the ISA (for query-time selection) and the IISA (for DIPS): the ISA will start to deprioritise them, affecting how much revenue they can generate from serving queries, since from its perspective it sees an indexer that cannot respond to the queries it sends.

> It will be wrong for other chains like Base, Base Sepolia, the local testnets we use for testing, and so on.

Okay, I was not aware that the protocol also lives on Base and Base Sepolia and that those chains are being used for testing. It makes sense that ARBITRUM_MIN_EXPECTED_BLOCK should not be hardcoded; I had also forgotten that this would cause issues with Arbitrum Sepolia and local dev environments.

> Brittle error parsing - String matching on "missing block" / "latest: \d+" is not a stable contract. Gateway error formats can change without notice.
> The "empty = stale" check - the should_update check appears to reject empty results even from fresh indexers (this needs to be double-checked, though). If an indexer legitimately closes all allocations, the watcher will preserve stale data indefinitely.

Appreciate the review and the other comments here. I'd be happy to address all review feedback and put this PR in draft until that's done, but given your earlier comment that "this issue must be addressed on the gateway side", do you think this repo is not the right place for such a PR?

@MoonBoi9001

Just seen the thread about this on Slack; I didn't know this was already being discussed. Looks like I independently found the same bug that was being reported.
