fix: add statement_timeout and lock_timeout to mirror_primary_tables …#6446
Open
suntzu93 wants to merge 1 commit into graphprotocol:master from
Conversation
lutter approved these changes Mar 20, 2026
store/postgres/src/primary.rs
Outdated
```rust
conn.batch_execute("SET LOCAL statement_timeout = '60s'; SET LOCAL lock_timeout = '10s';")
    .await
    .map_err(StoreError::from)?;
```
Collaborator
One minor nit: could you move this code down below the declaration of `copy_table`, i.e., just before the comment `// Truncate all tables at once`? Mixing code and fn declarations makes this a little hard to read.
…to prevent FDW zombie locks

Add SET LOCAL timeouts to refresh_tables() to prevent FDW zombie connections from holding ACCESS EXCLUSIVE locks on mirrored tables indefinitely. This fixes a v0.42.0 regression where stuck FDW queries during mirror_primary_tables() cascade into a full system lock.

- statement_timeout=60s: kills stuck FDW queries (normally <5s)
- lock_timeout=10s: prevents TRUNCATE stampede from multiple nodes

Both are SET LOCAL (transaction-scoped), zero impact on other queries.
suntzu93 force-pushed from 7066bf4 to de4574f
Fix: Add statement_timeout and lock_timeout to mirror_primary_tables to prevent FDW zombie locks
Problem
In multi-shard deployments running v0.42.0, Graph Node periodically hangs with connection checkout timeouts across all shards:
This occurs when multiple sync nodes (with block ingestion enabled) are running in a multi-shard configuration. Single-shard deployments (no FDW) and read-only nodes (`DISABLE_BLOCK_INGESTOR=true`) are not affected.

Root Cause
`Mirror::refresh_tables()` is called every 15 minutes by the MirrorPrimary job, which is registered inside `spawn_block_ingestor()` → `register_store_jobs()`. This means it runs on every node that has block ingestion enabled (i.e., nodes without `DISABLE_BLOCK_INGESTOR=true`). It performs:

1. `TRUNCATE` on 6 tables at once — acquires ACCESS EXCLUSIVE locks on `chains`, `deployment_schemas`, `active_copies`, `subgraph`, `subgraph_version`, `subgraph_deployment_assignment`
2. `INSERT ... SELECT * FROM primary_public.{table}` via Foreign Data Wrapper (FDW) to repopulate each table

If the FDW connection to the primary becomes a zombie (connection dropped but the shard hasn't detected it via TCP keepalive), the FDW `INSERT` hangs indefinitely on `PostgresFdwGetResult` while holding ACCESS EXCLUSIVE locks on all 6 tables. This blocks every query touching these tables — including `deployment_statuses()`, `deployment_sizes()`, and `Mirror::read_async()` — causing a lock cascade that exhausts the connection pool.

In setups with multiple sync nodes (one per chain, none setting `DISABLE_BLOCK_INGESTOR`), all of them run MirrorPrimary independently. This creates a TRUNCATE stampede when multiple nodes attempt `refresh_tables()` on the same shard concurrently — each queuing for the ACCESS EXCLUSIVE lock held by the stuck FDW query.

Conditions to trigger
1. Multi-shard deployment (`refresh_tables()` is a no-op on the primary shard: `if self.shard == *PRIMARY_SHARD { return Ok(()); }`)
2. Block ingestion enabled, i.e. the node does not set `DISABLE_BLOCK_INGESTOR=true`
3. The FDW connection from the shard to the primary becomes a zombie

Even with a single ingestor node, condition 3 alone is sufficient to cause indefinite ACCESS EXCLUSIVE locks. Multiple ingestors amplify the impact through the TRUNCATE stampede.
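For reference, each refresh amounts to roughly the following transaction. This is an illustrative sketch based on the description above; the exact SQL graph-node issues may differ:

```sql
BEGIN;
-- Acquires ACCESS EXCLUSIVE on all six mirrored tables at once
TRUNCATE chains, deployment_schemas, active_copies,
         subgraph, subgraph_version, subgraph_deployment_assignment;
-- One repopulating INSERT per table, each going through postgres_fdw;
-- this is the step that hangs when the FDW connection is a zombie
INSERT INTO chains SELECT * FROM primary_public.chains;
INSERT INTO deployment_schemas SELECT * FROM primary_public.deployment_schemas;
-- ... and so on for the remaining mirrored tables ...
COMMIT;
```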
PostgreSQL Investigation
Three diagnostic queries were run on production to confirm the exact lock chain.
Query 1 — Blocking chain (which PID blocks which):
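The excerpt does not include the query text; a standard way to obtain such a blocking chain (an assumption here, not necessarily the exact query used) is `pg_blocking_pids()`:

```sql
-- For every waiting backend, list the PIDs blocking it
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```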
Result — all queries blocked by a single PID, 296302:

- `Mirror::read_async()` reading `public.chains` — blocked
- queries reading `deployment_schemas` — blocked

Query 2 — Lock details (`pg_locks` joined with `pg_class`):

Result — PID 296302 holds `AccessExclusiveLock` (granted = true) on ALL 6 tables:

Query 3 — Primary database health (confirming FDW zombie):
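A query of roughly this shape, run on the primary, would surface any live FDW backends. This is an assumed reconstruction (the exact query is not in the excerpt); it relies on `postgres_fdw` reporting itself via `application_name` by default:

```sql
-- Run on the PRIMARY: list backends opened by postgres_fdw
SELECT pid, state, backend_start, wait_event_type, left(query, 80) AS query
FROM pg_stat_activity
WHERE application_name = 'postgres_fdw';
```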
Result — zero active FDW queries from the shard on the primary:
No FDW connection from the shard — confirming the connection is a zombie. The shard is waiting for a response from a connection that no longer exists on the primary.
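Since detection hinges on TCP keepalive, a complementary mitigation (not part of this PR; the server name `primary_server` below is hypothetical) is to set libpq keepalive options on the foreign server definition so the shard notices dead connections sooner:

```sql
-- postgres_fdw passes these libpq options through to the connection
ALTER SERVER primary_server OPTIONS (
    ADD keepalives '1',
    ADD keepalives_idle '60',
    ADD keepalives_interval '10',
    ADD keepalives_count '3'
);
```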
Why v0.41.0 was not affected
v0.41.0 used `PoolInner::with_conn()`, which provided two protections that were removed in the diesel-async migration:

1. A limiter semaphore (capacity = `pool_size`) that serialized all pool operations, including `mirror_primary_tables()`, preventing multiple nodes from issuing concurrent TRUNCATEs
2. A `CancelHandle` that could abort an operation between steps

v0.42.0's `mirror_primary_tables()` uses `self.get().await?` directly — no limiter, no cancel mechanism. A stuck FDW query holds ACCESS EXCLUSIVE locks indefinitely with no way to abort.

Note: even v0.41.0's `CancelHandle` could not interrupt a stuck FDW query mid-execution (it only checked between steps). The `SET LOCAL statement_timeout` fix is actually stronger — PostgreSQL kills the query at the database level regardless of execution state.

Fix
Add `SET LOCAL` timeouts in `refresh_tables()`, just before the TRUNCATE operation:

- `lock_timeout = 10s`: if `TRUNCATE` cannot acquire its locks within 10s (another node is mirroring), the transaction aborts instead of queuing. The job retries in 15 minutes.
- `statement_timeout = 60s`: if any FDW query (which normally completes in <5s) is stuck for 60s, PostgreSQL kills it. The transaction rolls back, releasing all locks.
- `SET LOCAL`: scoped to this transaction only — zero impact on indexing, GraphQL queries, or any other operations.

Impact

`SET LOCAL` is metadata-only; no timers or polling.
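The transaction scoping is easy to verify in an interactive psql session (illustrative):

```sql
BEGIN;
SET LOCAL statement_timeout = '60s';
SHOW statement_timeout;   -- 60s while the transaction is open
COMMIT;
SHOW statement_timeout;   -- reverts to the previous session value
```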