
fix: add statement_timeout and lock_timeout to mirror_primary_tables … #6446

Open
suntzu93 wants to merge 1 commit into graphprotocol:master from suntzu93:fix/fdw-zombie-lock-timeout

Conversation

@suntzu93 commented Mar 20, 2026

Fix: Add statement_timeout and lock_timeout to mirror_primary_tables to prevent FDW zombie locks

Problem

In multi-shard deployments running v0.42.0, Graph Node periodically hangs with connection checkout timeouts across all shards:

Connection checkout timed out, wait_ms: 30000, pool: main, shard: shard2
Connection checkout timed out, wait_ms: 30001, pool: main, shard: shard3
Connection checkout timed out, wait_ms: 30000, pool: main, shard: shard1

This occurs when multiple sync nodes (with block ingestion enabled) are running in a multi-shard configuration. Single-shard deployments (no FDW) and read-only nodes (DISABLE_BLOCK_INGESTOR=true) are not affected.

Root Cause

Mirror::refresh_tables() is called every 15 minutes by the MirrorPrimary job, which is registered inside spawn_block_ingestor() → register_store_jobs(). This means it runs on every node that has block ingestion enabled (i.e., nodes without DISABLE_BLOCK_INGESTOR=true). It performs:

  1. TRUNCATE on 6 tables at once — acquires ACCESS EXCLUSIVE lock on chains, deployment_schemas, active_copies, subgraph, subgraph_version, subgraph_deployment_assignment
  2. Sequential INSERT ... SELECT * FROM primary_public.{table} via Foreign Data Wrapper (FDW) to repopulate each table

If the FDW connection to the primary becomes a zombie (connection dropped but shard hasn't detected it via TCP keepalive), the FDW INSERT hangs indefinitely on PostgresFdwGetResult while holding ACCESS EXCLUSIVE locks on all 6 tables. This blocks every query touching these tables — including deployment_statuses(), deployment_sizes(), and Mirror::read_async() — causing a lock cascade that exhausts the connection pool.

In setups with multiple sync nodes (one per chain, none setting DISABLE_BLOCK_INGESTOR=true), each node runs MirrorPrimary independently. This creates a TRUNCATE stampede when multiple nodes attempt refresh_tables() on the same shard concurrently — each queuing for the ACCESS EXCLUSIVE lock held by the stuck FDW query.

Conditions to trigger

  1. Multi-shard setup — mirror_primary_tables() is a no-op on the primary shard (if self.shard == *PRIMARY_SHARD { return Ok(()); })
  2. Block ingestion enabled — MirrorPrimary job only runs on nodes without DISABLE_BLOCK_INGESTOR=true
  3. Unstable FDW connection — network issue between shard and primary causes zombie connection (shard waiting for response from a dead connection)

Even with a single ingestor node, condition 3 alone is sufficient to cause indefinite ACCESS EXCLUSIVE locks. Multiple ingestors amplify the impact through TRUNCATE stampede.
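When these conditions are suspected, a stuck refresh can be spotted quickly from pg_stat_activity on the affected shard before running the fuller lock-chain queries below. This filter is an illustrative sketch (not part of the PR); it keys on the primary_public FDW query text that mirror_primary_tables() issues:

```sql
-- Run on the shard: find an active mirror refresh and how long it has been stuck.
SELECT pid, wait_event_type, wait_event,
       now() - query_start AS stuck_for,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%primary_public%';
```

A row sitting here for minutes with wait_event PostgresFdwGetResult is the zombie scenario described above.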

PostgreSQL Investigation

Three diagnostic queries were run on production to confirm the exact lock chain.

Query 1 — Blocking chain (which PID blocks which):

SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
       blocking.pid AS blocking_pid, blocking.query AS blocking_query,
       blocked.wait_event_type, now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks gl ON gl.database = bl.database AND gl.relation = bl.relation AND gl.pid != bl.pid AND gl.granted
JOIN pg_stat_activity blocking ON blocking.pid = gl.pid
WHERE blocked.state = 'active'
ORDER BY blocked_duration DESC;

Result — all queries blocked by single PID 296302:

blocked_pid | blocked_query                                  | blocking_pid | blocking_query                                                    | wait | blocked_duration
296662      | truncate table public.chains, public.deploy... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:47.364265
296663      | truncate table public.chains, public.deploy... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:47.312739
298104      | DECLARE c1 CURSOR FOR SELECT id, name, net_... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:11.301664
298091      | SELECT ds.id, ss.row_estimate::bigint, ...     | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:03.780706
  • PIDs 296662, 296663: Other nodes' mirror_primary_tables() attempting TRUNCATE — blocked
  • PID 298104: Mirror::read_async() reading public.chains — blocked
  • PID 298091 (+ 20 more): deployment_sizes() querying deployment_schemas — blocked

Query 2 — Lock details (pg_locks joined with pg_class):

SELECT l.pid, l.locktype, l.mode, l.granted, c.relname AS table_name, n.nspname AS schema,
       a.state, left(a.query, 80) AS query, now() - a.query_start AS duration
FROM pg_locks l
JOIN pg_class c ON l.relation = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE n.nspname IN ('public', 'subgraphs', 'info')
  AND c.relname IN ('chains', 'deployment_schemas', 'active_copies', 'subgraph', 'subgraph_version', 'subgraph_deployment_assignment')
ORDER BY c.relname, l.granted DESC, duration DESC;

Result — PID 296302 holds AccessExclusiveLock (granted=True) on ALL 6 tables:

pid    | mode                | granted | table_name                       | schema    | query                                                           | duration
296302 | AccessExclusiveLock | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | RowExclusiveLock    | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | ShareLock           | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296662 | AccessExclusiveLock | False   | chains                           | public    | truncate table public.chains, ...                               | 00:14:15
296663 | AccessExclusiveLock | False   | chains                           | public    | truncate table public.chains, ...                               | 00:14:15
298104 | AccessShareLock     | False   | chains                           | public    | DECLARE c1 CURSOR FOR SELECT id, name, ...                      | 00:13:39
296302 | AccessExclusiveLock | True    | deployment_schemas               | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
298091 | AccessShareLock     | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:13:32
298102 | AccessShareLock     | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:13:32
...    | (18 more blocked)   | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:12:43+
296302 | AccessExclusiveLock | True    | active_copies                    | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph                         | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph_version                 | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph_deployment_assignment   | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39

Query 3 — Primary database health (confirming FDW zombie):

SELECT pid, state, wait_event_type, wait_event, now() - backend_start AS connection_age,
       now() - query_start AS query_duration, left(query, 120) AS query, client_addr
FROM pg_stat_activity WHERE client_addr IS NOT NULL
ORDER BY connection_age DESC LIMIT 30;

Result — zero active FDW queries from the shard on the primary:

pid     | state | query                          | client_addr
2761573 | active| (the diagnostic query itself)  | 101.99.6.63 (pgAdmin)
2995474 | idle  | LISTEN chain_head_updates      | 157.180.55.98 (graph node)
2995850 | idle  | LISTEN store_events            | 157.180.55.97 (graph node)

No FDW connection from the shard — confirming the connection is a zombie. The shard is waiting for a response from a connection that no longer exists on the primary.
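As an immediate operational remediation (not part of this PR's fix), the stuck backend can be terminated on the shard; PostgreSQL then rolls back its transaction, which releases all six ACCESS EXCLUSIVE locks at once:

```sql
-- Run on the shard. 296302 is the lock-holding PID identified in Query 2;
-- substitute the PID found in your own investigation.
SELECT pg_terminate_backend(296302);
```

This only clears the current incident; without the timeouts added below, the next zombie FDW connection reproduces the same lock cascade.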

Why v0.41.0 was not affected

v0.41.0 used PoolInner::with_conn() which provided two protections removed in the diesel-async migration:

  1. limiter semaphore (capacity = pool_size) — serialized all pool operations including mirror_primary_tables(), preventing multiple nodes from issuing concurrent TRUNCATEs
  2. CancelHandle — checked is_canceled() between steps (after TRUNCATE, after each FDW INSERT). When the calling future was dropped, subsequent steps would abort

v0.42.0's mirror_primary_tables() uses self.get().await? directly — no limiter, no cancel mechanism. A stuck FDW query holds ACCESS EXCLUSIVE locks indefinitely with no way to abort.

Note: Even v0.41.0's CancelHandle could not interrupt a stuck FDW query mid-execution (it only checked between steps). The SET LOCAL statement_timeout fix is actually stronger — PostgreSQL kills the query at the database level regardless of execution state.

Fix

Add SET LOCAL timeouts in refresh_tables(), just before the TRUNCATE operation:

conn.batch_execute("SET LOCAL statement_timeout = '60s'; SET LOCAL lock_timeout = '10s';")
    .await
    .map_err(StoreError::from)?;
  • lock_timeout = 10s: If TRUNCATE cannot acquire locks within 10s (another node is mirroring), the transaction aborts instead of queuing. The job retries in 15 minutes.
  • statement_timeout = 60s: If any FDW query (which normally completes in <5s) is stuck for 60s, PostgreSQL kills it. The transaction rolls back, releasing all locks.
  • SET LOCAL: Scoped to this transaction only — zero impact on indexing, GraphQL queries, or any other operations.
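Putting the pieces together, the guarded transaction behaves roughly as follows. This is a sketch of the SQL that refresh_tables() issues per the description above (table names and schemas taken from the lock listing in Query 2); the error messages are PostgreSQL's standard timeout errors:

```sql
BEGIN;
SET LOCAL statement_timeout = '60s';
SET LOCAL lock_timeout = '10s';

-- If another node already holds the locks, this aborts after 10s with
--   ERROR: canceling statement due to lock timeout
TRUNCATE TABLE public.chains, public.deployment_schemas, public.active_copies,
  subgraphs.subgraph, subgraphs.subgraph_version,
  subgraphs.subgraph_deployment_assignment;

-- If the FDW connection is a zombie, this aborts after 60s with
--   ERROR: canceling statement due to statement timeout
INSERT INTO public.chains SELECT * FROM primary_public.chains;
-- ...one INSERT per mirrored table...

COMMIT;
-- On COMMIT (or ROLLBACK after an error) both settings revert to their
-- previous session values; no other transaction ever sees them.
```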

Impact

  • Normal operation: No change — timeout never triggers (operations complete in <5s)
  • FDW zombie scenario: Lock held for max ~70s (vs unlimited before), then auto-released
  • Performance: Zero overhead — SET LOCAL is metadata-only, no timers or polling
  • Single-shard deployments: Not affected (fix is in a code path that only runs on non-primary shards)

Review comment (Collaborator) on:

conn.batch_execute("SET LOCAL statement_timeout = '60s'; SET LOCAL lock_timeout = '10s';")
    .await
    .map_err(StoreError::from)?;

One minor nit: could you move this code down below the declaration of copy_table, i.e., just before the comment // Truncate all tables at once? Mixing code and fn declarations makes this a little hard to read

fix: add statement_timeout and lock_timeout to mirror_primary_tables to prevent FDW zombie locks

Add SET LOCAL timeouts to refresh_tables() to prevent FDW zombie
connections from holding ACCESS EXCLUSIVE locks on mirrored tables
indefinitely. This fixes a v0.42.0 regression where stuck FDW queries
during mirror_primary_tables() cascade into full system lock.

- statement_timeout=60s: kills stuck FDW queries (normally <5s)
- lock_timeout=10s: prevents TRUNCATE stampede from multiple nodes

Both are SET LOCAL (transaction-scoped), zero impact on other queries.
@suntzu93 force-pushed the fix/fdw-zombie-lock-timeout branch from 7066bf4 to de4574f on March 20, 2026 at 22:06
