
fix: add statement_timeout and lock_timeout to mirror_primary_tables … #6446

Open
suntzu93 wants to merge 1 commit into graphprotocol:master from suntzu93:fix/fdw-zombie-lock-timeout

Conversation

@suntzu93 commented Mar 20, 2026

Fix: Add statement_timeout and lock_timeout to mirror_primary_tables to prevent FDW zombie locks

Problem

In multi-shard deployments running v0.42.0, Graph Node periodically hangs with connection checkout timeouts across all shards:

Connection checkout timed out, wait_ms: 30000, pool: main, shard: shard2
Connection checkout timed out, wait_ms: 30001, pool: main, shard: shard3
Connection checkout timed out, wait_ms: 30000, pool: main, shard: shard1

This occurs when multiple sync nodes (with block ingestion enabled) are running in a multi-shard configuration. Single-shard deployments (no FDW) and read-only nodes (DISABLE_BLOCK_INGESTOR=true) are not affected.

Root Cause

Mirror::refresh_tables() is called every 15 minutes by the MirrorPrimary job, which is registered inside spawn_block_ingestor() → register_store_jobs(). This means it runs on every node that has block ingestion enabled (i.e., nodes without DISABLE_BLOCK_INGESTOR=true). It performs:

  1. TRUNCATE on 6 tables at once — acquires ACCESS EXCLUSIVE lock on chains, deployment_schemas, active_copies, subgraph, subgraph_version, subgraph_deployment_assignment
  2. Sequential INSERT ... SELECT * FROM primary_public.{table} via Foreign Data Wrapper (FDW) to repopulate each table

If the FDW connection to the primary becomes a zombie (connection dropped but shard hasn't detected it via TCP keepalive), the FDW INSERT hangs indefinitely on PostgresFdwGetResult while holding ACCESS EXCLUSIVE locks on all 6 tables. This blocks every query touching these tables — including deployment_statuses(), deployment_sizes(), and Mirror::read_async() — causing a lock cascade that exhausts the connection pool.

In setups with multiple sync nodes (one per chain, none setting DISABLE_BLOCK_INGESTOR=true), each node runs MirrorPrimary independently. This creates a TRUNCATE stampede when multiple nodes attempt refresh_tables() on the same shard concurrently — each queuing for the ACCESS EXCLUSIVE lock held by the stuck FDW query.

Conditions to trigger

  1. Multi-shard setup — mirror_primary_tables() is a no-op on the primary shard (if self.shard == *PRIMARY_SHARD { return Ok(()); })
  2. Block ingestion enabled — MirrorPrimary job only runs on nodes without DISABLE_BLOCK_INGESTOR=true
  3. Unstable FDW connection — network issue between shard and primary causes zombie connection (shard waiting for response from a dead connection)

Even with a single ingestor node, condition 3 alone is sufficient to cause indefinite ACCESS EXCLUSIVE locks. Multiple ingestors amplify the impact through TRUNCATE stampede.
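When these conditions are suspected, a stuck refresh can be spotted quickly from pg_stat_activity on the affected shard before running the fuller lock-chain queries below. This filter is an illustrative sketch (not part of the PR); it keys on the primary_public FDW query text that mirror_primary_tables() issues:

```sql
-- Run on the shard: find an active mirror refresh and how long it has been stuck.
SELECT pid, wait_event_type, wait_event,
       now() - query_start AS stuck_for,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%primary_public%';
```

A row sitting here for minutes with wait_event PostgresFdwGetResult is the zombie scenario described above.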

PostgreSQL Investigation

Three diagnostic queries were run on production to confirm the exact lock chain.

Query 1 — Blocking chain (which PID blocks which):

SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
       blocking.pid AS blocking_pid, blocking.query AS blocking_query,
       blocked.wait_event_type, now() - blocked.query_start AS blocked_duration
FROM pg_stat_activity blocked
JOIN pg_locks bl ON bl.pid = blocked.pid AND NOT bl.granted
JOIN pg_locks gl ON gl.database = bl.database AND gl.relation = bl.relation AND gl.pid != bl.pid AND gl.granted
JOIN pg_stat_activity blocking ON blocking.pid = gl.pid
WHERE blocked.state = 'active'
ORDER BY blocked_duration DESC;

Result — all queries blocked by single PID 296302:

blocked_pid | blocked_query                                  | blocking_pid | blocking_query                                                    | wait | blocked_duration
296662      | truncate table public.chains, public.deploy... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:47.364265
296663      | truncate table public.chains, public.deploy... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:47.312739
298104      | DECLARE c1 CURSOR FOR SELECT id, name, net_... | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:11.301664
298091      | SELECT ds.id, ss.row_estimate::bigint, ...     | 296302       | insert into public.chains select * from primary_public.chains;    | Lock | 00:13:03.780706
  • PIDs 296662, 296663: Other nodes' mirror_primary_tables() attempting TRUNCATE — blocked
  • PID 298104: Mirror::read_async() reading public.chains — blocked
  • PID 298091 (+ 20 more): deployment_sizes() querying deployment_schemas — blocked

Query 2 — Lock details (pg_locks joined with pg_class):

SELECT l.pid, l.locktype, l.mode, l.granted, c.relname AS table_name, n.nspname AS schema,
       a.state, left(a.query, 80) AS query, now() - a.query_start AS duration
FROM pg_locks l
JOIN pg_class c ON l.relation = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE n.nspname IN ('public', 'subgraphs', 'info')
  AND c.relname IN ('chains', 'deployment_schemas', 'active_copies', 'subgraph', 'subgraph_version', 'subgraph_deployment_assignment')
ORDER BY c.relname, l.granted DESC, duration DESC;

Result — PID 296302 holds AccessExclusiveLock (granted=True) on ALL 6 tables:

pid    | mode                | granted | table_name                       | schema    | query                                                           | duration
296302 | AccessExclusiveLock | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | RowExclusiveLock    | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | ShareLock           | True    | chains                           | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296662 | AccessExclusiveLock | False   | chains                           | public    | truncate table public.chains, ...                               | 00:14:15
296663 | AccessExclusiveLock | False   | chains                           | public    | truncate table public.chains, ...                               | 00:14:15
298104 | AccessShareLock     | False   | chains                           | public    | DECLARE c1 CURSOR FOR SELECT id, name, ...                      | 00:13:39
296302 | AccessExclusiveLock | True    | deployment_schemas               | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
298091 | AccessShareLock     | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:13:32
298102 | AccessShareLock     | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:13:32
...    | (18 more blocked)   | False   | deployment_schemas               | public    | SELECT ds.id, ss.row_estimate::bigint, ...                      | 00:12:43+
296302 | AccessExclusiveLock | True    | active_copies                    | public    | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph                         | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph_version                 | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39
296302 | AccessExclusiveLock | True    | subgraph_deployment_assignment   | subgraphs | insert into public.chains select * from primary_public.chains;  | 00:13:39

Query 3 — Primary database health (confirming FDW zombie):

SELECT pid, state, wait_event_type, wait_event, now() - backend_start AS connection_age,
       now() - query_start AS query_duration, left(query, 120) AS query, client_addr
FROM pg_stat_activity WHERE client_addr IS NOT NULL
ORDER BY connection_age DESC LIMIT 30;

Result — zero active FDW queries from the shard on the primary:

pid     | state | query                          | client_addr
2761573 | active| (the diagnostic query itself)  | 101.99.6.63 (pgAdmin)
2995474 | idle  | LISTEN chain_head_updates      | 157.180.55.98 (graph node)
2995850 | idle  | LISTEN store_events            | 157.180.55.97 (graph node)

No FDW connection from the shard — confirming the connection is a zombie. The shard is waiting for a response from a connection that no longer exists on the primary.
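As an immediate operational remediation (not part of this PR's fix), the stuck backend can be terminated on the shard; PostgreSQL then rolls back its transaction, which releases all six ACCESS EXCLUSIVE locks at once:

```sql
-- Run on the shard. 296302 is the lock-holding PID identified in Query 2;
-- substitute the PID found in your own investigation.
SELECT pg_terminate_backend(296302);
```

This only clears the current incident; without the timeouts added below, the next zombie FDW connection reproduces the same lock cascade.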

Why v0.41.0 was not affected

v0.41.0 used PoolInner::with_conn() which provided two protections removed in the diesel-async migration:

  1. limiter semaphore (capacity = pool_size) — serialized all pool operations including mirror_primary_tables(), preventing multiple nodes from issuing concurrent TRUNCATEs
  2. CancelHandle — checked is_canceled() between steps (after TRUNCATE, after each FDW INSERT). When the calling future was dropped, subsequent steps would abort

v0.42.0's mirror_primary_tables() uses self.get().await? directly — no limiter, no cancel mechanism. A stuck FDW query holds ACCESS EXCLUSIVE locks indefinitely with no way to abort.

Note: Even v0.41.0's CancelHandle could not interrupt a stuck FDW query mid-execution (it only checked between steps). The SET LOCAL statement_timeout fix is actually stronger — PostgreSQL kills the query at the database level regardless of execution state.

Fix

Add SET LOCAL timeouts in refresh_tables(), just before the TRUNCATE operation:

conn.batch_execute("SET LOCAL statement_timeout = '60s'; SET LOCAL lock_timeout = '10s';")
    .await
    .map_err(StoreError::from)?;
  • lock_timeout = 10s: If TRUNCATE cannot acquire locks within 10s (another node is mirroring), the transaction aborts instead of queuing. The job retries in 15 minutes.
  • statement_timeout = 60s: If any FDW query (which normally completes in <5s) is stuck for 60s, PostgreSQL kills it. The transaction rolls back, releasing all locks.
  • SET LOCAL: Scoped to this transaction only — zero impact on indexing, GraphQL queries, or any other operations.
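Putting the pieces together, the guarded transaction behaves roughly as follows. This is a sketch of the SQL that refresh_tables() issues per the description above (table names and schemas taken from the lock listing in Query 2); the error messages are PostgreSQL's standard timeout errors:

```sql
BEGIN;
SET LOCAL statement_timeout = '60s';
SET LOCAL lock_timeout = '10s';

-- If another node already holds the locks, this aborts after 10s with
--   ERROR: canceling statement due to lock timeout
TRUNCATE TABLE public.chains, public.deployment_schemas, public.active_copies,
  subgraphs.subgraph, subgraphs.subgraph_version,
  subgraphs.subgraph_deployment_assignment;

-- If the FDW connection is a zombie, this aborts after 60s with
--   ERROR: canceling statement due to statement timeout
INSERT INTO public.chains SELECT * FROM primary_public.chains;
-- ...one INSERT per mirrored table...

COMMIT;
-- On COMMIT (or ROLLBACK after an error) both settings revert to their
-- previous session values; no other transaction ever sees them.
```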

Impact

  • Normal operation: No change — timeout never triggers (operations complete in <5s)
  • FDW zombie scenario: Lock held for max ~70s (vs unlimited before), then auto-released
  • Performance: Zero overhead — SET LOCAL is metadata-only, no timers or polling
  • Single-shard deployments: Not affected (fix is in a code path that only runs on non-primary shards)

Review comment (Collaborator) on:

conn.batch_execute("SET LOCAL statement_timeout = '60s'; SET LOCAL lock_timeout = '10s';")
    .await
    .map_err(StoreError::from)?;

One minor nit: could you move this code down below the declaration of copy_table, i.e., just before the comment // Truncate all tables at once? Mixing code and fn declarations makes this a little hard to read

fix: add statement_timeout and lock_timeout to mirror_primary_tables to prevent FDW zombie locks

Add SET LOCAL timeouts to refresh_tables() to prevent FDW zombie
connections from holding ACCESS EXCLUSIVE locks on mirrored tables
indefinitely. This fixes a v0.42.0 regression where stuck FDW queries
during mirror_primary_tables() cascade into full system lock.

- statement_timeout=60s: kills stuck FDW queries (normally <5s)
- lock_timeout=10s: prevents TRUNCATE stampede from multiple nodes

Both are SET LOCAL (transaction-scoped), zero impact on other queries.
@suntzu93 force-pushed the fix/fdw-zombie-lock-timeout branch from 7066bf4 to de4574f on March 20, 2026 at 22:06
