Conversation

@ibrarahmad
Contributor

The relmetacache_flush function was calling hash_search with HASH_REMOVE while iterating through RelMetaCache using hash_seq_search. This corrupts the iterator state because removing an entry deallocates memory that the iterator references, reorganizes bucket chains that invalidate the iterator's position, and may trigger dynamic resizing. The result is segmentation faults from accessing freed memory, skipped entries when the iterator loses its position, or infinite loops from corrupted chain pointers. Fixed by collecting all relids in a first pass, then removing them in a second pass after iteration completes.

The relmetacache_prune function had the same issue when removing invalid cache entries during iteration. It was calling hash_search with HASH_REMOVE on entries marked as invalid while hash_seq_search was actively scanning the hash table. This causes identical corruption problems, as the iterator's cached bucket index and chain pointers become stale when entries are physically removed from the table. Fixed using the same two-pass pattern: collect relids of invalid entries during iteration, then remove them after the scan finishes.

The reset_channel_stats function was removing all statistics entries from SpockHash while iterating with hash_seq_search. This violated the fundamental rule that hash tables must not be modified during sequential scans. The iterator maintains pointers to the current entry and the next entry to visit, both of which become dangling pointers when hash_search removes the current entry. Under concurrent load this causes crashes with "hash table corrupted" errors and potential loss of statistics data. Fixed by copying all keys during iteration, then removing entries in a separate loop after the iterator completes its scan.
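
A minimal sketch of the two-pass pattern, assuming a file-level `static HTAB *RelMetaCache` and an entry type keyed by `Oid relid`; the entry layout below is illustrative, while the dynahash calls (hash_seq_init, hash_seq_search, hash_search, hash_get_num_entries) are the real PostgreSQL API:

```c
typedef struct RelMetaCacheEntry
{
    Oid         relid;      /* hash key (illustrative layout) */
    bool        is_valid;   /* other fields elided */
} RelMetaCacheEntry;

static void
relmetacache_flush(void)
{
    HASH_SEQ_STATUS status;
    RelMetaCacheEntry *entry;
    Oid        *relids;
    long        n = 0;

    if (RelMetaCache == NULL)
        return;

    relids = (Oid *) palloc(hash_get_num_entries(RelMetaCache) * sizeof(Oid));

    /* Pass 1: collect keys; the table is not modified during the scan. */
    hash_seq_init(&status, RelMetaCache);
    while ((entry = hash_seq_search(&status)) != NULL)
        relids[n++] = entry->relid;

    /* Pass 2: remove entries only after the sequential scan has ended. */
    for (long i = 0; i < n; i++)
        (void) hash_search(RelMetaCache, &relids[i], HASH_REMOVE, NULL);

    pfree(relids);
}
```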

@mason-sharp requested review from danolivo and rasifr on January 7, 2026 at 17:41
@danolivo
Contributor

danolivo commented Jan 8, 2026

In the code:

```c
old_ctxt = MemoryContextSwitchTo(RelMetaCacheContext);
RelMetaCache = hash_create("spock relation metadata cache",
                           RELMETACACHE_INITIAL_SIZE,
                           &ctl, hash_flags);
(void) MemoryContextSwitchTo(old_ctxt);
```

It is not necessary to use MemoryContextSwitchTo here at all: hash_create was already instructed (via the HASH_CONTEXT flag and ctl.hcxt) to use a specific memory context. Hence, we may remove the old_ctxt variable altogether.
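
A minimal sketch of the simplified setup under this suggestion, assuming the table is created with `ctl.hcxt = RelMetaCacheContext` and the `HASH_CONTEXT` flag; the key/entry sizes and the exact flag set are assumptions for illustration:

```c
HASHCTL     ctl;

MemSet(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);                  /* assumed key type */
ctl.entrysize = sizeof(RelMetaCacheEntry);  /* assumed entry type */
ctl.hcxt = RelMetaCacheContext;

/*
 * HASH_CONTEXT makes dynahash perform all of its allocations in
 * ctl.hcxt, so no MemoryContextSwitchTo() is needed around
 * hash_create() or the later hash_search(HASH_ENTER) calls.
 */
RelMetaCache = hash_create("spock relation metadata cache",
                           RELMETACACHE_INITIAL_SIZE,
                           &ctl,
                           HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
```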

Ibrar Ahmed and others added 5 commits on January 8, 2026 at 19:41
The previous commit added extensive documentation explaining the hash
table iterator corruption issue. While thorough, these comments were
overly verbose for production code. This commit condenses them to be
more concise while preserving the essential information: the problem
(calling hash_search HASH_REMOVE during hash_seq_search corrupts the
iterator state), the consequences (crashes, skipped entries), and the
solution (two-pass approach: collect keys, then remove).

Simplified comments in relmetacache_flush, relmetacache_prune, and
reset_channel_stats functions without changing any code logic.

The hash_create call was configured with the HASH_CONTEXT flag and
ctl.hcxt = RelMetaCacheContext, which instructs the hash table API to
perform all allocations in the specified context automatically. The
explicit MemoryContextSwitchTo calls around hash_create and hash_search
were therefore redundant and have been removed.

This simplifies the code by removing 2 variable declarations and 4
lines of context switch boilerplate. The hash table operations will
continue to use RelMetaCacheContext for all allocations as configured.

1. Missing NULL checks after hash_search(HASH_ENTER) on HASH_FIXED_SIZE
   tables. When SpockHash or SpockGroupHash reach capacity, hash_search
   returns NULL instead of expanding the table. Dereferencing NULL causes
   segmentation faults. Added NULL checks in handle_stats_counter(),
   spock_group_attach(), and spock_group_progress_update().

2. Race condition in handle_stats_counter() where the function releases
   LW_SHARED, checks hash fullness without any lock, then reacquires
   LW_EXCLUSIVE. Between these locks, concurrent processes can fill the
   table, causing HASH_ENTER to return NULL unexpectedly. The added NULL
   check handles this TOCTOU (time-of-check-to-time-of-use) race.

3. Lock mismatch in get_channel_stats() causing torn reads of 64-bit
   counter values. The reader held only LW_SHARED while writers used
   per-entry spinlocks (entry->mutex). Added spinlock acquisition in
   reader to ensure atomic access and prevent data races that violate
   the C memory model.

All three fixes prevent crashes from NULL pointer dereference and ensure
data consistency for statistics counters in shared memory hash tables.
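
A minimal sketch of the NULL check and per-entry spinlock described above; the `SpockStatsEntry` layout, `key`, `counter`, and the warning text are illustrative assumptions, while `hash_search`, `SpinLockInit`, `SpinLockAcquire`, and `SpinLockRelease` are the real PostgreSQL primitives:

```c
SpockStatsEntry *entry;     /* hypothetical entry type */
bool        found;

entry = (SpockStatsEntry *) hash_search(SpockHash, &key,
                                        HASH_ENTER, &found);
if (entry == NULL)
{
    /*
     * A HASH_FIXED_SIZE shared-memory table cannot grow, so HASH_ENTER
     * returns NULL when the table is full, including when a concurrent
     * backend filled it in the TOCTOU window between our lock release
     * and reacquire.  Degrade gracefully instead of dereferencing NULL.
     */
    ereport(WARNING,
            (errmsg("spock statistics hash table is full")));
    return;
}

if (!found)
{
    memset(entry->counters, 0, sizeof(entry->counters));
    SpinLockInit(&entry->mutex);
}

/*
 * The per-entry spinlock makes updates of the 64-bit counters atomic
 * with respect to readers such as get_channel_stats(), which must take
 * the same spinlock to avoid torn reads.
 */
SpinLockAcquire(&entry->mutex);
entry->counters[counter]++;
SpinLockRelease(&entry->mutex);
```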

Replace the two-pass approach (collect keys then remove) with the simpler
pattern used in PostgreSQL core code: iterate once and remove entries
during iteration. This matches the style used in ShippableCache and
other similar hash table flush operations in PostgreSQL.

The previous implementation re-initialized the iterator after each
removal to avoid iterator corruption, but PostgreSQL's pattern of
removing entries during iteration is safe here: dynahash explicitly
permits deleting the entry most recently returned by hash_seq_search,
which is all a full flush ever does.
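
A minimal sketch of the single-pass pattern in the core style; the entry type is illustrative, and the elog on a failed HASH_REMOVE mirrors what PostgreSQL core does in equivalent flush loops:

```c
HASH_SEQ_STATUS status;
RelMetaCacheEntry *entry;

/*
 * dynahash permits deleting the entry most recently returned by
 * hash_seq_search(), so a flush can remove entries in a single scan.
 */
hash_seq_init(&status, RelMetaCache);
while ((entry = hash_seq_search(&status)) != NULL)
{
    if (hash_search(RelMetaCache, &entry->relid,
                    HASH_REMOVE, NULL) == NULL)
        elog(ERROR, "hash table corrupted");
}
```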
@danolivo
Contributor

There are already 6 commits on the branch. Maybe squash/restructure them somehow?

@ibrarahmad
Contributor Author

New clean PR created
