Skip to content

(fix): fallback type memory leak in checkpoint/rewind cycle#19

Merged
mgyoo86 merged 5 commits intomasterfrom
fix/memory_leak
Mar 4, 2026
Merged

(fix): fallback type memory leak in checkpoint/rewind cycle#19
mgyoo86 merged 5 commits intomasterfrom
fix/memory_leak

Conversation

@mgyoo86
Copy link
Collaborator

@mgyoo86 mgyoo86 commented Mar 4, 2026

Problem

Fallback types (e.g., ForwardDiff.Dual) stored in pool.others leaked n_active on repeated @with_pool calls. After N iterations, n_active grew to N instead of returning to 0.

Root cause — two independent bugs in _touched_has_others tracking:

  1. get_typed_pool!: When creating a new fallback type inside a checkpoint scope, _touched_has_others was not set to true. The rewind path (_typed_lazy_rewind!) skipped pool.others entirely.

  2. checkpoint!(pool, types...): Always pushed has_others = false, regardless of whether any type was a fallback. On the 2nd+ call (type already exists, get_typed_pool! closure doesn't fire), the flag stayed false — causing the same skip in _typed_lazy_rewind!.

Fix

Location Change
get_typed_pool! Set _touched_has_others[depth] = true when auto-checkpointing a new fallback type
checkpoint!(pool, ::Type{T}) Push _fixed_slot_bit(T) == UInt16(0) (constant-folded at compile time)
checkpoint!(pool, types...) Compute has_any_fallback at compile time in @generated body via _fixed_slot_bit
_checkpoint_typed_pool! Add idempotency guard to prevent double-push from get_typed_pool! auto-checkpoint + explicit checkpoint
CUDA extension Mirror all fixes for CuAdaptiveArrayPool parity

Performance impact: zero. All fallback detection is resolved at compile time (@generated body or constant folding). The idempotency guard adds one @inbounds array-end check (~0.3ns).

Tests

  • Added 28-section test_fallback_reclamation.jl covering all fallback lifecycle scenarios
  • Tests 29, 29b specifically reproduce the real-world ForwardDiff.gradient leak pattern with repeated typed checkpoints
  • TDD-validated: tests fail without fix, pass with fix
  • Full suite: 387 tests passing, 96% coverage

mgyoo86 added 2 commits March 3, 2026 21:26
…ype leak

When a new fallback type (non-fixed-slot) was first created inside a
@with_pool scope, get_typed_pool! auto-checkpointed it but did not set
_touched_has_others[depth] = true. In typed-lazy mode, _acquire_impl!
bypasses _record_type_touch!, so the has_others flag stayed false,
causing lazy/typed-lazy rewind to skip pool.others entirely and leak
the new type's n_active count.

Fix: set _touched_has_others = true in get_typed_pool!'s slow path
(both CPU and CUDA extension).

Add test/test_fallback_reclamation.jl with 28 test sections (387 tests)
covering fallback type reclamation, parametric struct (FakeDual) leak
detection, and the exact _acquire_impl! bypass bug reproduction.
…eckpoints

checkpoint!(pool, types...) always pushed has_others=false, causing
_typed_lazy_rewind! to skip pool.others on 2nd+ calls when the fallback
type already existed. Fix: compute has_any_fallback at compile time via
_fixed_slot_bit check. Add idempotency guard to _checkpoint_typed_pool!
to prevent double-push from get_typed_pool! auto-checkpoint + explicit
checkpoint call. Apply same fixes to CUDA extension for parity.
@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.47%. Comparing base (3645a32) to head (13a64de).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #19   +/-   ##
=======================================
  Coverage   96.46%   96.47%           
=======================================
  Files           9        9           
  Lines        1273     1276    +3     
=======================================
+ Hits         1228     1231    +3     
  Misses         45       45           
Files with missing lines Coverage Δ
src/state.jl 98.71% <100.00%> (+0.01%) ⬆️
src/types.jl 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a leak in fallback-type (pool.others) accounting during repeated checkpoint/rewind cycles—especially in typed / typed-lazy macro paths where _acquire_impl! bypasses _record_type_touch!—and adds regression tests (including a ForwardDiff.Dual-like scenario) plus CUDA extension parity.

Changes:

  • Ensure fallback pools mark has_others during first-touch creation (get_typed_pool!) so lazy/typed-lazy rewind doesn’t skip pool.others.
  • Correct checkpoint! (single- and multi-type) to push has_others=true when any tracked type is a fallback; add an idempotency guard to avoid double checkpoint pushes.
  • Add an extensive fallback reclamation test suite and include it in runtests.jl; mirror fixes in the CUDA extension.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/types.jl Marks _touched_has_others when creating a new fallback typed pool inside a checkpointed scope.
src/state.jl Fixes has_others tracking for typed checkpoints; adds checkpoint idempotency guard.
ext/AdaptiveArrayPoolsCUDAExt/dispatch.jl Mirrors the get_typed_pool! fallback-touch fix for CUDA pools.
ext/AdaptiveArrayPoolsCUDAExt/state.jl Mirrors has_others logic for CUDA typed checkpoints (single + varargs).
test/test_fallback_reclamation.jl Adds comprehensive regression coverage for fallback lifecycle / leak scenarios.
test/runtests.jl Runs the new fallback reclamation test file in the suite.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mgyoo86 added 2 commits March 3, 2026 21:42
Replace fresh_pool assertion (always empty, never tests actual rewind)
with task-local pool n_active checks for UInt8 and Float16.
Sentinel pattern ([0] initial value) guarantees _checkpoint_depths is
never empty. _rewind_typed_pool! already relies on this invariant with
unguarded @inbounds access — align checkpoint to the same contract.
@mgyoo86 mgyoo86 merged commit f3299c2 into master Mar 4, 2026
10 checks passed
@mgyoo86 mgyoo86 deleted the fix/memory_leak branch March 4, 2026 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants