ProjectTorreyPines · mgyoo86 · Mar 6, 2026 · Mar 5, 2026
diff --git a/.codecov.yml b/.codecov.yml
@@ -10,6 +10,7 @@ coverage:
 
 ignore:
   - "ext/**/*"
+  - "src/legacy/**/*"
 
 comment:
   layout: "reach,diff,flags,files"

diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "AdaptiveArrayPools"
 uuid = "4f381ef7-9af0-4cbe-99d4-cf36d7b0f233"
-version = "0.2.1"
 authors = ["Min-Gu Yoo <mgyoo86@gmail.com>"]
+version = "0.2.1"
 
 [deps]
 Preferences = "21216c6a-2e73-6563-6e65-726566657250"
@@ -14,7 +14,7 @@ CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 AdaptiveArrayPoolsCUDAExt = "CUDA"
 
 [compat]
-julia = "1.10"
-Preferences = "1"
 CUDA = "5"
-Printf = "1"
+Preferences = "1"
+Printf = "1"
+julia = "1.10"
diff --git a/docs/design/cuda_extension_design.md b/docs/design/cuda_extension_design.md
@@ -1,9 +1,10 @@
 # AdaptiveArrayPools.jl CUDA Extension Design
 
-> **Status**: Draft v0.6 (Post-Review Revision)
-> **Version**: 0.6
-> **Date**: 2024-12-14
-> **Authors**: Design discussion with AI assistance
+> **Update (v0.2.2, feat/new_array_nd)**: The CPU path now uses `setfield!`-based wrapper
+> reuse (Julia 1.11+) instead of the N-way cache for `unsafe_acquire!`. The **CUDA extension
+> still uses the N-way set-associative cache** described in this document, since `CuArray`
+> does not support `setfield!`-based field mutation. `CACHE_WAYS` and `set_cache_ways!` are
+> now only relevant for the CUDA backend (and the Julia 1.10 legacy CPU path).
 
 ## 1. Executive Summary
 

diff --git a/docs/design/hybrid_api_design.md b/docs/design/hybrid_api_design.md
@@ -1,5 +1,13 @@
 # Hybrid API Design: acquire! vs unsafe_acquire!
 
+> **Update (v0.2.2, feat/new_array_nd)**: The `unsafe_acquire!` path no longer uses
+> `unsafe_wrap` + N-way cache on Julia 1.11+ CPU. Instead, it uses `setfield!`-based
+> wrapper reuse — **0-alloc for any number of dimension patterns** (no eviction limit).
+> The N-way cache (`CACHE_WAYS`) is now only used by the **CUDA** backend and the
+> **Julia 1.10 legacy** fallback. The `acquire!` → `ReshapedArray` path is unchanged.
+> `TypedPool` fields changed: `nd_arrays`/`nd_dims`/`nd_ptrs`/`nd_next_way` →
+> `nd_wrappers::Vector{Union{Nothing, Vector{Any}}}`.
+
 ## Executive Summary
 
 Redesigning `AdaptiveArrayPools.jl`'s N-D array acquisition API with a **Two Tools Strategy**:

diff --git a/docs/design/nd_array_approach_comparison.md b/docs/design/nd_array_approach_comparison.md
@@ -1,5 +1,12 @@
 # N-D Array Approach Comparison: unsafe_wrap vs ReshapedArray
 
+> **Update (v0.2.2, feat/new_array_nd)**: The N-way set-associative cache described in this
+> document has been **superseded on Julia 1.11+ CPU** by `setfield!`-based wrapper reuse
+> (`nd_wrappers` indexed by dimensionality N). This achieves **0-alloc for unlimited dimension
+> patterns** — no eviction, no `CACHE_WAYS` limit. The N-way cache remains in use for
+> **CUDA** and the **Julia 1.10 legacy** path. The `acquire!` → `ReshapedArray` path is
+> unchanged. See `src/acquire.jl` and `src/types.jl` for the current implementation.
+
 ## Summary
 
 This document analyzes two approaches for returning N-dimensional arrays from AdaptiveArrayPools:

diff --git a/docs/src/architecture/design-docs.md b/docs/src/architecture/design-docs.md
@@ -10,7 +10,7 @@ For in-depth analysis of design decisions, implementation tradeoffs, and archite
 ## Caching & Performance
 
 - **[nd_array_approach_comparison.md](https://github.com/ProjectTorreyPines/AdaptiveArrayPools.jl/blob/master/docs/design/nd_array_approach_comparison.md)**
-  N-way cache design, boxing analysis, and ReshapedArray benchmarks
+  N-way cache design (now legacy — replaced by `setfield!` reuse on Julia 1.11+ CPU), boxing analysis, and ReshapedArray benchmarks
 
 - **[fixed_slots_codegen_design.md](https://github.com/ProjectTorreyPines/AdaptiveArrayPools.jl/blob/master/docs/design/fixed_slots_codegen_design.md)**
   Zero-allocation iteration via `@generated` functions and fixed-slot type dispatch
@@ -32,7 +32,7 @@ For in-depth analysis of design decisions, implementation tradeoffs, and archite
 | Document | Focus Area | Key Insights |
 |----------|------------|--------------|
 | hybrid_api_design | API strategy | View types for zero-alloc, Array for FFI |
-| nd_array_approach_comparison | Caching | N-way associative cache reduces header allocation |
+| nd_array_approach_comparison | Caching | N-way cache (legacy); setfield! reuse on Julia 1.11+ CPU |
 | fixed_slots_codegen_design | Codegen | @generated functions enable type-stable iteration |
 | untracked_acquire_design | Macro safety | Sentinel pattern ensures correct cleanup |
 | cuda_extension_design | GPU support | Seamless CPU/CUDA API parity |

diff --git a/docs/src/architecture/how-it-works.md b/docs/src/architecture/how-it-works.md
@@ -82,57 +82,66 @@ end
 
 When you call `acquire!(pool, Float64, n)`, the compiler inlines directly to `pool.float64` — no dictionary lookup, no type instability.
 
-## N-Way Set Associative Cache
+## N-D Wrapper Reuse (CPU)
 
-For `unsafe_acquire!` (which returns native `Array` types), we use an N-way cache to reduce header allocation:
+For `unsafe_acquire!` (which returns native `Array` types), the caching strategy depends on the Julia version:
+
+### Julia 1.11+: `setfield!`-based Wrapper Reuse (Zero-Allocation)
+
+Julia 1.11 changed `Array` from an opaque C struct to a mutable Julia struct with `ref::MemoryRef{T}` and `size::NTuple{N,Int}` fields. This enables in-place mutation of cached `Array` wrappers via `setfield!`:
 
 ```
-                    CACHE_WAYS = 4 (default)
-                    ┌────┬────┬────┬────┐
-Slot 0 (Float64):   │way0│way1│way2│way3│  ← round-robin eviction
-                    └────┴────┴────┴────┘
-                    ┌────┬────┬────┬────┐
-Slot 1 (Float32):   │way0│way1│way2│way3│
-                    └────┴────┴────┴────┘
-                    ...
+nd_wrappers[N][slot] → cached Array{T,N}
+    │
+    ├─ setfield!(:ref, new_memory_ref)   ← update backing memory (0-alloc)
+    └─ setfield!(:size, new_dims)        ← update dimensions (0-alloc)
 ```
 
-### Cache Lookup Pseudocode
+**Result**: Unlimited dimension patterns per slot with **zero allocation** after warmup. No eviction, no round-robin, no `CACHE_WAYS` limit.
 
 ```julia
+# Pseudocode for Julia 1.11+ path
 function unsafe_acquire!(pool, T, dims...)
     typed_pool = get_typed_pool!(pool, T)
-    slot = n_active + 1
-    base = (slot - 1) * CACHE_WAYS
-
-    # Search all ways for matching dimensions
-    for k in 1:CACHE_WAYS
-        idx = base + k
-        if dims == typed_pool.nd_dims[idx]
-            # Cache hit! Check if underlying vector was resized
-            if pointer matches
-                return typed_pool.nd_arrays[idx]
-            end
-        end
+    flat_view = get_view!(typed_pool, prod(dims))
+    slot = typed_pool.n_active
+
+    # Direct index lookup by dimensionality N = length(dims) (~1ns)
+    wrapper = typed_pool.nd_wrappers[N][slot]
+    if wrapper !== nothing
+        setfield!(wrapper, :ref, getfield(parent(flat_view), :ref))  # 0-alloc
+        setfield!(wrapper, :size, dims)                              # 0-alloc
+        return wrapper
     end
 
-    # Cache miss: create new Array header, store in next way (round-robin)
-    way = typed_pool.nd_next_way[slot]
-    typed_pool.nd_next_way[slot] = (way + 1) % CACHE_WAYS
-    # ... create and cache Array ...
+    # First call for this (slot, N): unsafe_wrap once, cached forever
+    arr = wrap_array(typed_pool, flat_view, dims)
+    store_wrapper!(typed_pool, N, slot, arr)
+    return arr
 end
 ```
 
-**Key insight**: Even on cache miss, only the `Array` header (~80-144 bytes) is allocated. The actual data memory is always reused from the pool.
+### Julia 1.10 (Legacy): N-Way Set Associative Cache
+
+On Julia 1.10, `Array` fields cannot be mutated, so the legacy path uses a 4-way set-associative cache with round-robin eviction:
+
+- Cache hit (≤`CACHE_WAYS` dimension patterns per slot): **0 bytes**
+- Cache miss (>`CACHE_WAYS` patterns): **~80-144 bytes** per `unsafe_wrap` call
+
+See [Configuration](../features/configuration.md) for `CACHE_WAYS` tuning (Julia 1.10 / CUDA only).
+
+### CUDA: N-Way Cache
+
+The CUDA backend still uses the N-way set-associative cache (same as Julia 1.10 legacy), since `CuArray` does not support `setfield!`-based mutation.
 
 ## View vs Array Return Types
 
 Type stability is critical for performance. AdaptiveArrayPools provides two APIs:
 
-| API | 1D Return | N-D Return | Allocation |
-|-----|-----------|------------|------------|
-| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes |
-| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (hit) / ~100 bytes (miss) |
+| API | 1D Return | N-D Return | Allocation (Julia 1.11+) | Allocation (Julia 1.10 / CUDA) |
+|-----|-----------|------------|--------------------------|-------------------------------|
+| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes | Always 0 bytes |
+| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (setfield! reuse) | 0 bytes (hit) / ~100 bytes (miss) |
 
 !!! note "`Bit` type behavior"
     For `T === Bit`, both `acquire!` and `unsafe_acquire!` return native `BitVector` / `BitArray{N}` (not views). Cache hit achieves 0 bytes allocation.
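
The `setfield!` reuse idea from the Julia 1.11+ section can be tried outside the pool machinery. This is a minimal sketch, not the package's code: `retarget!` is a hypothetical helper name, and the example assumes Julia 1.11 or newer, where `Array` exposes the `ref` and `size` fields. The actual implementation in `src/acquire.jl` may differ.

```julia
# Hypothetical helper (not part of AdaptiveArrayPools): re-point an existing
# Array wrapper at a backing vector and give it a new shape, without creating
# a new wrapper. Relies on Julia 1.11+ internals (`ref` and `size` fields).
function retarget!(wrapper::Array{T,N}, backing::Vector{T}, dims::NTuple{N,Int}) where {T,N}
    prod(dims) <= length(backing) || throw(ArgumentError("backing vector too small"))
    setfield!(wrapper, :ref, getfield(backing, :ref))  # share the backing memory
    setfield!(wrapper, :size, dims)                    # update the dimensions
    return wrapper
end

if VERSION >= v"1.11"
    backing = collect(1.0:12.0)              # stands in for a pooled vector
    wrapper = Matrix{Float64}(undef, 0, 0)   # created once, reused afterwards

    a = retarget!(wrapper, backing, (3, 4))
    b = retarget!(wrapper, backing, (2, 6))  # same object, different shape
    b[1, 1] = 99.0                           # writes through to `backing`
end
```

Because `a` and `b` are the same wrapper object, any number of shapes with the same dimensionality can be served by one cached `Array{T,N}`, which is why no eviction policy is needed on this path.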

diff --git a/docs/src/architecture/type-dispatch.md b/docs/src/architecture/type-dispatch.md
@@ -21,57 +21,39 @@ end
 
 When you call `acquire!(pool, Float64, n)`, the compiler inlines directly to `pool.float64` - no dictionary lookup, no type instability.
 
-## N-Way Set Associative Cache
+## N-D Wrapper Caching for `unsafe_acquire!`
 
-For `unsafe_acquire!` (which returns native `Array` types), we use an N-way cache to reduce header allocation:
+`unsafe_acquire!` returns native `Array` types. The caching strategy depends on Julia version:
 
-```
-                    CACHE_WAYS = 4 (default)
-                    +----+----+----+----+
-Slot 0 (Float64):   |way0|way1|way2|way3|  <-- round-robin eviction
-                    +----+----+----+----+
-                    +----+----+----+----+
-Slot 1 (Float32):   |way0|way1|way2|way3|
-                    +----+----+----+----+
-                    ...
-```
+### Julia 1.11+: `setfield!`-based Wrapper Reuse
 
-### Cache Lookup Logic
+Julia 1.11 made `Array` a mutable struct, enabling in-place field mutation:
 
 ```julia
-function unsafe_acquire!(pool, T, dims...)
-    typed_pool = get_typed_pool!(pool, T)
-    slot = n_active + 1
-    base = (slot - 1) * CACHE_WAYS
-
-    # Search all ways for matching dimensions
-    for k in 1:CACHE_WAYS
-        idx = base + k
-        if dims == typed_pool.nd_dims[idx]
-            # Cache hit! Check if underlying vector was resized
-            if pointer matches
-                return typed_pool.nd_arrays[idx]
-            end
-        end
-    end
-
-    # Cache miss: create new Array header, store in next way (round-robin)
-    way = typed_pool.nd_next_way[slot]
-    typed_pool.nd_next_way[slot] = (way % CACHE_WAYS) + 1
-    # ... create and cache Array ...
-end
+# Cached wrapper reuse via setfield! (0-alloc)
+setfield!(cached_arr, :ref, new_memory_ref)   # update backing memory
+setfield!(cached_arr, :size, new_dims)         # update dimensions
 ```
 
-**Key insight**: Even on cache miss, only the `Array` header (~80-144 bytes) is allocated. The actual data memory is always reused from the pool.
+Wrappers are stored in `nd_wrappers::Vector{Union{Nothing, Vector{Any}}}`, indexed directly by dimensionality N (~1ns lookup). **Unlimited dimension patterns per slot, zero allocation after warmup.**
+
+### Julia 1.10 / CUDA: N-Way Set Associative Cache
+
+On Julia 1.10 (CPU) and CUDA, `Array`/`CuArray` fields cannot be mutated. These paths use a 4-way set-associative cache with round-robin eviction (`CACHE_WAYS = 4` default):
+
+- **Cache hit** (≤4 dim patterns per slot): 0 bytes
+- **Cache miss** (>4 patterns): ~80-144 bytes for Array header allocation
+
+See [Configuration](../features/configuration.md) for `CACHE_WAYS` tuning.
 
 ---
 
 ## View vs Array: When to Use What?
 
-| API | Return Type | Allocation | Recommended For |
-|-----|-------------|------------|-----------------|
-| `acquire!` | `SubArray` / `ReshapedArray` | **Always 0 bytes** | 99% of cases |
-| `unsafe_acquire!` | `Vector` / `Array` | 0-144 bytes | FFI, type constraints |
+| API | Return Type | Allocation (Julia 1.11+) | Allocation (1.10 / CUDA) | Recommended For |
+|-----|-------------|--------------------------|--------------------------|-----------------|
+| `acquire!` | `SubArray` / `ReshapedArray` | **Always 0 bytes** | **Always 0 bytes** | 99% of cases |
+| `unsafe_acquire!` | `Vector` / `Array` | **0 bytes** (setfield! reuse) | 0-144 bytes (N-way cache) | FFI, type constraints |
 
 ### Why View is the Default
 
@@ -116,15 +98,15 @@ end
 
 | Operation | acquire! (View) | unsafe_acquire! (Array) |
 |-----------|-----------------|-------------------------|
-| Allocation (cached) | 0 bytes | 0 bytes |
-| Allocation (miss) | 0 bytes | 80-144 bytes |
+| Allocation (Julia 1.11+) | 0 bytes | 0 bytes (setfield! reuse) |
+| Allocation (Julia 1.10 / CUDA) | 0 bytes | 0 bytes (hit) / 80-144 bytes (miss) |
 | BLAS operations | Identical | Identical |
 | Type stability | Guaranteed | Guaranteed |
 | FFI compatibility | Requires conversion | Direct |
 
-### Header Size by Dimensionality
+### Header Size by Dimensionality (Julia 1.10 / CUDA only)
 
-When `unsafe_acquire!` has a cache miss:
+On Julia 1.11+ CPU, `unsafe_acquire!` is always zero-allocation via `setfield!` reuse. On Julia 1.10 and CUDA, a cache miss allocates an `Array` header:
 
 | Dimensions | Header Size |
 |------------|-------------|

diff --git a/docs/src/basics/api-essentials.md b/docs/src/basics/api-essentials.md
@@ -21,7 +21,7 @@ end
 
 ### `unsafe_acquire!(pool, T, dims...)`
 
-Returns a native `Array` type. **Zero-allocation on cache hit**—only allocates a small header (~80-144 bytes) on cache miss. Use when you specifically need `Array{T,N}`:
+Returns a native `Array` type. On **Julia 1.11+**, it is always **zero-allocation** via `setfield!`-based wrapper reuse (unlimited dimension patterns). On Julia 1.10 and CUDA, it is zero-allocation on a cache hit and allocates only a small header (~80-144 bytes) on a cache miss. Use it when you specifically need `Array{T,N}`:
 
 ```julia
 @with_pool pool begin
@@ -36,7 +36,7 @@ end
 ```
 
 !!! tip "Cache behavior"
-    Same dimension pattern → **0 bytes**. Different pattern → 80-144 bytes header only (data memory always reused). See [N-Way Cache](../architecture/type-dispatch.md#n-way-set-associative-cache) for details.
+    On Julia 1.11+: **always 0 bytes** regardless of dimension pattern (setfield!-based reuse). On Julia 1.10 / CUDA: same dimension pattern → 0 bytes, different pattern → 80-144 bytes header only (data always reused). See [N-D Wrapper Caching](../architecture/type-dispatch.md#n-d-wrapper-caching-for-unsafe_acquire) for details.
 
 !!! note "`Bit` behavior"
     For `T === Bit`, `unsafe_acquire!` is equivalent to `acquire!` and returns native `BitVector`/`BitArray{N}`.
@@ -113,7 +113,7 @@ end
 | Function | Returns | Allocation | Use Case |
 |----------|---------|------------|----------|
 | `acquire!(pool, T, dims...)` | View type | 0 bytes | Default choice |
-| `unsafe_acquire!(pool, T, dims...)` | `Array{T,N}` | 0 (hit) / 80-144 (miss) | FFI, type constraints |
+| `unsafe_acquire!(pool, T, dims...)` | `Array{T,N}` | 0 bytes (1.11+) / 0-144 bytes (1.10/CUDA) | FFI, type constraints |
 | `zeros!(pool, [T,] dims...)` | View type | 0 bytes | Zero-initialized |
 | `ones!(pool, [T,] dims...)` | View type | 0 bytes | One-initialized |
 | `similar!(pool, A)` | View type | 0 bytes | Match existing array |

diff --git a/docs/src/features/bit-arrays.md b/docs/src/features/bit-arrays.md
@@ -79,16 +79,14 @@ Operations like `count()`, `sum()`, and bitwise broadcasting are **10x~100x fast
 
 ### N-D Caching & Zero Allocation
 
-The pool uses an N-way associative cache to efficiently reuse `BitArray{N}` instances:
+The pool reuses `BitArray{N}` wrapper instances via `setfield!`-based in-place mutation (Julia 1.11+) or an N-way cache (Julia 1.10 / CUDA):
 
-| Scenario | Allocation |
-|----------|------------|
-| First call with new dims | ~944 bytes (new `BitArray{N}` created) |
-| Subsequent call with same dims | **0 bytes** (cached instance reused) |
-| Same ndims, different dims | **0 bytes** (dims/len fields modified in-place) |
-| Different ndims | ~944 bytes (new `BitArray{N}` created and cached) |
+| Scenario | Julia 1.11+ | Julia 1.10 / CUDA |
+|----------|-------------|-------------------|
+| First call with new (slot, N) | ~944 bytes (new `BitArray{N}`) | ~944 bytes |
+| Subsequent call, any dims | **0 bytes** (setfield! reuse) | **0 bytes** (same ndims) / ~944 bytes (different ndims) |
 
-Unlike regular `Array` where dimensions are immutable, `BitArray` allows in-place modification of its `dims` and `len` fields. The pool exploits this to achieve **zero allocation** on repeated calls with matching dimensionality.
+On Julia 1.11+, `BitArray` fields (`len`, `dims`, `chunks`) are mutated in-place via `setfield!`, achieving **zero allocation** on all repeated calls regardless of dimension pattern.
 
 ```julia
 @with_pool pool begin
@@ -98,12 +96,12 @@ Unlike regular `Array` where dimensions are immutable, `BitArray` allows in-plac
     # Rewind to reuse the same slot
     rewind!(pool)
 
-    # Same dims: 0 allocation (exact cache hit)
+    # Same dims: 0 allocation (cached wrapper reused)
     m2 = acquire!(pool, Bit, 100, 100)
 
     rewind!(pool)
 
-    # Different dims but same ndims: 0 allocation (dims modified in-place)
+    # Different dims but same ndims: 0 allocation (fields updated in-place)
     m3 = acquire!(pool, Bit, 50, 200)
 end
 ```
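
The in-place field update behind the "different dims, same ndims" case can be illustrated on a plain `BitMatrix`, independent of the pool. This is only a sketch that relies on `BitArray`'s internal fields (`chunks`, `len`, `dims`). It keeps the element count fixed so that `len` and `chunks` remain valid, and it is not the pool's actual code.

```julia
# Sketch relying on BitArray internals; not taken from AdaptiveArrayPools.
bits = falses(100, 100)      # BitMatrix with 10_000 elements
bits[3, 1] = true            # linear index 3

# Reshape in place: the element count is unchanged, so `chunks` and `len`
# stay consistent and only the `dims` field needs updating.
bits.dims = (50, 200)

new_size = size(bits)        # reflects the updated `dims` field
still_set = bits[3, 1]       # linear index 3 is unchanged by the reshape
```

When the element count changes, `len` and the `chunks` vector would also have to be updated together, which is the part the pool handles internally.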

diff --git a/docs/src/features/configuration.md b/docs/src/features/configuration.md
@@ -70,9 +70,13 @@ POOL_DEBUG[] = false  # Disable (default, production)
 
 When enabled, returning a pool-backed array from a `@with_pool` block will throw an error.
 
-## Compile-time: CACHE_WAYS
+## Compile-time: CACHE_WAYS (Julia 1.10 / CUDA only)
 
-Configure the N-way cache size for `unsafe_acquire!`. Higher values reduce cache eviction but increase memory per slot.
+Configure the N-way cache size for `unsafe_acquire!`. **On Julia 1.11+ CPU, this setting has no effect** — the `setfield!`-based wrapper reuse supports unlimited dimension patterns with zero allocation.
+
+This setting is relevant for:
+- **Julia 1.10** (legacy N-way cache path)
+- **CUDA backend** (N-way cache for `CuArray` wrappers)
 
 ```toml
 # LocalPreferences.toml
@@ -88,15 +92,13 @@ set_cache_ways!(8)
 # Restart Julia for changes to take effect
 ```
 
-**When to increase**: If your code alternates between more than 4 dimension patterns per pool slot, increase `cache_ways` to avoid cache eviction (~100 bytes header per miss).
-
-> **Scope**: `cache_ways` affects **all `unsafe_acquire!`** calls (including 1D). Only `acquire!` 1D uses simple 1:1 caching.
+**When to increase**: If your CUDA code or Julia 1.10 code alternates between more than 4 dimension patterns per pool slot, increase `cache_ways` to avoid cache eviction (~100 bytes header per miss).
 
 ## Summary
 
 | Setting | Scope | Restart? | Priority | Affects |
 |---------|-------|----------|----------|---------|
 | `use_pooling` | Compile-time | Yes | ⭐ Primary | All macros, `acquire!` behavior |
-| `cache_ways` | Compile-time | Yes | Advanced | `unsafe_acquire!` N-D caching |
+| `cache_ways` | Compile-time | Yes | Advanced | `unsafe_acquire!` N-D caching (Julia 1.10 / CUDA only) |
 | `MAYBE_POOLING_ENABLED` | Runtime | No | Optional | `@maybe_with_pool` only |
 | `POOL_DEBUG` | Runtime | No | Debug | Safety validation |
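
To see why alternating between more dimension patterns than `cache_ways` causes misses on the N-way path, here is a toy model of a single slot's cache with round-robin eviction. All names (`SlotCache`, `lookup!`) are hypothetical and the model is deliberately simplified: it omits the check that the backing pointer is still valid, so treat it as an illustration of the eviction behavior rather than the package's implementation.

```julia
# Toy model of a single slot's N-way cache; every name here is hypothetical.
mutable struct SlotCache
    dims::Vector{Any}     # cached dimension tuples (`nothing` = empty way)
    arrays::Vector{Any}   # cached wrappers, parallel to `dims`
    next_way::Int         # 1-based round-robin eviction cursor
end
SlotCache(ways::Int) =
    SlotCache(Any[nothing for _ in 1:ways], Any[nothing for _ in 1:ways], 1)

# Return `(array, status)`; `make(dims)` builds a wrapper on a miss.
function lookup!(make, cache::SlotCache, dims::Tuple)
    for k in eachindex(cache.dims)
        cache.dims[k] == dims && return (cache.arrays[k], :hit)
    end
    way = cache.next_way
    cache.next_way = way % length(cache.dims) + 1   # advance, wrapping to 1
    cache.dims[way] = dims
    cache.arrays[way] = make(dims)
    return (cache.arrays[way], :miss)
end

cache = SlotCache(2)                       # 2 ways to keep the trace short
patterns = [(2, 2), (4, 1), (2, 2), (1, 4), (2, 2)]
statuses = [lookup!(zeros, cache, d)[2] for d in patterns]
```

With two ways and three distinct patterns, the third pattern evicts the first, so the final lookup of `(2, 2)` misses again. Giving the cache at least as many ways as there are alternating patterns avoids this, which is the situation the "When to increase" advice describes.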