Merged
1 change: 1 addition & 0 deletions .codecov.yml
@@ -10,6 +10,7 @@ coverage:

ignore:
- "ext/**/*"
- "src/legacy/**/*"

comment:
layout: "reach,diff,flags,files"
8 changes: 4 additions & 4 deletions Project.toml
@@ -1,7 +1,7 @@
name = "AdaptiveArrayPools"
uuid = "4f381ef7-9af0-4cbe-99d4-cf36d7b0f233"
version = "0.2.1"
authors = ["Min-Gu Yoo <mgyoo86@gmail.com>"]
version = "0.2.1"

[deps]
Preferences = "21216c6a-2e73-6563-6e65-726566657250"
@@ -14,7 +14,7 @@ CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
AdaptiveArrayPoolsCUDAExt = "CUDA"

[compat]
julia = "1.10"
Preferences = "1"
CUDA = "5"
Printf = "1"
Preferences = "1"
Printf = "1"
julia = "1.10"
9 changes: 5 additions & 4 deletions docs/design/cuda_extension_design.md
@@ -1,9 +1,10 @@
# AdaptiveArrayPools.jl CUDA Extension Design

> **Status**: Draft v0.6 (Post-Review Revision)
> **Version**: 0.6
> **Date**: 2024-12-14
> **Authors**: Design discussion with AI assistance
> **Update (v0.2.2, feat/new_array_nd)**: The CPU path now uses `setfield!`-based wrapper
> reuse (Julia 1.11+) instead of the N-way cache for `unsafe_acquire!`. The **CUDA extension
> still uses the N-way set-associative cache** described in this document, since `CuArray`
> does not support `setfield!`-based field mutation. `CACHE_WAYS` and `set_cache_ways!` are
> now only relevant for the CUDA backend (and Julia 1.10 legacy CPU path).

## 1. Executive Summary

8 changes: 8 additions & 0 deletions docs/design/hybrid_api_design.md
@@ -1,5 +1,13 @@
# Hybrid API Design: acquire! vs unsafe_acquire!

> **Update (v0.2.2, feat/new_array_nd)**: The `unsafe_acquire!` path no longer uses
> `unsafe_wrap` + N-way cache on Julia 1.11+ CPU. Instead, it uses `setfield!`-based
> wrapper reuse — **0-alloc for any number of dimension patterns** (no eviction limit).
> The N-way cache (`CACHE_WAYS`) is now only used by the **CUDA** backend and the
> **Julia 1.10 legacy** fallback. The `acquire!` → `ReshapedArray` path is unchanged.
> `TypedPool` fields changed: `nd_arrays`/`nd_dims`/`nd_ptrs`/`nd_next_way` →
> `nd_wrappers::Vector{Union{Nothing, Vector{Any}}}`.

## Executive Summary

Redesigning `AdaptiveArrayPools.jl`'s N-D array acquisition API with a **Two Tools Strategy**:
7 changes: 7 additions & 0 deletions docs/design/nd_array_approach_comparison.md
@@ -1,5 +1,12 @@
# N-D Array Approach Comparison: unsafe_wrap vs ReshapedArray

> **Update (v0.2.2, feat/new_array_nd)**: The N-way set-associative cache described in this
> document has been **superseded on Julia 1.11+ CPU** by `setfield!`-based wrapper reuse
> (`nd_wrappers` indexed by dimensionality N). This achieves **0-alloc for unlimited dimension
> patterns** — no eviction, no `CACHE_WAYS` limit. The N-way cache remains in use for
> **CUDA** and the **Julia 1.10 legacy** path. The `acquire!` → `ReshapedArray` path is
> unchanged. See `src/acquire.jl` and `src/types.jl` for the current implementation.

## Summary

This document analyzes two approaches for returning N-dimensional arrays from AdaptiveArrayPools:
4 changes: 2 additions & 2 deletions docs/src/architecture/design-docs.md
@@ -10,7 +10,7 @@ For in-depth analysis of design decisions, implementation tradeoffs, and archite
## Caching & Performance

- **[nd_array_approach_comparison.md](https://github.com/ProjectTorreyPines/AdaptiveArrayPools.jl/blob/master/docs/design/nd_array_approach_comparison.md)**
N-way cache design, boxing analysis, and ReshapedArray benchmarks
N-way cache design (now legacy — replaced by `setfield!` reuse on Julia 1.11+ CPU), boxing analysis, and ReshapedArray benchmarks

- **[fixed_slots_codegen_design.md](https://github.com/ProjectTorreyPines/AdaptiveArrayPools.jl/blob/master/docs/design/fixed_slots_codegen_design.md)**
Zero-allocation iteration via `@generated` functions and fixed-slot type dispatch
@@ -32,7 +32,7 @@ For in-depth analysis of design decisions, implementation tradeoffs, and archite
| Document | Focus Area | Key Insights |
|----------|------------|--------------|
| hybrid_api_design | API strategy | View types for zero-alloc, Array for FFI |
| nd_array_approach_comparison | Caching | N-way associative cache reduces header allocation |
| nd_array_approach_comparison | Caching | N-way cache (legacy); setfield! reuse on Julia 1.11+ CPU |
| fixed_slots_codegen_design | Codegen | @generated functions enable type-stable iteration |
| untracked_acquire_design | Macro safety | Sentinel pattern ensures correct cleanup |
| cuda_extension_design | GPU support | Seamless CPU/CUDA API parity |
73 changes: 41 additions & 32 deletions docs/src/architecture/how-it-works.md
@@ -82,57 +82,66 @@ end

When you call `acquire!(pool, Float64, n)`, the compiler inlines directly to `pool.float64` — no dictionary lookup, no type instability.

## N-Way Set Associative Cache
## N-D Wrapper Reuse (CPU)

For `unsafe_acquire!` (which returns native `Array` types), we use an N-way cache to reduce header allocation:
For `unsafe_acquire!` (which returns native `Array` types), the caching strategy depends on the Julia version:

### Julia 1.11+: `setfield!`-based Wrapper Reuse (Zero-Allocation)

Julia 1.11 changed `Array` from an opaque C struct to a mutable Julia struct with `ref::MemoryRef{T}` and `size::NTuple{N,Int}` fields. This enables in-place mutation of cached `Array` wrappers via `setfield!`:

```
CACHE_WAYS = 4 (default)
┌────┬────┬────┬────┐
Slot 0 (Float64): │way0│way1│way2│way3│ ← round-robin eviction
└────┴────┴────┴────┘
┌────┬────┬────┬────┐
Slot 1 (Float32): │way0│way1│way2│way3│
└────┴────┴────┴────┘
...
nd_wrappers[N][slot] → cached Array{T,N}
├─ setfield!(:ref, new_memory_ref) ← update backing memory (0-alloc)
└─ setfield!(:size, new_dims) ← update dimensions (0-alloc)
```

### Cache Lookup Pseudocode
**Result**: Unlimited dimension patterns per slot with **zero allocation** after warmup. No eviction, no round-robin, no `CACHE_WAYS` limit.
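The wrapper-reuse idea can be tried in isolation. The following is a minimal sketch, not the package's implementation: `retarget!` is a hypothetical helper name, and it relies on the Julia 1.11+ internal `ref` and `size` fields of `Array` described above. On older Julia versions it falls back to `unsafe_wrap`, which allocates a new header.

```julia
# Illustrative only: `retarget!` is a hypothetical helper, not a package function.
# It depends on Julia 1.11+ internals (`Array` having `ref` and `size` fields).
function retarget!(wrapper::Array{T,N}, backing::Vector{T}, dims::NTuple{N,Int}) where {T,N}
    prod(dims) <= length(backing) || throw(ArgumentError("backing vector too small"))
    @static if VERSION >= v"1.11"
        setfield!(wrapper, :ref, getfield(backing, :ref))  # point at the backing memory
        setfield!(wrapper, :size, dims)                    # new shape, same wrapper object
        return wrapper
    else
        # Before 1.11 the fields are not accessible, so a new header is created.
        return unsafe_wrap(Array, pointer(backing), dims)
    end
end

backing = collect(1.0:12.0)
wrapper = Matrix{Float64}(undef, 0, 0)   # created once, then reused for every 2-D request
a = retarget!(wrapper, backing, (3, 4))
b = retarget!(wrapper, backing, (2, 6))  # on 1.11+, `a` and `b` are the same object
```

Because the wrapper shares memory with `backing`, a wrapper retargeted this way should be refreshed again after the backing vector is resized, since a resize may move the data to new memory.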

```julia
# Pseudocode for Julia 1.11+ path
function unsafe_acquire!(pool, T, dims...)
typed_pool = get_typed_pool!(pool, T)
slot = n_active + 1
base = (slot - 1) * CACHE_WAYS

# Search all ways for matching dimensions
for k in 1:CACHE_WAYS
idx = base + k
if dims == typed_pool.nd_dims[idx]
# Cache hit! Check if underlying vector was resized
if pointer matches
return typed_pool.nd_arrays[idx]
end
end
flat_view = get_view!(typed_pool, prod(dims))
slot = typed_pool.n_active

# Direct index lookup by dimensionality N (~1ns)
wrapper = typed_pool.nd_wrappers[N][slot]
if wrapper !== nothing
setfield!(wrapper, :ref, getfield(vec, :ref)) # 0-alloc
setfield!(wrapper, :size, dims) # 0-alloc
return wrapper
end

# Cache miss: create new Array header, store in next way (round-robin)
way = typed_pool.nd_next_way[slot]
typed_pool.nd_next_way[slot] = (way + 1) % CACHE_WAYS
# ... create and cache Array ...
# First call for this (slot, N): unsafe_wrap once, cached forever
arr = wrap_array(typed_pool, flat_view, dims)
store_wrapper!(typed_pool, N, slot, arr)
return arr
end
```

**Key insight**: Even on cache miss, only the `Array` header (~80-144 bytes) is allocated. The actual data memory is always reused from the pool.
### Julia 1.10 (Legacy): N-Way Set Associative Cache

On Julia 1.10, `Array` fields cannot be mutated, so the legacy path uses a 4-way set-associative cache with round-robin eviction:

- Cache hit (≤`CACHE_WAYS` dimension patterns per slot): **0 bytes**
- Cache miss (>`CACHE_WAYS` patterns): **~80-144 bytes** per `unsafe_wrap` call

See [Configuration](../features/configuration.md) for `CACHE_WAYS` tuning (Julia 1.10 / CUDA only).

### CUDA: N-Way Cache

The CUDA backend still uses the N-way set-associative cache (same as Julia 1.10 legacy), since `CuArray` does not support `setfield!`-based mutation.
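To make the hit and miss behavior of the N-way cache concrete, here is a toy model of a single slot with round-robin eviction. It is a sketch under assumptions: `ToySlotCache`, `lookup!`, and `WAYS` are hypothetical names, the real `TypedPool` layout differs, and the pointer check that detects a resized backing vector is omitted for brevity.

```julia
# Toy model of one pool slot; names are hypothetical, not the package's types.
const WAYS = 4   # stands in for CACHE_WAYS

mutable struct ToySlotCache
    dims::Vector{Any}     # cached dimension tuples (`nothing` marks an empty way)
    arrays::Vector{Any}   # cached Array wrappers
    next_way::Int         # next way to overwrite (1-based, round-robin)
    misses::Int           # counts wrapper creations
end

ToySlotCache() = ToySlotCache(Any[nothing for _ in 1:WAYS], Any[nothing for _ in 1:WAYS], 1, 0)

function lookup!(cache::ToySlotCache, backing::Vector{Float64}, dims::Tuple)
    # Hit: some way already holds a wrapper with exactly these dimensions.
    for k in 1:WAYS
        cache.dims[k] == dims && return cache.arrays[k]
    end
    # Miss: create a new wrapper over the same memory and evict round-robin.
    cache.misses += 1
    arr = unsafe_wrap(Array, pointer(backing), dims)
    way = cache.next_way
    cache.dims[way] = dims
    cache.arrays[way] = arr
    cache.next_way = way % WAYS + 1
    return arr
end

backing = zeros(24)   # must stay alive while the wrappers are in use
cache = ToySlotCache()
for _ in 1:2, d in ((24,), (2, 12), (3, 8), (4, 6))
    lookup!(cache, backing, d)   # second pass over the four shapes hits every time
end
```

With four ways, four alternating shapes settle into hits after the first pass. A fifth shape starts evicting, which is the situation the `CACHE_WAYS` setting is meant to address.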

## View vs Array Return Types

Type stability is critical for performance. AdaptiveArrayPools provides two APIs:

| API | 1D Return | N-D Return | Allocation |
|-----|-----------|------------|------------|
| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes |
| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (hit) / ~100 bytes (miss) |
| API | 1D Return | N-D Return | Allocation (Julia 1.11+) | Allocation (Julia 1.10 / CUDA) |
|-----|-----------|------------|--------------------------|-------------------------------|
| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes | Always 0 bytes |
| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (setfield! reuse) | 0 bytes (hit) / ~100 bytes (miss) |
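The view types in the table can be reproduced with `Base` alone. This sketch assumes, for illustration, that `acquire!` builds its results the same way (a `view` of a backing vector, reshaped for N-D); the variable names are hypothetical and the package's internal construction may differ.

```julia
# Base-only illustration of the view types named in the table above.
backing = Vector{Float64}(undef, 64)   # stands in for pooled memory

flat = view(backing, 1:12)   # 1-D view over the first 12 elements
nd = reshape(flat, 3, 4)     # N-D view over the same elements, no data copy

nd[1, 1] = 42.0              # writes through to the backing vector
```

Both objects are thin wrappers over `backing`, so writing through `nd` changes `backing[1]`.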

!!! note "`Bit` type behavior"
For `T === Bit`, both `acquire!` and `unsafe_acquire!` return native `BitVector` / `BitArray{N}` (not views). Cache hit achieves 0 bytes allocation.
68 changes: 25 additions & 43 deletions docs/src/architecture/type-dispatch.md
@@ -21,57 +21,39 @@ end

When you call `acquire!(pool, Float64, n)`, the compiler inlines directly to `pool.float64` - no dictionary lookup, no type instability.

## N-Way Set Associative Cache
## N-D Wrapper Caching for `unsafe_acquire!`

For `unsafe_acquire!` (which returns native `Array` types), we use an N-way cache to reduce header allocation:
`unsafe_acquire!` returns native `Array` types. The caching strategy depends on the Julia version:

```
CACHE_WAYS = 4 (default)
+----+----+----+----+
Slot 0 (Float64): |way0|way1|way2|way3| <-- round-robin eviction
+----+----+----+----+
+----+----+----+----+
Slot 1 (Float32): |way0|way1|way2|way3|
+----+----+----+----+
...
```
### Julia 1.11+: `setfield!`-based Wrapper Reuse

### Cache Lookup Logic
Julia 1.11 made `Array` a mutable struct, enabling in-place field mutation:

```julia
function unsafe_acquire!(pool, T, dims...)
typed_pool = get_typed_pool!(pool, T)
slot = n_active + 1
base = (slot - 1) * CACHE_WAYS

# Search all ways for matching dimensions
for k in 1:CACHE_WAYS
idx = base + k
if dims == typed_pool.nd_dims[idx]
# Cache hit! Check if underlying vector was resized
if pointer matches
return typed_pool.nd_arrays[idx]
end
end
end

# Cache miss: create new Array header, store in next way (round-robin)
way = typed_pool.nd_next_way[slot]
typed_pool.nd_next_way[slot] = (way % CACHE_WAYS) + 1
# ... create and cache Array ...
end
# Cached wrapper reuse via setfield! (0-alloc)
setfield!(cached_arr, :ref, new_memory_ref) # update backing memory
setfield!(cached_arr, :size, new_dims) # update dimensions
```

**Key insight**: Even on cache miss, only the `Array` header (~80-144 bytes) is allocated. The actual data memory is always reused from the pool.
Wrappers are stored in `nd_wrappers::Vector{Union{Nothing, Vector{Any}}}`, indexed directly by dimensionality N (~1ns lookup). **Unlimited dimension patterns per slot, zero allocation after warmup.**
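A two-level lookup of this shape could look like the sketch below. Only the field type `Vector{Union{Nothing, Vector{Any}}}` comes from the text above; the helper names `get_wrapper` and `store_wrapper!` and the grow-on-demand behavior are assumptions made for illustration.

```julia
# Hypothetical helpers over a container shaped like `nd_wrappers`.
# The outer index is the dimensionality N, the inner index is the slot.
function store_wrapper!(nd_wrappers::Vector{Union{Nothing, Vector{Any}}}, N::Int, slot::Int, arr)
    while length(nd_wrappers) < N        # grow the outer vector up to N
        push!(nd_wrappers, nothing)
    end
    per_slot = nd_wrappers[N]
    if per_slot === nothing              # first wrapper of this dimensionality
        per_slot = Any[]
        nd_wrappers[N] = per_slot
    end
    while length(per_slot) < slot        # grow the inner vector up to slot
        push!(per_slot, nothing)
    end
    per_slot[slot] = arr
    return arr
end

function get_wrapper(nd_wrappers::Vector{Union{Nothing, Vector{Any}}}, N::Int, slot::Int)
    N <= length(nd_wrappers) || return nothing
    per_slot = nd_wrappers[N]
    per_slot === nothing && return nothing
    slot <= length(per_slot) || return nothing
    return per_slot[slot]                # either a cached wrapper or `nothing`
end

wrappers = Union{Nothing, Vector{Any}}[]
A = zeros(2, 3)
store_wrapper!(wrappers, 2, 3, A)        # cache a 2-D wrapper for slot 3
```

Since the key is only `(N, slot)`, a cached wrapper can serve every shape of that dimensionality once its fields are updated in place, which is why no eviction policy is needed on this path.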

### Julia 1.10 / CUDA: N-Way Set Associative Cache

On Julia 1.10 (CPU) and CUDA, `Array`/`CuArray` fields cannot be mutated. These paths use a 4-way set-associative cache with round-robin eviction (`CACHE_WAYS = 4` default):

- **Cache hit** (≤4 dim patterns per slot): 0 bytes
- **Cache miss** (>4 patterns): ~80-144 bytes for Array header allocation

See [Configuration](../features/configuration.md) for `CACHE_WAYS` tuning.

---

## View vs Array: When to Use What?

| API | Return Type | Allocation | Recommended For |
|-----|-------------|------------|-----------------|
| `acquire!` | `SubArray` / `ReshapedArray` | **Always 0 bytes** | 99% of cases |
| `unsafe_acquire!` | `Vector` / `Array` | 0-144 bytes | FFI, type constraints |
| API | Return Type | Allocation (Julia 1.11+) | Allocation (1.10 / CUDA) | Recommended For |
|-----|-------------|--------------------------|--------------------------|-----------------|
| `acquire!` | `SubArray` / `ReshapedArray` | **Always 0 bytes** | **Always 0 bytes** | 99% of cases |
| `unsafe_acquire!` | `Vector` / `Array` | **0 bytes** (setfield! reuse) | 0-144 bytes (N-way cache) | FFI, type constraints |

### Why View is the Default

@@ -116,15 +98,15 @@ end

| Operation | acquire! (View) | unsafe_acquire! (Array) |
|-----------|-----------------|-------------------------|
| Allocation (cached) | 0 bytes | 0 bytes |
| Allocation (miss) | 0 bytes | 80-144 bytes |
| Allocation (Julia 1.11+) | 0 bytes | 0 bytes (setfield! reuse) |
| Allocation (Julia 1.10 / CUDA) | 0 bytes | 0 bytes (hit) / 80-144 bytes (miss) |
| BLAS operations | Identical | Identical |
| Type stability | Guaranteed | Guaranteed |
| FFI compatibility | Requires conversion | Direct |

### Header Size by Dimensionality
### Header Size by Dimensionality (Julia 1.10 / CUDA only)

When `unsafe_acquire!` has a cache miss:
On Julia 1.11+ CPU, `unsafe_acquire!` is always zero-allocation via `setfield!` reuse. On Julia 1.10 and CUDA, a cache miss allocates an `Array` header:

| Dimensions | Header Size |
|------------|-------------|
6 changes: 3 additions & 3 deletions docs/src/basics/api-essentials.md
@@ -21,7 +21,7 @@ end

### `unsafe_acquire!(pool, T, dims...)`

Returns a native `Array` type. **Zero-allocation on cache hit**—only allocates a small header (~80-144 bytes) on cache miss. Use when you specifically need `Array{T,N}`:
Returns a native `Array` type. On **Julia 1.11+**, always **zero-allocation** via `setfield!`-based wrapper reuse (unlimited dimension patterns). On Julia 1.10 and CUDA, zero-allocation on cache hit with a small header (~80-144 bytes) on cache miss. Use when you specifically need `Array{T,N}`:

```julia
@with_pool pool begin
@@ -36,7 +36,7 @@ end
```

!!! tip "Cache behavior"
Same dimension pattern → **0 bytes**. Different pattern → 80-144 bytes header only (data memory always reused). See [N-Way Cache](../architecture/type-dispatch.md#n-way-set-associative-cache) for details.
On Julia 1.11+: **always 0 bytes** regardless of dimension pattern (setfield!-based reuse). On Julia 1.10 / CUDA: same dimension pattern → 0 bytes, different pattern → 80-144 bytes header only (data always reused). See [N-D Wrapper Caching](../architecture/type-dispatch.md#n-d-wrapper-caching-for-unsafe_acquire) for details.

!!! note "`Bit` behavior"
For `T === Bit`, `unsafe_acquire!` is equivalent to `acquire!` and returns native `BitVector`/`BitArray{N}`.
@@ -113,7 +113,7 @@ end
| Function | Returns | Allocation | Use Case |
|----------|---------|------------|----------|
| `acquire!(pool, T, dims...)` | View type | 0 bytes | Default choice |
| `unsafe_acquire!(pool, T, dims...)` | `Array{T,N}` | 0 (hit) / 80-144 (miss) | FFI, type constraints |
| `unsafe_acquire!(pool, T, dims...)` | `Array{T,N}` | 0 bytes (1.11+) / 0-144 (1.10/CUDA) | FFI, type constraints |
| `zeros!(pool, [T,] dims...)` | View type | 0 bytes | Zero-initialized |
| `ones!(pool, [T,] dims...)` | View type | 0 bytes | One-initialized |
| `similar!(pool, A)` | View type | 0 bytes | Match existing array |
18 changes: 8 additions & 10 deletions docs/src/features/bit-arrays.md
Expand Up @@ -79,16 +79,14 @@ Operations like `count()`, `sum()`, and bitwise broadcasting are **10x~100x fast

### N-D Caching & Zero Allocation

The pool uses an N-way associative cache to efficiently reuse `BitArray{N}` instances:
The pool reuses `BitArray{N}` wrapper instances via `setfield!`-based in-place mutation (Julia 1.11+) or an N-way cache (Julia 1.10 / CUDA):

| Scenario | Allocation |
|----------|------------|
| First call with new dims | ~944 bytes (new `BitArray{N}` created) |
| Subsequent call with same dims | **0 bytes** (cached instance reused) |
| Same ndims, different dims | **0 bytes** (dims/len fields modified in-place) |
| Different ndims | ~944 bytes (new `BitArray{N}` created and cached) |
| Scenario | Julia 1.11+ | Julia 1.10 / CUDA |
|----------|-------------|-------------------|
| First call with new (slot, N) | ~944 bytes (new `BitArray{N}`) | ~944 bytes |
| Subsequent call, any dims | **0 bytes** (setfield! reuse) | **0 bytes** (same ndims) / ~944 bytes (different ndims) |

Unlike regular `Array` where dimensions are immutable, `BitArray` allows in-place modification of its `dims` and `len` fields. The pool exploits this to achieve **zero allocation** on repeated calls with matching dimensionality.
On Julia 1.11+, `BitArray` fields (`len`, `dims`, `chunks`) are mutated in-place via `setfield!`, achieving **zero allocation** on all repeated calls regardless of dimension pattern.

```julia
@with_pool pool begin
@@ -98,12 +96,12 @@ Unlike regular `Array` where dimensions are immutable, `BitArray` allows in-plac
# Rewind to reuse the same slot
rewind!(pool)

# Same dims: 0 allocation (exact cache hit)
# Same dims: 0 allocation (cached wrapper reused)
m2 = acquire!(pool, Bit, 100, 100)

rewind!(pool)

# Different dims but same ndims: 0 allocation (dims modified in-place)
# Different dims but same ndims: 0 allocation (fields updated in-place)
m3 = acquire!(pool, Bit, 50, 200)
end
```
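The in-place field update can be seen on a plain `BitArray` without the pool. This is a sketch of the mechanism only: it touches the internal `dims` field of `BitArray`, which is not public API, and the pool's actual update (which also handles `len` and `chunks`) may differ.

```julia
# Base-only illustration: reshape an existing BitMatrix in place.
bits = falses(100, 100)
bits[1, 1] = true

# Same element count (10_000), different shape; no new BitArray is created.
setfield!(bits, :dims, (50, 200))
```

The update is only meaningful when the new dimensions describe the same number of elements as `len`; otherwise `len` would have to be updated as well.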
14 changes: 8 additions & 6 deletions docs/src/features/configuration.md
@@ -70,9 +70,13 @@ POOL_DEBUG[] = false # Disable (default, production)

When enabled, returning a pool-backed array from a `@with_pool` block will throw an error.

## Compile-time: CACHE_WAYS
## Compile-time: CACHE_WAYS (Julia 1.10 / CUDA only)

Configure the N-way cache size for `unsafe_acquire!`. Higher values reduce cache eviction but increase memory per slot.
Configure the N-way cache size for `unsafe_acquire!`. **On Julia 1.11+ CPU, this setting has no effect** — the `setfield!`-based wrapper reuse supports unlimited dimension patterns with zero allocation.

This setting is relevant for:
- **Julia 1.10** (legacy N-way cache path)
- **CUDA backend** (N-way cache for `CuArray` wrappers)

```toml
# LocalPreferences.toml
@@ -88,15 +92,13 @@ set_cache_ways!(8)
# Restart Julia for changes to take effect
```

**When to increase**: If your code alternates between more than 4 dimension patterns per pool slot, increase `cache_ways` to avoid cache eviction (~100 bytes header per miss).

> **Scope**: `cache_ways` affects **all `unsafe_acquire!`** calls (including 1D). Only `acquire!` 1D uses simple 1:1 caching.
**When to increase**: If your CUDA code or Julia 1.10 code alternates between more than 4 dimension patterns per pool slot, increase `cache_ways` to avoid cache eviction (~100 bytes header per miss).

## Summary

| Setting | Scope | Restart? | Priority | Affects |
|---------|-------|----------|----------|---------|
| `use_pooling` | Compile-time | Yes | ⭐ Primary | All macros, `acquire!` behavior |
| `cache_ways` | Compile-time | Yes | Advanced | `unsafe_acquire!` N-D caching |
| `cache_ways` | Compile-time | Yes | Advanced | `unsafe_acquire!` N-D caching (Julia 1.10 / CUDA only) |
| `MAYBE_POOLING_ENABLED` | Runtime | No | Optional | `@maybe_with_pool` only |
| `POOL_DEBUG` | Runtime | No | Debug | Safety validation |