turbo-tasks: Reduce allocations on cache hits #92756
Stats from current PR: ✅ No significant changes detected; merging this PR will not alter performance (dev server, production build, and bundle size metrics are all within change thresholds). Tests passed.
Rename `MagicAny` to `DynTaskInput` or `RawTaskInputs` ... in another PR
Two changes to reduce heap allocations when calling turbo-tasks functions:

1. Move `persistent_task_type` propagation from `connect_child`/`IncreaseActiveCount` into `initialize_new_task`. This removes the need to thread `task_type` through operations on every call (hit or miss), and lets `connect_child` use `TaskDataCategory::Meta` instead of `All`.
2. Add a fast-path cache lookup (`try_native_call`/`native_call_if_consistent`) that checks the `task_cache` with borrowed args before boxing. The macro-generated code now tries this read-only lookup first for non-self function calls. On a cache hit (~85% of calls), no `Box<dyn MagicAny>` is allocated. On a miss, it falls back to the existing boxed path.

Co-Authored-By: Claude <noreply@anthropic.com>
- Replace redundant closures with `RawVc::TaskOutput` (clippy)
- Return `Err(0)` from `VcStorage::try_native_call` instead of `unreachable!()`, since the testing backend has no task cache
- Fall back to `dynamic_call` (not `native_call`) on cache miss, since `dynamic_call` is the universal entry point all backends implement

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract native_fn before Arc::new(task_type) to avoid an extra .clone()
in the Vacant arms of get_or_create_{persistent,transient}_task
- Add track_cache_miss_by_fn (mirrors track_cache_hit_by_fn)
- Remove explanatory comments about persistent_task_type eagerness
- Remove unused persistence() method instead of suppressing warning
Co-Authored-By: Claude <noreply@anthropic.com>
The `static_block` codegen for method calls (self/this pointer) now uses the same optimized path as free functions: args stay on the stack and we try a read-only cache lookup before boxing. For methods, we additionally check `this.is_resolved()` before taking the fast path, since unresolved self values need a resolution wrapper task.

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of expanding each macro callsite into two code paths (one for cache hit, one for miss), introduce a `StackArg` trait that keeps args on the caller's stack. The backend does a read-only cache lookup with a borrowed `&dyn MagicAny` reference; only on cache miss does `take_box()` move the value to the heap — zero clones, single code path per callsite.

Key changes:
- Add `StackArg` trait + `StackArgSlot<T>` (stack slot) + `OwnedArg` (boxed adapter)
- `dynamic_call`/`native_call` now take `&mut dyn StackArg` instead of `Box<dyn MagicAny>`
- `Backend::get_or_create_*_task` takes components (`native_fn`, `this`, `&mut dyn StackArg`) and does `raw_get` with the borrowed arg before materializing the `Box` on miss
- Remove `try_native_call`, `native_call_if_consistent`, `try_get_or_create_*`
- Macro `static_block` reduces to a single `dynamic_call` with `StackArgSlot`

Co-Authored-By: Claude <noreply@anthropic.com>
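The deferred-boxing idea above can be sketched in a few lines. This is a minimal illustration, not the actual turbo-tasks code: `MagicAny` is reduced here to `Any + Debug`, and the real trait carries additional methods (and was later renamed `StackMagicAny`).

```rust
use std::any::Any;
use std::fmt::Debug;

// Stand-in for turbo-tasks' `MagicAny`; reduced here to `Any + Debug`.
trait MagicAny: Any + Debug {}
impl<T: Any + Debug> MagicAny for T {}

// Args live in a stack slot; the cache lookup borrows them, and only a
// confirmed cache miss moves them to the heap via `take_box`.
trait StackArg {
    fn arg_ref(&self) -> &dyn MagicAny;          // borrow for hash/eq lookup
    fn take_box(&mut self) -> Box<dyn MagicAny>; // heap-allocate on miss only
}

struct StackArgSlot<T: MagicAny>(Option<T>);

impl<T: MagicAny> StackArg for StackArgSlot<T> {
    fn arg_ref(&self) -> &dyn MagicAny {
        self.0.as_ref().expect("argument already taken")
    }
    fn take_box(&mut self) -> Box<dyn MagicAny> {
        Box::new(self.0.take().expect("argument already taken"))
    }
}

fn main() {
    let mut slot = StackArgSlot(Some((42u32, "arg")));
    // Cache-hit path: borrow only, no heap allocation for the argument tuple.
    let _borrowed: &dyn MagicAny = slot.arg_ref();
    // Cache-miss path: the tuple is boxed exactly once.
    let boxed = slot.take_box();
    assert!(format!("{boxed:?}").contains("42"));
}
```

A callsite that hits the cache only ever touches `arg_ref`, so the argument tuple never leaves the stack.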
- Comment 4+5: Restore `persistence()` helper, use it in both `static_block` and `dynamic_block` to reduce diff from canary
- Comment 6: Make `trait_call` take `&mut dyn StackArg` too, so `dynamic_block` (trait dispatch) also uses `StackArgSlot` instead of `Box::new(inputs)` — deferred boxing on trait calls
- Comment 2: Merge `get_or_create_persistent_task` and `get_or_create_transient_task` into shared `get_or_create_task_inner` parameterized by `transient: bool`
- Comment 1: Construct `CachedTaskType` in the transient panic path so `panic_persistent_calling_transient` gets a real task description

Co-Authored-By: Claude <noreply@anthropic.com>
Replace the two-phase lookup (read-lock `raw_get` then write-lock `raw_entry` with re-hash) with a single `raw_entry_with_hash` call that takes the pre-computed hash and a heterogeneous eq closure. The map is sharded so write-lock contention is minimal, and this eliminates redundant hashing on the miss path.

Co-Authored-By: Claude <noreply@anthropic.com>
…ring backing storage read

The single `raw_entry_with_hash` approach held the dashmap write lock while calling `task_by_type` (backing storage). Restore the three-step flow: `raw_get` (read lock) -> `task_by_type` (no lock) -> `raw_entry_with_hash` (write lock), but now the write-lock step reuses the pre-computed hash instead of re-hashing.

Co-Authored-By: Claude <noreply@anthropic.com>
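The three-step flow can be illustrated with a plain `RwLock<HashMap>`. This is a simplified sketch: the real code uses a sharded dashmap whose raw entry API reuses the pre-computed hash, which `std`'s map cannot express, and `create` stands in for the backing-storage read.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Double-checked get-or-create:
//   Step 1: read lock  - fast-path lookup (the common cache-hit path)
//   Step 2: no lock    - expensive fallback (backing storage read)
//   Step 3: write lock - re-check, then insert if still vacant
fn get_or_create(
    cache: &RwLock<HashMap<String, u64>>,
    key: &str,
    mut create: impl FnMut() -> u64,
) -> u64 {
    // Step 1: read-only lookup under the read lock.
    if let Some(&id) = cache.read().unwrap().get(key) {
        return id;
    }
    // Step 2: consult the fallback with no lock held.
    let candidate = create();
    // Step 3: take the write lock and re-check; another thread may have
    // inserted the entry while we were in step 2.
    let mut map = cache.write().unwrap();
    *map.entry(key.to_string()).or_insert(candidate)
}

fn main() {
    let cache = RwLock::new(HashMap::new());
    let mut calls = 0;
    let first = get_or_create(&cache, "task", || { calls += 1; 7 });
    let second = get_or_create(&cache, "task", || { calls += 1; 8 });
    // The second call hits the cache: the fallback runs exactly once.
    assert_eq!((first, second, calls), (7, 7, 1));
}
```

The key property restored by this commit is that step 2 runs with no lock held, so a slow backing-storage read never blocks other writers on the shard.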
- Delete arc_or_owned.rs (no longer referenced after ArcOrOwned removal)
- Remove `or_insert_with`, `get_mut`, `into_mut`, `RefMut` and its `Deref`/`DerefMut` impls from dash_map_raw_entry (none used by current callers)
- `VacantEntry::insert` now returns `()` since no caller used `RefMut`
- Mark `panic_persistent_calling_transient` as `-> !` to make the divergence contract explicit

Co-Authored-By: Claude <noreply@anthropic.com>
…MagicAny

Address review comments:
- Rename `StackArg` -> `StackMagicAny`, `StackArgSlot` -> `StackMagicAnySlot`, `OwnedArg` -> `OwnedMagicAny`, `arg_ref` -> `as_ref`
- `FilterOwnedArgsFunctor` now takes `&mut dyn StackMagicAny` and returns `OwnedMagicAny`, so the caller doesn't manually `take_box` + rewrap

Co-Authored-By: Claude <noreply@anthropic.com>
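The "no manual `take_box` + rewrap" point relies on the slot handing out `&mut dyn Any`: a caller that knows the concrete argument type can downcast the erased slot and `take()` the value directly, with no intermediate `Box`. A minimal sketch with illustrative names (`downcast_take` is a hypothetical analogue of the real helper, not turbo-tasks code):

```rust
use std::any::Any;

// The slot exposes itself as `&mut dyn Any` so callers can recover the
// concrete type without boxing first.
trait StackSlot {
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

struct Slot<T: 'static>(Option<T>);

impl<T: 'static> StackSlot for Slot<T> {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// Hypothetical analogue of the rewrap-free path: downcast the erased slot
// to its concrete type and move the argument tuple out, no heap involved.
fn downcast_take<T: 'static>(slot: &mut dyn StackSlot) -> T {
    slot.as_any_mut()
        .downcast_mut::<Slot<T>>()
        .expect("argument type mismatch")
        .0
        .take()
        .expect("argument already taken")
}

fn main() {
    let mut slot = Slot(Some((1u8, "skipped", 3.0f32)));
    // Recover the tuple directly from the stack slot (e.g. to drop an
    // unused argument before re-wrapping the rest).
    let (a, _skipped, c): (u8, &str, f32) = downcast_take(&mut slot);
    assert_eq!((a, c), (1, 3.0));
}
```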
…eate_task

The `Backend` trait had two methods with identical signatures that only differed by transience. The caller just matched on persistence and dispatched. Merge into a single method that accepts `TaskPersistence`, eliminating the redundant trait surface.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…inline variable

- Thread 14: Restore comment explaining why read lock is used for Step 1
- Thread 16: Restore descriptive comments on backing storage path
- Thread 19: Remove `StackMagicAny` doc comment from `TurboTasksCallApi`
- Thread 20: Inline `parent_task` local variable in `native_call`

Co-Authored-By: Claude <noreply@anthropic.com>
When restoring a task from backing storage, reuse the existing Arc<CachedTaskType> from the stored persistent_task_type rather than creating a new Arc from the caller's boxed copy. This avoids having two copies of the same task_type in memory. Co-Authored-By: Claude <noreply@anthropic.com>
…ents

Verify that `hash_from_components` produces the same hash as the `Hash` impl on a fully constructed `CachedTaskType`, and that `eq_components` correctly matches/rejects on each component (`native_fn`, `this`, `arg`).

Co-Authored-By: Claude <noreply@anthropic.com>
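The invariant being tested can be sketched with a toy key type. This is illustrative only: the real `CachedTaskType` hashes a native function id, an optional `this` value, and a `dyn MagicAny` arg, but the property is the same, hashing borrowed components must match the derived `Hash` impl on the owned struct.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for CachedTaskType.
#[derive(Hash)]
struct TaskKey {
    native_fn: u32,    // stand-in for the function id
    this: Option<u64>, // stand-in for the resolved self value
    arg: String,       // stand-in for the boxed arg
}

// Hash a *candidate* key from borrowed components, without constructing
// the owned struct. Fields must be hashed in the same order as the
// derived impl on TaskKey, or cache lookups would silently miss.
fn hash_from_components(native_fn: u32, this: Option<u64>, arg: &str) -> u64 {
    let mut h = DefaultHasher::new();
    native_fn.hash(&mut h);
    this.hash(&mut h);
    arg.hash(&mut h);
    h.finish()
}

fn hash_owned(key: &TaskKey) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let key = TaskKey { native_fn: 3, this: Some(9), arg: "x".into() };
    // The borrowed-components hash must agree with the owned-struct hash.
    assert_eq!(hash_owned(&key), hash_from_components(3, Some(9), "x"));
}
```

This works because `String` and `str` share one `Hash` implementation, so hashing the borrowed `&str` matches hashing the owned field.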
Compute the shard index once from the hash and reuse it for both the read-only cache check and the subsequent write-lock entry lookup. Saves a few math operations and a pointer dereference on the miss path. Co-Authored-By: Claude <noreply@anthropic.com>
Guarantees same layout as Option<T>, making the type suitable for FFI-like patterns and ensuring no padding overhead. Co-Authored-By: Claude <noreply@anthropic.com>
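That layout guarantee comes from `#[repr(transparent)]`. A sketch under the assumption, consistent with this commit message, that the slot is a transparent newtype over `Option<T>`:

```rust
// A #[repr(transparent)] newtype is guaranteed the exact layout (size,
// alignment, ABI) of its single non-zero-sized field.
#[repr(transparent)]
struct Slot<T>(Option<T>);

fn main() {
    // Same size and alignment as the wrapped Option<T>.
    assert_eq!(std::mem::size_of::<Slot<u64>>(), std::mem::size_of::<Option<u64>>());
    assert_eq!(std::mem::align_of::<Slot<u64>>(), std::mem::align_of::<Option<u64>>());
    // The niche optimization carries through: Option<Box<T>> is
    // pointer-sized, so the transparent wrapper is too; no padding overhead.
    assert_eq!(std::mem::size_of::<Slot<Box<u32>>>(), std::mem::size_of::<usize>());
}
```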
Enable dashmap's raw-api feature to access shard internals directly. get_shard() returns a reference to the shard itself, which is reused across both read-only and write-lock lookups, eliminating redundant shard index computation. Co-Authored-By: Claude <noreply@anthropic.com>
Rework task_by_type and lookup_task_candidates to accept exploded components (native_fn, this, &dyn MagicAny) instead of &CachedTaskType. This allows the backing storage lookup to happen using borrowed references from the stack — the Box<dyn MagicAny> allocation for the arg is now deferred until both the in-memory cache AND backing storage have confirmed a miss. Co-Authored-By: Claude <noreply@anthropic.com>
- Restore "another thread beat us" comment in Occupied race path
- Restore "Initialize storage BEFORE making task_id visible" ordering invariant
- Restore "insert() consumes e, releasing the shard write lock"
- Fix stale connect_child.rs comments about removed task type update
- Restore "stay Meta not All" performance rationale in aggregation_update
- Improve error message in kv_backing_storage to include `this` parameter

Co-Authored-By: Claude <noreply@anthropic.com>
8b69577 to 59875f4
mmastrac left a comment:
LGTM. I suspect that we may actually be able to remove the virtual methods on StackMagicAny in a followup, assuming that Rust isn't smart enough to devirtualize them itself with some additional tricks.
This is the "explicitly capture layout information and vtables" idea?
I missed that Rust already captures layout inside the vtable, so if you 1) have a … You'd have something like this (hand-wavey):
What?
Reduce heap allocations when turbo-tasks functions get cache hits (~85% of calls).
Why?

Every turbo-tasks function call (generated by `#[turbo_tasks::function]`) was boxing its arguments into `Box<dyn MagicAny>` before looking up the task cache. This allocation is wasted on cache hits, which are the overwhelmingly common case.

How?
Deferred boxing via a `StackMagicAny` trait object:

Introduce a `StackMagicAny` trait that abstracts over a stack-resident `Option<T>`:

- `as_ref(&self) -> &dyn MagicAny` — borrow the argument for hash/equality (cache lookup)
- `take_box(&mut self) -> Box<dyn MagicAny>` — move the value to the heap (zero clones)
- `as_any_mut(&mut self) -> &mut dyn Any` — downcast to concrete type without boxing

The data flow:
- The callsite creates `StackMagicAnySlot::new((args...))` on the stack and calls `dynamic_call(..., &mut arg)`
- `dynamic_call` checks resolution via `arg.as_ref()`, routes to `native_call` (resolved) or boxes via `arg.take_box()` for `LocalTaskSpec` (unresolved)
- `get_or_create_task_inner` does a read-only `raw_get` lookup using `hash_from_components` + `eq_components` with the borrowed `&dyn MagicAny`. On cache hit (~85%), returns immediately — zero heap allocation. On cache miss, re-checks under write lock using the same borrowed reference, and only calls `arg.take_box()` in the vacant-entry case (true cache miss).

Boxing is now deferred past all of these:

- `filter_owned` is now `Option<FilterOwnedArgsFunctor>`; when `None` (the common case where all args are used), the original `&mut dyn StackMagicAny` passes straight through to `dynamic_call` without boxing

Optimized `filter_owned` for traits:

When trait methods do need argument filtering (unused `_`-prefixed parameters), the old path did `take_box()` → `downcast_args_owned()` → dereference → repack. This is an unnecessary heap round-trip. The new `downcast_stack_args_owned()` function uses `as_any_mut()` to downcast directly to `&mut StackMagicAnySlot<T>` and calls `take()` on the inner `Option`, skipping the intermediate `Box` entirely.

Additional changes:

- `Backend::get_or_create_*_task` now takes decomposed parameters (`native_fn`, `this`, `&mut dyn StackMagicAny`) instead of a pre-constructed `CachedTaskType`
- Persistent/transient task creation merged into `get_or_create_task_inner(transient: bool)`
- `connect_child` uses the eagerly-set `persistent_task_type` from `initialize_new_task`
- `OwnedMagicAny` adapter wraps already-boxed args (from async resolution tasks) to fit the `StackMagicAny` interface
- `dynamic_call` and `trait_call` take `&mut dyn StackMagicAny` (trait dispatch also benefits)
- `CachedTaskType::hash_encode` now delegates to `hash_encode_components` (deduplicated)
- Removed `try_native_call`, `native_call_if_consistent`, `try_get_or_create_*` — the deferred boxing approach subsumes these

Binary size
Binary size is neutral (linux-x86_64, `--release`, stripped + gzipped: 30.9 MB on both canary and this branch).

Overhead benchmark (turbo-tasks-backend, median, lower is better)
Measured on an isolated Firecracker microVM (linux-x86_64). Variance is nontrivial on this environment, but the direction is consistently positive across all turbo-tasks benchmarks.
Benchmarks: turbo-cached-same-keys/{1,10,100,1000}, turbo-cached-different-keys/{1,10,100,1000}, turbo-uncached/{1,10,100,1000}, turbo-uncached-parallel/{1,10,100,1000}