perf: specialize instantiate for the n=1 case #13385

Closed

Kha wants to merge 10 commits into leanprover:master
Conversation
This PR turns `replace_rec_fn` into a class template parameterized on the functor type and exposes it from `kernel/replace_fn.h`, alongside templated `replace<F>(...)` overloads that monomorphize on the functor at the call site. Hot callers like `instantiate`, `instantiate_rev`, and the C-exported `lean_expr_instantiate*_core` variants in `kernel/instantiate.cpp` all pass their substitution closure as a stateless lambda; previously each call into the closure went through `std::function`'s indirect dispatch on every recursion step, blocking inlining of the loose-bvar-range early return. With this change the closure body inlines into `replace_rec_fn::apply` and the `std::function` indirection is gone.

The legacy `replace(expr, std::function<...>, bool)` overload is preserved as a thin out-of-line wrapper that instantiates the template once with `std::function` as the functor type, so external callers continue to work unchanged. All existing internal callers benefit automatically since they were already passing lambdas directly; no caller-side changes were required.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.50 G` instructions (~3.5%) and `1.74s -> 1.69s` wall-clock (~3%).
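The dispatch change can be sketched on a toy expression type. `Expr`, `ReplaceRecFn`, and the functor signature below are illustrative stand-ins, not the kernel's actual definitions; only the shape of the templated-vs-`std::function` split matches the PR:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <type_traits>
#include <utility>

// Toy stand-in for the kernel expression type: a leaf holds a bound-variable
// index, an inner node applies its two children.
struct Expr {
    int bvar;                        // >= 0 for a bvar leaf, -1 for an app node
    std::shared_ptr<Expr> lhs, rhs;  // children of an app node
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr mk_bvar(int i)               { return std::make_shared<Expr>(Expr{i, nullptr, nullptr}); }
ExprPtr mk_app(ExprPtr l, ExprPtr r) { return std::make_shared<Expr>(Expr{-1, std::move(l), std::move(r)}); }

// Class template parameterized on the functor type, in the spirit of the
// templated replace_rec_fn: F is fixed at compile time, so the functor body
// can inline into apply() with no std::function indirection per node.
template <typename F>
struct ReplaceRecFn {
    F m_f;
    explicit ReplaceRecFn(F f) : m_f(std::move(f)) {}
    ExprPtr apply(ExprPtr const & e, unsigned offset) {
        if (auto r = m_f(e, offset))  // functor may short-circuit with a result
            return *r;
        if (e->lhs)                   // otherwise descend and rebuild
            return mk_app(apply(e->lhs, offset), apply(e->rhs, offset));
        return e;
    }
};

// Templated overload: monomorphizes on the functor at the call site.
template <typename F>
ExprPtr replace(ExprPtr const & e, F && f) {
    return ReplaceRecFn<std::decay_t<F>>(std::forward<F>(f)).apply(e, 0);
}

// Legacy-style wrapper: instantiates the template once with std::function as
// the functor type, so callers holding a std::function keep working.
using ReplaceFn = std::function<std::optional<ExprPtr>(ExprPtr const &, unsigned)>;
ExprPtr replace(ExprPtr const & e, ReplaceFn const & f) {
    return ReplaceRecFn<ReplaceFn>(f).apply(e, 0);
}
```

With the templated overload, a call like `replace(e, some_lambda)` instantiates `ReplaceRecFn` on the lambda's closure type, so the compiler can inline the closure body into the recursion; passing a `std::function` instead selects the legacy wrapper.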
This PR skips inserting an entry into `replace_rec_fn`'s cache when the result equals the input, i.e. when the user functor decided that the subterm is unchanged.

The dominant kernel callers of `replace` are the `instantiate*` family in `kernel/instantiate.cpp`, and their substitution closure short-circuits to `some_expr(m)` whenever `offset >= get_loose_bvar_range(m)`. That predicate is true for the vast majority of subterms in any non-trivial expression, so the cache was being filled almost exclusively with `(node, offset) -> node` self-mappings. Each such entry costs an insert and pollutes the cache for actually-useful entries, while a re-visit can recompute the same answer by calling the functor again at O(1) cost (the loose-bvar range is memoized on the expr node).

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.55 G` instructions (~3.2%) and `1.74s -> 1.61s` wall-clock (~7.5%).
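A minimal model of the identity-skip rule, with plain ints standing in for the `(node, offset)` cache keys and for expression results; the real `save_result` operates on expr nodes, so `Cache` and its fields are illustrative only:

```cpp
#include <cassert>
#include <unordered_map>

// Minimal model of the identity-skip rule: plain ints stand in for the
// (node, offset) cache keys and for expression results.
struct Cache {
    std::unordered_map<int, int> m_map;

    // In the spirit of replace_rec_fn::save_result after this change: only
    // results that actually differ from the input are remembered, so
    // self-mappings never take up cache space.
    int save_result(int input, int result) {
        if (result != input)
            m_map.emplace(input, result);
        return result;
    }
};
```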
This PR replaces the `r.raw() != e.raw()` skip-when-result-is-input check in `replace_rec_fn::save_result` (introduced earlier in this stack) with an opt-in pre-cache `skip` predicate that callers can supply alongside the rewriter functor. The skip predicate is invoked *before* the cache lookup; if it returns true, the input is returned unchanged with no cache touch.

The earlier identity-skip approach blew up catastrophically on workloads like `Init.WFExtrinsicFix` (~3-4x instruction count), because it dropped caching for the *recursion-identity* case as well as the cheap `m_f`-early-return case. In the recursion-identity case the cache had been remembering "we already descended into this shared subterm and proved it's unchanged"; without that memo, every revisit redid the full recursive descent and the work compounded exponentially. The skip predicate fixes this by separating the two cases: only the cheap "no work needed for this subtree" guard goes into `skip`, and the ordinary cache logic handles every other path unchanged.

For `instantiate*`/`lift_loose_bvars`/`lower_loose_bvars`, the predicate is the existing memoized loose-bvar-range check that was previously the first line of `m_f`. It's strictly cheaper than the cache lookup it replaces.

Measurements vs the `xkn` baseline on `leanchecker --fresh`:

|                       | baseline | this PR |
|-----------------------|----------|---------|
| Init.WFExtrinsicFix   | 9.32 G   | 9.07 G  |
| Init.Data.List.Lemmas | 17.10 G  | 16.61 G |

Both workloads improve modestly, with no regression on either.
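The split between the pre-cache guard and the ordinary cache path can be sketched as follows. `ReplaceWithSkip` and its int-based keys are illustrative stand-ins for the PR's actual types; the point is only where `skip` sits relative to the cache:

```cpp
#include <cassert>
#include <unordered_map>

// Model of the pre-cache skip predicate: Skip is the cheap "no work needed
// for this subtree" guard, F does the (possibly identity-returning) rewrite.
// Keys and results are plain ints here; the real code operates on expr nodes.
template <typename Skip, typename F>
struct ReplaceWithSkip {
    Skip m_skip;   // e.g. the memoized loose-bvar-range check
    F m_f;
    std::unordered_map<int, int> m_cache;

    int apply(int e) {
        if (m_skip(e))            // cheap guard: return unchanged, no cache touch
            return e;
        auto it = m_cache.find(e);
        if (it != m_cache.end())  // ordinary cache hit
            return it->second;
        int r = m_f(e);           // may be an identity result after a deep descent
        m_cache.emplace(e, r);    // cached even when r == e, unlike identity-skip
        return r;
    }
};
```

Recursion-identity results still land in the cache, so a shared subterm that was expensively proved unchanged is never re-descended; only the guard-cheap case bypasses the cache entirely.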
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn` with a small-buffer cache that stores its first 16 entries inline in uninitialized stack storage and only allocates a real hash map for the rare large traversal.

Instrumentation across a full `leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of `replace_rec_fn` instances hold at most 15 entries and only 0.21% exceed 128, with a mean cache size of just 9 entries spread across ~950k instances. At that scale a hash map is the wrong data structure: its per-instance bucket-array allocation and per-entry node allocation dwarf the cost of a linear scan over a handful of entries. The new structure costs nothing for entries that are never inserted, performs no allocation at all on the common path, and falls back to the original `unordered_map` once the inline buffer fills. Combined with the existing `is_likely_unshared` filter, lookups on the common path are just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s` wall-clock (~6.7%) compared to the previous baseline. It supersedes the prior `try_emplace` and `reserve(128)` micro-optimizations on the same cache, both of which are no longer needed since the hash map is no longer on the hot path.
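A sketch of the small-buffer idea under simplifying assumptions: entries are default-constructed rather than placed in uninitialized storage, copying/moving is left out, and `SmallCache` is an illustrative name, not the PR's class:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Sketch of a small-buffer cache: the first N entries live inline and are
// found by a linear scan; only when the buffer fills does a heap-allocated
// hash map take over.
template <typename K, typename V, std::size_t N = 16>
class SmallCache {
    std::pair<K, V>            m_buf[N];         // inline storage, no allocation
    std::size_t                m_size = 0;
    std::unordered_map<K, V> * m_big = nullptr;  // fallback for rare large traversals
public:
    ~SmallCache() { delete m_big; }

    V const * find(K const & k) const {
        for (std::size_t i = 0; i < m_size; i++)  // tight scan on the common path
            if (m_buf[i].first == k)
                return &m_buf[i].second;
        if (m_big) {
            auto it = m_big->find(k);
            if (it != m_big->end())
                return &it->second;
        }
        return nullptr;
    }

    void insert(K const & k, V const & v) {
        if (m_size < N) {
            m_buf[m_size++] = {k, v};             // common path: no allocation
        } else {
            if (!m_big)
                m_big = new std::unordered_map<K, V>();
            (*m_big)[k] = v;                      // rare path: spill to the hash map
        }
    }
};
```

With a mean of ~9 entries per instance, the scan over `m_buf` stays within one or two cache lines and the `unordered_map` branch is almost never taken.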
This PR adds a dedicated `instantiate_one_core` helper that handles the single-substitution case directly, and dispatches to it from all of the `instantiate*` entry points when `n == 1` (and, in the general `instantiate(a, s, n, subst)` variant, when `s == 0`). Input-shape instrumentation across a typical workload (`leanchecker --fresh Init.Data.List.Lemmas`, ~690k `instantiate*` calls) shows that `s == 0` holds in 100% of calls and `n == 1` in ~62%, so this specialized path handles the majority of real work.

The specialized closure collapses the generic version's overflow-guarded arithmetic, array indirection, and two-branch `vidx >= s1` / `vidx < h` check into a single equality test plus a shift-down fallthrough, tightening the hot path and reducing codegen size.

Standalone against the `xkn` baseline this shaves only ~30 M instructions (~0.17%) on `Init.Data.List.Lemmas`, because the `std::function` dispatch dominates per-node cost and drowns out the closure-body savings. The improvement compounds meaningfully when combined with the templated `replace_rec_fn` dispatch on the combined-perf stack: there the same change saves ~305 M instructions (~2.1%), because the specialized body is fully inlined into `replace_rec_fn::apply`.

Rev variant: `instantiate_rev` with `n=1` is identical to forward `instantiate` with `n=1` (the "reverse" index calculation collapses), so it routes through the same helper.
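The collapse of the range check can be illustrated on bare de Bruijn indices. `inst_bvar`, `inst_bvar_one`, and the `Action` result are hypothetical stand-ins that model only the branch structure, not the kernel's actual substitution (lifting of the substituted value and the expr representation are omitted):

```cpp
#include <cassert>

// Branch-structure model of the bound-variable case. A bvar with de Bruijn
// index vidx is visited at binder depth `offset`; the substitution window
// covers indices [offset, offset + n).
enum class Action { Keep, Subst, ShiftDown };

// Generic shape: n substitutions, two-branch window check.
Action inst_bvar(unsigned vidx, unsigned offset, unsigned n) {
    if (vidx < offset) return Action::Keep;       // bound under a local binder
    if (vidx < offset + n) return Action::Subst;  // inside the window
    return Action::ShiftDown;                     // loose beyond it: index drops by n
}

// Specialized n == 1 shape. Precondition: vidx >= offset, which the
// loose-bvar-range early return already guarantees (it keeps any locally
// bound variable), so only the equality test and shift-down remain.
Action inst_bvar_one(unsigned vidx, unsigned offset) {
    return vidx == offset ? Action::Subst : Action::ShiftDown;
}
```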
Member
Author
!bench
Benchmark results for ebbdfca against 41ab492 are in. Significant changes detected! @Kha
Large changes (53✅), medium changes (80✅), small changes (1335✅, 1🟥). Too many entries to display here; view the full report on radar instead.
Collaborator
Reference manual CI status:
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request on Apr 13, 2026.

mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request on Apr 13, 2026.
Mathlib CI status (docs):
Member
Author
Noise against #13381