perf: specialize instantiate for the n=1 case #13385

Closed

Kha wants to merge 10 commits into leanprover:master
Conversation
This PR turns `replace_rec_fn` into a class template parameterized on the functor type and exposes it from `kernel/replace_fn.h`, alongside templated `replace<F>(...)` overloads that monomorphize on the functor at the call site. Hot callers like `instantiate`, `instantiate_rev`, and the C-exported `lean_expr_instantiate*_core` variants in `kernel/instantiate.cpp` all pass their substitution closure as a stateless lambda; previously each call into the closure went through `std::function`'s indirect dispatch on every recursion step, blocking inlining of the loose-bvar-range early return. With this change the closure body inlines into `replace_rec_fn::apply` and the `std::function` indirection is gone.

The legacy `replace(expr, std::function<...>, bool)` overload is preserved as a thin out-of-line wrapper that instantiates the template once with `std::function` as the functor type, so external callers continue to work unchanged. All existing internal callers benefit automatically since they were already passing lambdas directly; no caller-side changes were required.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.50 G` instructions (~3.5%) and `1.74s -> 1.69s` wall-clock (~3%).
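The dispatch change can be sketched on a toy expression type. `Expr`, `ReplaceRecFn`, and the functor signature below are illustrative stand-ins, not the kernel's actual definitions; only the shape of the templated-vs-`std::function` split matches the PR:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <type_traits>
#include <utility>

// Toy stand-in for the kernel expression type: a leaf holds a bound-variable
// index, an inner node applies its two children.
struct Expr {
    int bvar;                        // >= 0 for a bvar leaf, -1 for an app node
    std::shared_ptr<Expr> lhs, rhs;  // children of an app node
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr mk_bvar(int i)               { return std::make_shared<Expr>(Expr{i, nullptr, nullptr}); }
ExprPtr mk_app(ExprPtr l, ExprPtr r) { return std::make_shared<Expr>(Expr{-1, std::move(l), std::move(r)}); }

// Class template parameterized on the functor type, in the spirit of the
// templated replace_rec_fn: F is fixed at compile time, so the functor body
// can inline into apply() with no std::function indirection per node.
template <typename F>
struct ReplaceRecFn {
    F m_f;
    explicit ReplaceRecFn(F f) : m_f(std::move(f)) {}
    ExprPtr apply(ExprPtr const & e, unsigned offset) {
        if (auto r = m_f(e, offset))  // functor may short-circuit with a result
            return *r;
        if (e->lhs)                   // otherwise descend and rebuild
            return mk_app(apply(e->lhs, offset), apply(e->rhs, offset));
        return e;
    }
};

// Templated overload: monomorphizes on the functor at the call site.
template <typename F>
ExprPtr replace(ExprPtr const & e, F && f) {
    return ReplaceRecFn<std::decay_t<F>>(std::forward<F>(f)).apply(e, 0);
}

// Legacy-style wrapper: instantiates the template once with std::function as
// the functor type, so callers holding a std::function keep working.
using ReplaceFn = std::function<std::optional<ExprPtr>(ExprPtr const &, unsigned)>;
ExprPtr replace(ExprPtr const & e, ReplaceFn const & f) {
    return ReplaceRecFn<ReplaceFn>(f).apply(e, 0);
}
```

With the templated overload, a call like `replace(e, some_lambda)` instantiates `ReplaceRecFn` on the lambda's closure type, so the compiler can inline the closure body into the recursion; passing a `std::function` instead selects the legacy wrapper.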
This PR skips inserting an entry into `replace_rec_fn`'s cache when the result equals the input, i.e. when the user functor decided that the subterm is unchanged.

The dominant kernel callers of `replace` are the `instantiate*` family in `kernel/instantiate.cpp`, and their substitution closure short-circuits to `some_expr(m)` whenever `offset >= get_loose_bvar_range(m)`. That predicate is true for the vast majority of subterms in any non-trivial expression, so the cache was being filled almost exclusively with `(node, offset) -> node` self-mappings. Each such entry costs an insert and pollutes the cache for actually-useful entries, while a re-visit can recompute the same answer by calling the functor again at O(1) cost (the loose-bvar range is memoized on the expr node).

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.55 G` instructions (~3.2%) and `1.74s -> 1.61s` wall-clock (~7.5%).
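A minimal model of the identity-skip rule, with plain ints standing in for the `(node, offset)` cache keys and for expression results; the real `save_result` operates on expr nodes, so `Cache` and its fields are illustrative only:

```cpp
#include <cassert>
#include <unordered_map>

// Minimal model of the identity-skip rule: plain ints stand in for the
// (node, offset) cache keys and for expression results.
struct Cache {
    std::unordered_map<int, int> m_map;

    // In the spirit of replace_rec_fn::save_result after this change: only
    // results that actually differ from the input are remembered, so
    // self-mappings never take up cache space.
    int save_result(int input, int result) {
        if (result != input)
            m_map.emplace(input, result);
        return result;
    }
};
```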
This PR replaces the `r.raw() != e.raw()` skip-when-result-is-input check in `replace_rec_fn::save_result` (introduced earlier in this stack) with an opt-in pre-cache `skip` predicate that callers can supply alongside the rewriter functor. The skip predicate is invoked *before* the cache lookup; if it returns true, the input is returned unchanged with no cache touch.

The earlier identity-skip approach blew up catastrophically on workloads like `Init.WFExtrinsicFix` (~3-4x instruction count), because it dropped caching for the *recursion-identity* case as well as the cheap `m_f`-early-return case. In the recursion-identity case the cache had been remembering "we already descended into this shared subterm and proved it's unchanged"; without that memo, every revisit redid the full recursive descent and the work compounded exponentially. The skip predicate fixes this by separating the two cases: only the cheap "no work needed for this subtree" guard goes into `skip`, and the ordinary cache logic handles every other path unchanged.

For `instantiate*`/`lift_loose_bvars`/`lower_loose_bvars`, the predicate is the existing memoized loose-bvar-range check that was previously the first line of `m_f`. It's strictly cheaper than the cache lookup it replaces.

Measurements vs the `xkn` baseline on `leanchecker --fresh`:

|                       | baseline | this PR |
|-----------------------|----------|---------|
| Init.WFExtrinsicFix   | 9.32 G   | 9.07 G  |
| Init.Data.List.Lemmas | 17.10 G  | 16.61 G |

Both workloads improve modestly, with no regression on either.
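The split between the pre-cache guard and the ordinary cache path can be sketched as follows. `ReplaceWithSkip` and its int-based keys are illustrative stand-ins for the PR's actual types; the point is only where `skip` sits relative to the cache:

```cpp
#include <cassert>
#include <unordered_map>

// Model of the pre-cache skip predicate: Skip is the cheap "no work needed
// for this subtree" guard, F does the (possibly identity-returning) rewrite.
// Keys and results are plain ints here; the real code operates on expr nodes.
template <typename Skip, typename F>
struct ReplaceWithSkip {
    Skip m_skip;   // e.g. the memoized loose-bvar-range check
    F m_f;
    std::unordered_map<int, int> m_cache;

    int apply(int e) {
        if (m_skip(e))            // cheap guard: return unchanged, no cache touch
            return e;
        auto it = m_cache.find(e);
        if (it != m_cache.end())  // ordinary cache hit
            return it->second;
        int r = m_f(e);           // may be an identity result after a deep descent
        m_cache.emplace(e, r);    // cached even when r == e, unlike identity-skip
        return r;
    }
};
```

Recursion-identity results still land in the cache, so a shared subterm that was expensively proved unchanged is never re-descended; only the guard-cheap case bypasses the cache entirely.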
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn` with a small-buffer cache that stores its first 16 entries inline in uninitialized stack storage and only allocates a real hash map for the rare large traversal.

Instrumentation across a full `leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of `replace_rec_fn` instances hold at most 15 entries and only 0.21% exceed 128, with a mean cache size of just 9 entries spread across ~950k instances. At that scale a hash map is the wrong data structure: its per-instance bucket-array allocation and per-entry node allocation dwarf the cost of a linear scan over a handful of entries. The new structure costs nothing for entries that are never inserted, performs no allocation at all on the common path, and falls back to the original `unordered_map` once the inline buffer fills. Combined with the existing `is_likely_unshared` filter, lookups on the common path are just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s` wall-clock (~6.7%) compared to the previous baseline. It supersedes the prior `try_emplace` and `reserve(128)` micro-optimizations on the same cache, both of which are no longer needed since the hash map is no longer on the hot path.
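A sketch of the small-buffer idea under simplifying assumptions: entries are default-constructed rather than placed in uninitialized storage, copying/moving is left out, and `SmallCache` is an illustrative name, not the PR's class:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Sketch of a small-buffer cache: the first N entries live inline and are
// found by a linear scan; only when the buffer fills does a heap-allocated
// hash map take over.
template <typename K, typename V, std::size_t N = 16>
class SmallCache {
    std::pair<K, V>            m_buf[N];         // inline storage, no allocation
    std::size_t                m_size = 0;
    std::unordered_map<K, V> * m_big = nullptr;  // fallback for rare large traversals
public:
    ~SmallCache() { delete m_big; }

    V const * find(K const & k) const {
        for (std::size_t i = 0; i < m_size; i++)  // tight scan on the common path
            if (m_buf[i].first == k)
                return &m_buf[i].second;
        if (m_big) {
            auto it = m_big->find(k);
            if (it != m_big->end())
                return &it->second;
        }
        return nullptr;
    }

    void insert(K const & k, V const & v) {
        if (m_size < N) {
            m_buf[m_size++] = {k, v};             // common path: no allocation
        } else {
            if (!m_big)
                m_big = new std::unordered_map<K, V>();
            (*m_big)[k] = v;                      // rare path: spill to the hash map
        }
    }
};
```

With a mean of ~9 entries per instance, the scan over `m_buf` stays within one or two cache lines and the `unordered_map` branch is almost never taken.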
This PR adds a dedicated `instantiate_one_core` helper that handles the single-substitution case directly, and dispatches to it from all of the `instantiate*` entry points when `n == 1` (and, in the general `instantiate(a, s, n, subst)` variant, when `s == 0`). Input-shape instrumentation across a typical workload (`leanchecker --fresh Init.Data.List.Lemmas`, ~690k `instantiate*` calls) shows that `s == 0` holds in 100% of calls and `n == 1` in ~62%, so this specialized path handles the majority of real work.

The specialized closure collapses the generic version's overflow-guarded arithmetic, array indirection, and two-branch `vidx >= s1` / `vidx < h` check into a single equality test plus a shift-down fallthrough, tightening the hot path and reducing codegen size.

Standalone against the `xkn` baseline this shaves only ~30 M instructions (~0.17%) on `Init.Data.List.Lemmas`, because the `std::function` dispatch dominates per-node cost and drowns out the closure-body savings. The improvement compounds meaningfully when combined with the templated `replace_rec_fn` dispatch on the combined-perf stack: there the same change saves ~305 M instructions (~2.1%), because the specialized body is fully inlined into `replace_rec_fn::apply`.

Rev variant: `instantiate_rev` with `n=1` is identical to forward `instantiate` with `n=1` (the "reverse" index calculation collapses), so it routes through the same helper.
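The collapse of the range check can be illustrated on bare de Bruijn indices. `inst_bvar`, `inst_bvar_one`, and the `Action` result are hypothetical stand-ins that model only the branch structure, not the kernel's actual substitution (lifting of the substituted value and the expr representation are omitted):

```cpp
#include <cassert>

// Branch-structure model of the bound-variable case. A bvar with de Bruijn
// index vidx is visited at binder depth `offset`; the substitution window
// covers indices [offset, offset + n).
enum class Action { Keep, Subst, ShiftDown };

// Generic shape: n substitutions, two-branch window check.
Action inst_bvar(unsigned vidx, unsigned offset, unsigned n) {
    if (vidx < offset) return Action::Keep;       // bound under a local binder
    if (vidx < offset + n) return Action::Subst;  // inside the window
    return Action::ShiftDown;                     // loose beyond it: index drops by n
}

// Specialized n == 1 shape. Precondition: vidx >= offset, which the
// loose-bvar-range early return already guarantees (it keeps any locally
// bound variable), so only the equality test and shift-down remain.
Action inst_bvar_one(unsigned vidx, unsigned offset) {
    return vidx == offset ? Action::Subst : Action::ShiftDown;
}
```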
Member
Author
!bench
Benchmark results for ebbdfca against 41ab492 are in. Significant changes detected! @Kha
Large changes (53✅), medium changes (80✅), small changes (1335✅, 1🟥). Too many entries to display here; view the full report on radar instead.
Collaborator
Reference manual CI status:
mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request on Apr 13, 2026.

mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request on Apr 13, 2026.
Mathlib CI status (docs):
Member
Author
Noise against #13381