Skip to content

perf: combined kernel optimizations#13381

Draft
Kha wants to merge 9 commits intoleanprover:masterfrom
Kha:push-wmorskvrkxrm
Draft

perf: combined kernel optimizations#13381
Kha wants to merge 9 commits intoleanprover:masterfrom
Kha:push-wmorskvrkxrm

Conversation

@Kha
Copy link
Copy Markdown
Member

@Kha Kha commented Apr 12, 2026

No description provided.

Kha added 4 commits April 12, 2026 15:31
This PR turns `replace_rec_fn` into a class template parameterized on the
functor type and exposes it from `kernel/replace_fn.h`, alongside templated
`replace<F>(...)` overloads that monomorphize on the functor at the call
site. Hot callers like `instantiate`, `instantiate_rev`, and the C-exported
`lean_expr_instantiate*_core` variants in `kernel/instantiate.cpp` all pass
their substitution closure as a stateless lambda; previously each call into
the closure went through `std::function`'s indirect dispatch on every
recursion step, blocking inlining of the loose-bvar-range early return.
With this change the closure body inlines into `replace_rec_fn::apply` and
the `std::function` indirection is gone.

The legacy `replace(expr, std::function<...>, bool)` overload is preserved
as a thin out-of-line wrapper that instantiates the template once with
`std::function` as the functor type, so external callers continue to work
unchanged.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.50 G` instructions (~3.5%) and `1.74s -> 1.69s` wall-clock
(~3%). All existing callers benefit automatically since they were already
passing lambdas directly; no caller-side changes were required.
This PR skips inserting an entry into `replace_rec_fn`'s cache when the
result equals the input, i.e. when the user functor decided that the
subterm is unchanged. The dominant kernel callers of `replace` are the
`instantiate*` family in `kernel/instantiate.cpp`, and their substitution
closure short-circuits to `some_expr(m)` whenever
`offset >= get_loose_bvar_range(m)`. That predicate is true for the vast
majority of subterms in any non-trivial expression, so the cache was being
filled almost exclusively with `(node, offset) -> node` self-mappings.
Each such entry costs an insert and pollutes the cache for actually-useful
entries, while a re-visit can recompute the same answer by calling the
functor again at O(1) cost (the loose-bvar range is memoized on the expr
node).

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.55 G` instructions (~3.2%) and `1.74s -> 1.61s` wall-clock
(~7.5%).
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 6ff9c0b against 82bb27f are in. Significant changes detected! @Kha

  • 🟥 build//instructions: +8.5G (+0.07%)

Large changes (50✅, 18🟥)

Too many entries to display here. View the full report on radar instead.

Medium changes (67✅, 8🟥)

Too many entries to display here. View the full report on radar instead.

Small changes (1370✅, 38🟥)

Too many entries to display here. View the full report on radar instead.

This PR replaces the `r.raw() != e.raw()` skip-when-result-is-input check
in `replace_rec_fn::save_result` (introduced earlier in this stack) with
an opt-in pre-cache `skip` predicate that callers can supply alongside
the rewriter functor. The skip predicate is invoked *before* the cache
lookup; if it returns true, the input is returned unchanged with no
cache touch.

The earlier identity-skip approach blew up catastrophically on workloads
like `Init.WFExtrinsicFix` (~3-4x instruction count), because it dropped
caching for the *recursion-identity* case as well as the cheap
`m_f`-early-return case. In the recursion-identity case the cache had
been remembering "we already descended into this shared subterm and
proved it's unchanged" — without that memo, every revisit redid the
full recursive descent and the work compounded exponentially.

The skip predicate fixes this by separating the two cases: only the
cheap "no work needed for this subtree" guard goes into `skip`, and the
ordinary cache logic handles every other path unchanged. For
`instantiate*`/`lift_loose_bvars`/`lower_loose_bvars`, the predicate is
the existing memoized loose-bvar-range check that was previously the
first line of `m_f`. It's strictly cheaper than the cache lookup it
replaces.

Measurements vs the `xkn` baseline on `leanchecker --fresh`:

|                       | baseline | this PR |
|-----------------------|----------|---------|
| Init.WFExtrinsicFix   | 9.32 G   | 9.07 G  |
| Init.Data.List.Lemmas | 17.10 G  | 16.61 G |

Both workloads improve modestly, with no regression on either.
@github-actions github-actions bot added the toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN label Apr 12, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

mathlib-lean-pr-testing bot commented Apr 12, 2026

Mathlib CI status (docs):

  • ❗ Mathlib CI can not be attempted yet, as the nightly-testing-2026-04-09 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-mathlib, Mathlib CI should run now. You can force Mathlib CI using the force-mathlib-ci label. (2026-04-12 17:56:15)
  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase d53b46a0f3a7d0822550f4179ca3dab446a3c448 --onto d53b46a0f3a7d0822550f4179ca3dab446a3c448. You can force Mathlib CI using the force-mathlib-ci label. (2026-04-12 21:44:48)

@leanprover-bot
Copy link
Copy Markdown
Collaborator

leanprover-bot commented Apr 12, 2026

Reference manual CI status:

  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-04-09 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-04-12 17:56:16)
  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase d53b46a0f3a7d0822550f4179ca3dab446a3c448 --onto 861bc19e0c1c45b5766cc5c9ef0488568368b236. You can force reference manual CI using the force-manual-ci label. (2026-04-12 21:44:50)
  • ❗ Reference manual CI can not be attempted yet, as the nightly-testing-2026-04-11 tag does not exist there yet. We will retry when you push more commits. If you rebase your branch onto nightly-with-manual, reference manual CI should run now. You can force reference manual CI using the force-manual-ci label. (2026-04-13 08:32:40)

@Kha Kha force-pushed the push-wmorskvrkxrm branch from 6ff9c0b to 6ddadf5 Compare April 12, 2026 18:22
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 6ddadf5 against 82bb27f are in. Significant changes detected! @Kha

  • build//instructions: -309.0G (-2.56%)

Large changes (51✅, 1🟥)

Too many entries to display here. View the full report on radar instead.

Medium changes (84✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1352✅, 2🟥)

Too many entries to display here. View the full report on radar instead.

Kha added 3 commits April 12, 2026 20:24
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn`
with a small-buffer cache that stores its first 16 entries inline in
uninitialized stack storage and only allocates a real hash map for the
rare large traversal. Instrumentation across a full
`leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of
`replace_rec_fn` instances hold at most 15 entries and only 0.21%
exceed 128, with a mean cache size of just 9 entries spread across
~950k instances. At that scale a hash map is the wrong data structure:
its per-instance bucket-array allocation and per-entry node allocation
dwarf the cost of a linear scan over a handful of entries.

The new structure pays nothing for entries that are never inserted, no
allocation at all on the common path, and falls back to the original
`unordered_map` once the inline buffer fills. Combined with the
existing `is_likely_unshared` filter, lookups on the common path are
just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves
`17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s`
wall-clock (~6.7%) compared to the previous baseline. It supersedes
the prior `try_emplace` and `reserve(128)` micro-optimizations on the
same cache, both of which are no longer needed since the hash map is
no longer on the hot path.
@Kha Kha force-pushed the push-wmorskvrkxrm branch from 6ddadf5 to 17e07a3 Compare April 12, 2026 20:59
@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for 17e07a3 against d53b46a are in. Significant changes detected! @Kha

  • build//instructions: -304.3G (-2.54%)

Large changes (50✅)

Too many entries to display here. View the full report on radar instead.

Medium changes (78✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1342✅, 2🟥)

Too many entries to display here. View the full report on radar instead.

@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 12, 2026

!bench mathlib

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 12, 2026

Benchmark results for leanprover-community/mathlib4-nightly-testing@8082e83 against leanprover-community/mathlib4-nightly-testing@bea414c are in. Significant changes detected! @Kha

  • build//instructions: -4.1T (-2.41%)

Large changes (2✅)

  • build//instructions: -4.1T (-2.41%)
  • build/profile/type checking//wall-clock: -3m 53s (-12.77%)

and 1 hidden

Medium changes (92✅)

Too many entries to display here. View the full report on radar instead.

Small changes (826✅, 3🟥)

Too many entries to display here. View the full report on radar instead.

mathlib-nightly-testing bot pushed a commit to leanprover-community/batteries that referenced this pull request Apr 13, 2026
@github-actions github-actions bot added the mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN label Apr 13, 2026
mathlib-nightly-testing bot pushed a commit to leanprover-community/mathlib4-nightly-testing that referenced this pull request Apr 13, 2026
@mathlib-lean-pr-testing mathlib-lean-pr-testing bot added the builds-mathlib CI has verified that Mathlib builds against this PR label Apr 13, 2026
@mathlib-lean-pr-testing
Copy link
Copy Markdown

Mathlib CI status (docs):

@Kha
Copy link
Copy Markdown
Member Author

Kha commented Apr 13, 2026

!bench

@leanprover-radar
Copy link
Copy Markdown

leanprover-radar commented Apr 13, 2026

Benchmark results for b3beaf8 against 41ab492 are in. Significant changes detected! @Kha

  • build//instructions: -306.0G (-2.55%)

Large changes (52✅)

Too many entries to display here. View the full report on radar instead.

Medium changes (75✅)

Too many entries to display here. View the full report on radar instead.

Small changes (1331✅, 1🟥)

Too many entries to display here. View the full report on radar instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

builds-mathlib CI has verified that Mathlib builds against this PR mathlib4-nightly-available A branch for this PR exists at leanprover-community/mathlib4-nightly-testing:lean-pr-testing-NNNN toolchain-available A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants