perf: small-buffer inline cache for replace_rec_fn #13378
Draft
Kha wants to merge 1 commit into leanprover:master
Conversation

Author
!bench

Benchmark results for 9398a0b against 82bb27f are in. Significant changes detected! @Kha
Large changes (11✅, 5🟥)
Medium changes (28✅, 3🟥): too many entries to display here. View the full report on radar instead.
Small changes (895✅, 29🟥): too many entries to display here. View the full report on radar instead.

Mathlib CI status (docs):

Collaborator
Reference manual CI status:
This PR replaces the `std::unordered_map`-based cache in `replace_rec_fn` with a small-buffer cache that stores its first 16 entries inline in uninitialized stack storage and only allocates a real hash map for the rare large traversal. Instrumentation across a full `leanchecker --fresh Init.Data.List.Lemmas` run shows that 87% of `replace_rec_fn` instances hold at most 15 entries and only 0.21% exceed 128, with a mean cache size of just 9 entries spread across ~950k instances. At that scale a hash map is the wrong data structure: its per-instance bucket-array allocation and per-entry node allocation dwarf the cost of a linear scan over a handful of entries.

The new structure pays no initialization cost for entries that are never inserted, performs no allocation at all on the common path, and falls back to the original `unordered_map` once the inline buffer fills. Combined with the existing `is_likely_unshared` filter, lookups on the common path are just a tight scan over a stack-resident array.

On `leanchecker --fresh Init.Data.List.Lemmas` this shaves `17.10 G -> 16.18 G` instructions (~5.4%) and `1.74s -> 1.62s` wall-clock (~6.7%) compared to the previous baseline. It supersedes the prior `try_emplace` and `reserve(128)` micro-optimizations on the same cache, both of which are no longer needed since the hash map is no longer on the hot path.