Skip to content

statistics: add combined TopN+histogram merge for global stats#66221

Draft
mjonss wants to merge 23 commits intopingcap:masterfrom
mjonss:improve-global-stats-1
Draft

statistics: add combined TopN+histogram merge for global stats#66221
mjonss wants to merge 23 commits intopingcap:masterfrom
mjonss:improve-global-stats-1

Conversation

@mjonss
Copy link
Contributor

@mjonss mjonss commented Feb 11, 2026

What problem does this PR solve?

Issue Number: ref #66220

Problem Summary:
Global stats merge after ANALYZE TABLE on partitioned tables is slow due to an O(P^2) TopN merge algorithm. MergePartTopN2GlobalTopN (the separate merge) iterates every other partition's histogram for each TopN value (FindTopN + EqualRowCount), which becomes a bottleneck as partition count grows. For column histograms, EqualRowCount returns totalRows/NDV — a uniform estimate that inflates counts for values appearing inside histogram buckets. The subsequent BinarySearchRemoveVal calls mutate histograms before they are merged, reducing histogram quality downstream.

What changed and how does it work?

Combined TopN+histogram merge (MergePartTopNAndHistToGlobal)

Merges partition TopN lists and histograms in a single pass. It sums TopN counts across partitions (O(T*P) via hash map, no histogram lookups) and extracts Repeat counts from histogram bucket upper bounds into the TopN counter during bucket collection. This catches the "spread value" scenario (values in some partitions' TopN and at histogram upper bounds in others) without the separate merge's inflation problem, at zero extra passes over the data.

Key design:

  • Extract mergeHistBucketsToGlobalHist shared helper from MergePartitionHist2GlobalHist to reuse the ~200-line bucket merge core
  • During histogram bucket collection, move each bucket's upper-bound Repeat into the TopN counter and zero it in the bucket
  • Skip recalculate-repeats in the shared helper (pass nil) since repeats are already handled during extraction
  • Wire into callers: mergeConcurrency==0 uses the new combined merge, otherwise falls through to the existing separate/concurrent flow

Gating: The new combined path is enabled by setting tidb_merge_partition_stats_concurrency = 0. The default remains 1 (separate sequential). Setting it to >=2 uses separate concurrent — both existing paths are unchanged.

Full-pipeline benchmark results (TopN + histogram merge, 500 TopN values per partition):

  ┌────────────┬──────────┬──────────┬──────────┬─────────┐
  │ Partitions │ Separate │ Concur.  │ Combined │ Speedup │
  ├────────────┼──────────┼──────────┼──────────┼─────────┤
  │        100 │   18.9ms │   21.0ms │    4.1ms │    4.5x │
  │      1,000 │    340ms │    353ms │   36.8ms │      9x │
  │      2,000 │    750ms │    783ms │   72.4ms │     10x │
  │      5,000 │    2.11s │    1.98s │    189ms │     10x │
  │     10,000 │    4.85s │    3.71s │    432ms │  9–11x  │
  └────────────┴──────────┴──────────┴──────────┴─────────┘

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Add a faster combined global statistics merge for partitioned tables that avoids the separate merge's O(P^2) histogram lookups and count inflation. Enable by setting `tidb_merge_partition_stats_concurrency = 0`. Achieves 4–11x speedup for the full TopN + histogram merge pipeline.

… concurrency=0

The existing TopN merge (MergePartTopN2GlobalTopN) performs O(P^2)
histogram lookups for each TopN value across all partitions, which is
slow for tables with many partitions. The histogram estimates it retrieves
(totalRows/NDV for columns) add noise rather than accuracy, and the
subsequent BinarySearchRemoveVal calls mutate histograms before they
are merged, reducing histogram quality.

Add MergePartTopN2GlobalTopNV2 which merges TopN counts using a simple
hash map without consulting histograms. This is O(T*P) and leaves
histograms unmodified for the subsequent histogram merge step.

Gate the new path behind tidb_merge_partition_stats_concurrency=0
(new default). Setting it to 1+ falls back to existing V1 paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ti-chi-bot
Copy link

ti-chi-bot bot commented Feb 11, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-tests-checked release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. component/statistics sig/planner SIG: Planner labels Feb 11, 2026
@tiprow
Copy link

tiprow bot commented Feb 11, 2026

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mjonss
Copy link
Contributor Author

mjonss commented Feb 11, 2026

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Feb 11, 2026
@codecov
Copy link

codecov bot commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 61.49425% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.2140%. Comparing base (e67d561) to head (ca057d6).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #66221        +/-   ##
================================================
+ Coverage   77.7081%   78.2140%   +0.5059%     
================================================
  Files          2004       1937        -67     
  Lines        547284     536319     -10965     
================================================
- Hits         425284     419477      -5807     
+ Misses       120340     116398      -3942     
+ Partials       1660        444      -1216     
Flag Coverage Δ
integration 44.0736% <19.5402%> (-4.1144%) ⬇️
unit 76.6250% <61.4942%> (+0.2656%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.7974% <ø> (ø)
parser ∅ <ø> (∅)
br 48.7893% <ø> (-12.0891%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2026
@mjonss mjonss marked this pull request as ready for review February 12, 2026 00:31
@ti-chi-bot ti-chi-bot bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-tests-checked labels Feb 12, 2026
mjonss and others added 7 commits February 12, 2026 01:40
…stAndTopN

Add tests that compare V1 and V2 TopN merge behavior and accuracy:
- Comparison tests with hand-crafted histograms showing V1's histogram
  inflation vs V2's TopN-only counts, including full histogram merge
- Accuracy test using BuildHistAndTopN (the real ANALYZE pipeline) to
  build partition stats from known data, then comparing estimated counts
  against true counts for both V1 and V2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
V1 accumulates NotNullCount/NDV uniform estimates from every partition
where a TopN value falls inside a histogram bucket. With 20 partitions
this inflates val_rare from true=120 to 1920 (16x), causing V1 to pick
the wrong value for the global TopN. V2 counts TopN entries only and
correctly picks val_common (200, exact).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d histogram sizes

All four V1/V2 comparison tests now use consistent, realistic stats:
- 2 TopN entries per partition (every partition has the same count)
- 4 histogram buckets per partition
- globalTopN=2

Updated tests:
- TestMergeTopNV1V2ComparisonHistValueNotAtUpperBound
- TestMergeTopNV1V2ComparisonHistValueAtUpperBound
- TestMergeTopNV2BetterThanV1ManyPartitionsInflation (renamed)
- TestMergeTopNV1BetterThanV2SpreadValue

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verify that MergePartTopN2GlobalTopNV2 never loses or creates counts:
globalTopN.TotalCount() + sum(leftTopN counts) == sum(input TopN counts).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2026
@mjonss mjonss requested a review from Copilot February 12, 2026 16:41
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new “TopN-only” global TopN merge path (V2) for partitioned-table global statistics to avoid O(P²) histogram lookups/mutations, and makes it the default via tidb_merge_partition_stats_concurrency = 0.

Changes:

  • Add MergePartTopN2GlobalTopNV2 and route global TopN merge to it when tidb_merge_partition_stats_concurrency=0.
  • Update default and validation range for tidb_merge_partition_stats_concurrency to allow 0 and make it the new default.
  • Add extensive tests + benchmarks for V2 and adjust existing expectations.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pkg/statistics/handle/globalstats/topn.go Adds V2 TopN-only merge and switches behavior when merge concurrency is 0.
pkg/sessionctx/vardef/tidb_vars.go Changes default tidb_merge_partition_stats_concurrency from 1 to 0.
pkg/sessionctx/variable/sysvar.go Allows sysvar min value 0 so the new default/mode is valid.
pkg/statistics/handle/globalstats/topn_test.go Adds broad unit tests comparing V1 vs V2 behavior and invariants.
pkg/statistics/handle/globalstats/topn_bench_test.go Adds benchmark coverage for the V2 merge.
pkg/statistics/handle/globalstats/global_stats_test.go Adjusts/extends integration tests for new merge modes, adds new test.
pkg/statistics/handle/globalstats/global_stats_internal_test.go Splits Issue #24349 test into parts and adds a V2 expectation path.
pkg/statistics/handle/handletest/handle_test.go Updates expected global TopN results under new behavior.
pkg/statistics/handle/globalstats/BUILD.bazel Updates test sharding and adds a dependency for new tests.

mjonss and others added 2 commits February 12, 2026 17:42
Keep V1 as the default TopN merge strategy. Users can opt in to V2
by setting tidb_merge_partition_stats_concurrency = 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…und repeats

V2 (MergePartTopN2GlobalTopNV2) merges TopN purely from TopN lists,
avoiding V1's O(P^2) histogram lookups. However, V2 underestimates
global TopN counts for values that appear in some partitions' TopN
and at other partitions' histogram bucket upper bounds (the "spread
value" scenario).

Add MergePartTopNAndHistToGlobal which combines V2's TopN-only merge
with extraction of significant Repeat counts from histogram bucket
upper bounds during bucket collection. This catches the spread value
scenario without V1's inflation problem, at zero extra passes.

Key design:
- Extract mergeHistBucketsToGlobalHist helper from
  MergePartitionHist2GlobalHist to share the ~200-line bucket merge
  core between the old and new paths
- During histogram bucket collection, move each bucket's upper-bound
  Repeat into the TopN counter and zero it in the bucket
- Skip recalculate-repeats in the shared helper (pass nil) since
  repeats are already handled during extraction
- Wire into callers: mergeConcurrency==0 uses the new combined merge,
  otherwise falls through to the existing V1/concurrent flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ti-chi-bot
Copy link

ti-chi-bot bot commented Feb 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, wddevries for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mjonss and others added 3 commits February 13, 2026 00:29
…d hybrid merge

Add BenchmarkFullPipeline with sub-benchmarks for all three merge
approaches (V1, Concurrent, Hybrid), each running the complete
TopN + histogram merge pipeline. This enables direct comparison
since the existing TopN-only benchmarks excluded the histogram
merge step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mjonss
Copy link
Contributor Author

mjonss commented Feb 13, 2026

Added new benchmark for full merge of both TopN and Histograms, results:

goos: darwin
goarch: arm64
pkg: github.com/pingcap/tidb/pkg/statistics/handle/globalstats
cpu: Apple M3 Max
BenchmarkFullPipeline/Size100/V1-16         	      73	  16651097 ns/op	  11670843 B/op	  152195 allocs/op
BenchmarkFullPipeline/Size100/Concurrent-16 	      58	  18415642 ns/op	  11801191 B/op	  152248 allocs/op
BenchmarkFullPipeline/Size100/Hybrid-16     	     285	   4084806 ns/op	   2582601 B/op	   52579 allocs/op
BenchmarkFullPipeline/Size1000/V1-16        	       4	 305600271 ns/op	 133413962 B/op	 1784681 allocs/op
BenchmarkFullPipeline/Size1000/Concurrent-16           4	 309953864 ns/op	 134767476 B/op	 1784762 allocs/op
BenchmarkFullPipeline/Size1000/Hybrid-16              31	  36142910 ns/op	  25173858 B/op	  507080 allocs/op
BenchmarkFullPipeline/Size2000/V1-16                   2	 699503000 ns/op	 305494868 B/op	 4129758 allocs/op
BenchmarkFullPipeline/Size2000/Concurrent-16           2	 651071270 ns/op	 307606892 B/op	 4129821 allocs/op
BenchmarkFullPipeline/Size2000/Hybrid-16              14	  75254164 ns/op	  56250289 B/op	 1091534 allocs/op
BenchmarkFullPipeline/Size5000/V1-16                   1	1923521375 ns/op	 961367768 B/op	13133361 allocs/op
BenchmarkFullPipeline/Size5000/Concurrent-16           1	1672338125 ns/op	 978180880 B/op	13133435 allocs/op
BenchmarkFullPipeline/Size5000/Hybrid-16               5	 220651975 ns/op	 174129582 B/op	 2527110 allocs/op
BenchmarkFullPipeline/Size10000/V1-16                  1	4547595042 ns/op	1913918368 B/op	26263403 allocs/op
BenchmarkFullPipeline/Size10000/Concurrent-16          1	3645793875 ns/op	1922382112 B/op	26263508 allocs/op
BenchmarkFullPipeline/Size10000/Hybrid-16              3	 501890792 ns/op	 695332008 B/op	10041902 allocs/op

So both faster and using less memory and less allocations

@mjonss mjonss changed the title statistics: add TopN-only merge (V2) for global stats atistics: add hybrid TopN+histogram merge (V2) for global stats Feb 13, 2026
@mjonss mjonss changed the title atistics: add hybrid TopN+histogram merge (V2) for global stats statistics: add hybrid TopN+histogram merge (V2) for global stats Feb 13, 2026
mjonss and others added 8 commits February 13, 2026 09:09
…s load timeout

The test was intermittently failing in CI because show stats_topn reads
from the in-memory stats cache, which uses lazy loading with a default
sync wait of only 100ms. Under CI load this timeout can expire before
all column/index stats are loaded, producing incorrect results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix typo "since since both a and be is" → "since both a and b are"
- Handle codec.EncodeKey errors with require.NoError instead of ignoring
- Use session scope for tidb_merge_partition_stats_concurrency in tests
  instead of global scope to avoid potential cross-test interference
- Merge testIssues24349Part1/Part2/Part2V2 back into single
  testIssues24349 since V1 and hybrid produce identical results
- Improve comments with accurate merge explanation and annotated output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The V2 TopN-only merge function had no production callers — only tests
and benchmarks. The production path uses MergePartTopNAndHistToGlobal
(hybrid, concurrency=0) or MergePartTopN2GlobalTopN (V1, concurrency>=1).

Remove the function, its dedicated tests, V2 sections from V1-vs-V2
comparison tests, and its benchmark.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st helper

Subtract extracted Repeat counts from totCount when moving histogram
upper-bound Repeats into the TopN counter. Without this, totCount was
over-estimated by the sum of extracted Repeats that ended up in
globalTopN, causing slightly suboptimal histogram bucket splitting.

Also remove unused findTopNMetaCountByEncoded test helper left behind
after V2 test removal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace version-based naming (V1, V2, Hybrid) with descriptive terms:
- "Separate" for the two-phase TopN-then-histogram merge flow
- "Combined" for the single-pass TopN+histogram merge

This avoids confusion since "V2" was already used for the concurrent
merge path and for analyze version 2.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate

Replace the unreliable explain + tidb_stats_load_sync_wait approach with
dom.StatsHandle().Update() to force a synchronous full stats cache
refresh from storage after analyze. This is the standard pattern used
by other stats tests and avoids CI timeouts with large datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mjonss mjonss changed the title statistics: add hybrid TopN+histogram merge (V2) for global stats statistics: add combined TopN+histogram merge for global stats Feb 13, 2026
@mjonss mjonss marked this pull request as draft February 15, 2026 22:49
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 15, 2026
@mjonss
Copy link
Contributor Author

mjonss commented Feb 15, 2026

Investigating if #66289 is the better solution, i.e. build the TopN and Histograms from samples gathered from the partitions, using the same input as the TopN and Histograms for the partitions itself. That should be both more efficient as well as more accurate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/statistics do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/planner SIG: Planner size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments