statistics: add combined TopN+histogram merge for global stats by mjonss · Pull Request #66221 · pingcap/tidb

mjonss · 2026-02-11T23:06:25Z

What problem does this PR solve?

Issue Number: ref #66220

Problem Summary:
Global stats merge after ANALYZE TABLE on partitioned tables is slow due to an O(P^2) TopN merge algorithm. MergePartTopN2GlobalTopN (the separate merge) iterates every other partition's histogram for each TopN value (FindTopN + EqualRowCount), which becomes a bottleneck as partition count grows. For column histograms, EqualRowCount returns totalRows/NDV — a uniform estimate that inflates counts for values appearing inside histogram buckets. The subsequent BinarySearchRemoveVal calls mutate histograms before they are merged, reducing histogram quality downstream.

What changed and how does it work?

Combined TopN+histogram merge (MergePartTopNAndHistToGlobal)

Merges partition TopN lists and histograms in a single pass. It sums TopN counts across partitions (O(T*P) via hash map, no histogram lookups) and extracts Repeat counts from histogram bucket upper bounds into the TopN counter during bucket collection. This catches the "spread value" scenario (values in some partitions' TopN and at histogram upper bounds in others) without the separate merge's inflation problem, at zero extra passes over the data.

Key design:

Extract mergeHistBucketsToGlobalHist shared helper from MergePartitionHist2GlobalHist to reuse the ~200-line bucket merge core
During histogram bucket collection, move each bucket's upper-bound Repeat into the TopN counter and zero it in the bucket
Skip recalculate-repeats in the shared helper (pass nil) since repeats are already handled during extraction
Wire into callers: mergeConcurrency==0 uses the new combined merge, otherwise falls through to the existing separate/concurrent flow

Gating: The new combined path is enabled by setting tidb_merge_partition_stats_concurrency = 0. The default remains 1 (separate sequential). Setting it to >=2 uses separate concurrent — both existing paths are unchanged.

Full-pipeline benchmark results (TopN + histogram merge, 500 TopN values per partition):

  ┌────────────┬──────────┬──────────┬──────────┬─────────┐
  │ Partitions │ Separate │ Concur.  │ Combined │ Speedup │
  ├────────────┼──────────┼──────────┼──────────┼─────────┤
  │        100 │   18.9ms │   21.0ms │    4.1ms │    4.5x │
  │      1,000 │    340ms │    353ms │   36.8ms │      9x │
  │      2,000 │    750ms │    783ms │   72.4ms │     10x │
  │      5,000 │    2.11s │    1.98s │    189ms │     10x │
  │     10,000 │    4.85s │    3.71s │    432ms │  9–11x  │
  └────────────┴──────────┴──────────┴──────────┴─────────┘

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Add a faster combined global statistics merge for partitioned tables that avoids the separate merge's O(P^2) histogram lookups and count inflation. Enable by setting `tidb_merge_partition_stats_concurrency = 0`. Achieves 4–11x speedup for the full TopN + histogram merge pipeline.

… concurrency=0 The existing TopN merge (MergePartTopN2GlobalTopN) performs O(P^2) histogram lookups for each TopN value across all partitions, which is slow for tables with many partitions. The histogram estimates it retrieves (totalRows/NDV for columns) add noise rather than accuracy, and the subsequent BinarySearchRemoveVal calls mutate histograms before they are merged, reducing histogram quality. Add MergePartTopN2GlobalTopNV2 which merges TopN counts using a simple hash map without consulting histograms. This is O(T*P) and leaves histograms unmodified for the subsequent histogram merge step. Gate the new path behind tidb_merge_partition_stats_concurrency=0 (new default). Setting it to 1+ falls back to existing V1 paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ti-chi-bot · 2026-02-11T23:06:29Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

tiprow · 2026-02-11T23:06:42Z

Hi @mjonss. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

mjonss · 2026-02-11T23:07:09Z

/ok-to-test

pkg/statistics/handle/globalstats/global_stats_test.go

codecov · 2026-02-11T23:30:43Z

Codecov Report

❌ Patch coverage is 61.49425% with 67 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.2140%. Comparing base (e67d561) to head (ca057d6).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #66221        +/-   ##
================================================
+ Coverage   77.7081%   78.2140%   +0.5059%     
================================================
  Files          2004       1937        -67     
  Lines        547284     536319     -10965     
================================================
- Hits         425284     419477      -5807     
+ Misses       120340     116398      -3942     
+ Partials       1660        444      -1216

Flag	Coverage Δ
integration	`44.0736% <19.5402%> (-4.1144%)`	⬇️
unit	`76.6250% <61.4942%> (+0.2656%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
dumpling	`56.7974% <ø> (ø)`
parser	`∅ <ø> (∅)`
br	`48.7893% <ø> (-12.0891%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…stAndTopN Add tests that compare V1 and V2 TopN merge behavior and accuracy: - Comparison tests with hand-crafted histograms showing V1's histogram inflation vs V2's TopN-only counts, including full histogram merge - Accuracy test using BuildHistAndTopN (the real ANALYZE pipeline) to build partition stats from known data, then comparing estimated counts against true counts for both V1 and V2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

V1 accumulates NotNullCount/NDV uniform estimates from every partition where a TopN value falls inside a histogram bucket. With 20 partitions this inflates val_rare from true=120 to 1920 (16x), causing V1 to pick the wrong value for the global TopN. V2 counts TopN entries only and correctly picks val_common (200, exact). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…d histogram sizes All four V1/V2 comparison tests now use consistent, realistic stats: - 2 TopN entries per partition (every partition has the same count) - 4 histogram buckets per partition - globalTopN=2 Updated tests: - TestMergeTopNV1V2ComparisonHistValueNotAtUpperBound - TestMergeTopNV1V2ComparisonHistValueAtUpperBound - TestMergeTopNV2BetterThanV1ManyPartitionsInflation (renamed) - TestMergeTopNV1BetterThanV2SpreadValue Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Verify that MergePartTopN2GlobalTopNV2 never loses or creates counts: globalTopN.TotalCount() + sum(leftTopN counts) == sum(input TopN counts). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR introduces a new “TopN-only” global TopN merge path (V2) for partitioned-table global statistics to avoid O(P²) histogram lookups/mutations, and makes it the default via tidb_merge_partition_stats_concurrency = 0.

Changes:

Add MergePartTopN2GlobalTopNV2 and route global TopN merge to it when tidb_merge_partition_stats_concurrency=0.
Update default and validation range for tidb_merge_partition_stats_concurrency to allow 0 and make it the new default.
Add extensive tests + benchmarks for V2 and adjust existing expectations.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
pkg/statistics/handle/globalstats/topn.go	Adds V2 TopN-only merge and switches behavior when merge concurrency is 0.
pkg/sessionctx/vardef/tidb_vars.go	Changes default `tidb_merge_partition_stats_concurrency` from 1 to 0.
pkg/sessionctx/variable/sysvar.go	Allows sysvar min value 0 so the new default/mode is valid.
pkg/statistics/handle/globalstats/topn_test.go	Adds broad unit tests comparing V1 vs V2 behavior and invariants.
pkg/statistics/handle/globalstats/topn_bench_test.go	Adds benchmark coverage for the V2 merge.
pkg/statistics/handle/globalstats/global_stats_test.go	Adjusts/extends integration tests for new merge modes, adds new test.
pkg/statistics/handle/globalstats/global_stats_internal_test.go	Splits Issue #24349 test into parts and adds a V2 expectation path.
pkg/statistics/handle/handletest/handle_test.go	Updates expected global TopN results under new behavior.
pkg/statistics/handle/globalstats/BUILD.bazel	Updates test sharding and adds a dependency for new tests.

pkg/statistics/handle/globalstats/global_stats_internal_test.go

pkg/statistics/handle/globalstats/topn_test.go

pkg/statistics/handle/globalstats/global_stats_internal_test.go

pkg/statistics/handle/globalstats/global_stats_test.go

Keep V1 as the default TopN merge strategy. Users can opt in to V2 by setting tidb_merge_partition_stats_concurrency = 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…und repeats V2 (MergePartTopN2GlobalTopNV2) merges TopN purely from TopN lists, avoiding V1's O(P^2) histogram lookups. However, V2 underestimates global TopN counts for values that appear in some partitions' TopN and at other partitions' histogram bucket upper bounds (the "spread value" scenario). Add MergePartTopNAndHistToGlobal which combines V2's TopN-only merge with extraction of significant Repeat counts from histogram bucket upper bounds during bucket collection. This catches the spread value scenario without V1's inflation problem, at zero extra passes. Key design: - Extract mergeHistBucketsToGlobalHist helper from MergePartitionHist2GlobalHist to share the ~200-line bucket merge core between the old and new paths - During histogram bucket collection, move each bucket's upper-bound Repeat into the TopN counter and zero it in the bucket - Skip recalculate-repeats in the shared helper (pass nil) since repeats are already handled during extraction - Wire into callers: mergeConcurrency==0 uses the new combined merge, otherwise falls through to the existing V1/concurrent flow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ti-chi-bot · 2026-02-12T23:22:46Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, wddevries for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/sessionctx/variable/OWNERS
pkg/statistics/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…d hybrid merge Add BenchmarkFullPipeline with sub-benchmarks for all three merge approaches (V1, Concurrent, Hybrid), each running the complete TopN + histogram merge pipeline. This enables direct comparison since the existing TopN-only benchmarks excluded the histogram merge step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mjonss · 2026-02-13T02:28:07Z

Added new benchmark for full merge of both TopN and Histograms, results:

goos: darwin
goarch: arm64
pkg: github.com/pingcap/tidb/pkg/statistics/handle/globalstats
cpu: Apple M3 Max
BenchmarkFullPipeline/Size100/V1-16         	      73	  16651097 ns/op	  11670843 B/op	  152195 allocs/op
BenchmarkFullPipeline/Size100/Concurrent-16 	      58	  18415642 ns/op	  11801191 B/op	  152248 allocs/op
BenchmarkFullPipeline/Size100/Hybrid-16     	     285	   4084806 ns/op	   2582601 B/op	   52579 allocs/op
BenchmarkFullPipeline/Size1000/V1-16        	       4	 305600271 ns/op	 133413962 B/op	 1784681 allocs/op
BenchmarkFullPipeline/Size1000/Concurrent-16           4	 309953864 ns/op	 134767476 B/op	 1784762 allocs/op
BenchmarkFullPipeline/Size1000/Hybrid-16              31	  36142910 ns/op	  25173858 B/op	  507080 allocs/op
BenchmarkFullPipeline/Size2000/V1-16                   2	 699503000 ns/op	 305494868 B/op	 4129758 allocs/op
BenchmarkFullPipeline/Size2000/Concurrent-16           2	 651071270 ns/op	 307606892 B/op	 4129821 allocs/op
BenchmarkFullPipeline/Size2000/Hybrid-16              14	  75254164 ns/op	  56250289 B/op	 1091534 allocs/op
BenchmarkFullPipeline/Size5000/V1-16                   1	1923521375 ns/op	 961367768 B/op	13133361 allocs/op
BenchmarkFullPipeline/Size5000/Concurrent-16           1	1672338125 ns/op	 978180880 B/op	13133435 allocs/op
BenchmarkFullPipeline/Size5000/Hybrid-16               5	 220651975 ns/op	 174129582 B/op	 2527110 allocs/op
BenchmarkFullPipeline/Size10000/V1-16                  1	4547595042 ns/op	1913918368 B/op	26263403 allocs/op
BenchmarkFullPipeline/Size10000/Concurrent-16          1	3645793875 ns/op	1922382112 B/op	26263508 allocs/op
BenchmarkFullPipeline/Size10000/Hybrid-16              3	 501890792 ns/op	 695332008 B/op	10041902 allocs/op

So both faster and using less memory and less allocations

…s load timeout The test was intermittently failing in CI because show stats_topn reads from the in-memory stats cache, which uses lazy loading with a default sync wait of only 100ms. Under CI load this timeout can expire before all column/index stats are loaded, producing incorrect results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix typo "since since both a and be is" → "since both a and b are" - Handle codec.EncodeKey errors with require.NoError instead of ignoring - Use session scope for tidb_merge_partition_stats_concurrency in tests instead of global scope to avoid potential cross-test interference - Merge testIssues24349Part1/Part2/Part2V2 back into single testIssues24349 since V1 and hybrid produce identical results - Improve comments with accurate merge explanation and annotated output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The V2 TopN-only merge function had no production callers — only tests and benchmarks. The production path uses MergePartTopNAndHistToGlobal (hybrid, concurrency=0) or MergePartTopN2GlobalTopN (V1, concurrency>=1). Remove the function, its dedicated tests, V2 sections from V1-vs-V2 comparison tests, and its benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…st helper Subtract extracted Repeat counts from totCount when moving histogram upper-bound Repeats into the TopN counter. Without this, totCount was over-estimated by the sum of extracted Repeats that ended up in globalTopN, causing slightly suboptimal histogram bucket splitting. Also remove unused findTopNMetaCountByEncoded test helper left behind after V2 test removal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace version-based naming (V1, V2, Hybrid) with descriptive terms: - "Separate" for the two-phase TopN-then-histogram merge flow - "Combined" for the single-pass TopN+histogram merge This avoids confusion since "V2" was already used for the concurrent merge path and for analyze version 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pdate Replace the unreliable explain + tidb_stats_load_sync_wait approach with dom.StatsHandle().Update() to force a synchronous full stats cache refresh from storage after analyze. This is the standard pattern used by other stats tests and avoids CI timeouts with large datasets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mjonss · 2026-02-15T22:51:09Z

Investigating if #66289 is the better solution, i.e. build the TopN and Histograms from samples gathered from the partitions, using the same input as the TopN and Histograms for the partitions itself. That should be both more efficient as well as more accurate.

ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Feb 11, 2026

mjonss commented Feb 11, 2026

View reviewed changes

pkg/statistics/handle/globalstats/global_stats_test.go Outdated Show resolved Hide resolved

mjonss added 2 commits February 12, 2026 00:28

Updated test

264772d

bazel_prepare

3f13f1f

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 12, 2026

mjonss marked this pull request as ready for review February 12, 2026 00:31

ti-chi-bot bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-tests-checked labels Feb 12, 2026

mjonss and others added 7 commits February 12, 2026 01:40

Updated tests

9fd07a6

Formatting

8d866bb

statistics: add test verifying V2 TopN merge preserves total count

e6d677f

Verify that MergePartTopN2GlobalTopNV2 never loses or creates counts: globalTopN.TotalCount() + sum(leftTopN counts) == sum(input TopN counts). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

bazel prepare

b55488e

ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 12, 2026

mjonss requested a review from Copilot February 12, 2026 16:41

Copilot started reviewing on behalf of mjonss February 12, 2026 16:42 View session

Copilot AI reviewed Feb 12, 2026

View reviewed changes

mjonss and others added 2 commits February 12, 2026 17:42

statistics: revert default merge concurrency from 0 back to 1

7037656

Keep V1 as the default TopN merge strategy. Users can opt in to V2 by setting tidb_merge_partition_stats_concurrency = 0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mjonss and others added 3 commits February 13, 2026 00:29

Updated test

6742988

Formatting

4da60f1

mjonss changed the title ~~statistics: add TopN-only merge (V2) for global stats~~ atistics: add hybrid TopN+histogram merge (V2) for global stats Feb 13, 2026

mjonss changed the title ~~atistics: add hybrid TopN+histogram merge (V2) for global stats~~ statistics: add hybrid TopN+histogram merge (V2) for global stats Feb 13, 2026

mjonss and others added 8 commits February 13, 2026 09:09

Formatting

d9b7e4e

bazel_prepare

c168f66

mjonss changed the title ~~statistics: add hybrid TopN+histogram merge (V2) for global stats~~ statistics: add combined TopN+histogram merge for global stats Feb 13, 2026

mjonss marked this pull request as draft February 15, 2026 22:49

ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 15, 2026

Conversation

mjonss commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What changed and how does it work?

Check List

Release note

Uh oh!

ti-chi-bot bot commented Feb 11, 2026

Uh oh!

tiprow bot commented Feb 11, 2026

Uh oh!

mjonss commented Feb 11, 2026

Uh oh!

Uh oh!

codecov bot commented Feb 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ti-chi-bot bot commented Feb 12, 2026

Uh oh!

mjonss commented Feb 13, 2026

Uh oh!

mjonss commented Feb 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

mjonss commented Feb 11, 2026 •

edited

Loading

codecov bot commented Feb 11, 2026 •

edited

Loading