pd: report hot read cpu in heartbeat by lhy1024 · Pull Request #10178 · tikv/pd

lhy1024 · 2026-01-21T09:33:24Z

What problem does this PR solve?

Issue Number: Close #5718

What is changed and how does it work?

Simple description

This PR introduces CPU as a new dimension for the hot scheduler; it only serves the hot read scheduler.

From store heartbeat cpu_usages, sum unified‑read and grpc‑server thread CPU by prefix. Read CPU load is computed as unifiedReadCPU + grpcCPU * readQuery/totalQuery (or just unifiedReadCPU if queries are missing). This feeds the read CPUDim in store loads and hot‑peer stats. CPU uses a longer rolling median window; hotness checks use rolling average for CPU and last‑interval average for other dims. Read priorities become cpu→byte when supported, otherwise fall back to query→byte (or byte→key if query isn’t supported).
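The calculation described above can be sketched in Go as follows. This is a minimal illustration, not the code in `pkg/statistics/cpu.go`: the `cpuRecord` type, the helper names `sumByPrefix` and `readCPULoad`, the exact prefix strings, and the choice of `totalQuery == 0` as the "queries are missing" case are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// cpuRecord is a hypothetical stand-in for one entry of the store
// heartbeat's cpu_usages list: a thread name and its CPU usage.
type cpuRecord struct {
	Thread string
	Usage  uint64
}

// sumByPrefix adds up the usage of every thread whose name starts with prefix.
func sumByPrefix(records []cpuRecord, prefix string) uint64 {
	var total uint64
	for _, r := range records {
		if strings.HasPrefix(r.Thread, prefix) {
			total += r.Usage
		}
	}
	return total
}

// readCPULoad returns unified-read CPU plus the share of gRPC CPU
// attributed to reads. When no queries were reported (assumed here to
// mean totalQuery == 0), it falls back to unified-read CPU alone.
func readCPULoad(unifiedReadCPU, grpcCPU float64, readQuery, totalQuery uint64) float64 {
	if totalQuery == 0 {
		return unifiedReadCPU
	}
	return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)
}

func main() {
	records := []cpuRecord{
		{"unified-read-0", 40},
		{"unified-read-1", 35},
		{"grpc-server-0", 20},
		{"grpc-server-1", 10},
		{"raftstore-0", 50}, // ignored: matches neither prefix
	}
	unified := float64(sumByPrefix(records, "unified-read")) // 75
	grpc := float64(sumByPrefix(records, "grpc-server"))     // 30

	// 300 of 400 queries are reads, so 75% of gRPC CPU is counted as read CPU.
	fmt.Println(readCPULoad(unified, grpc, 300, 400)) // 97.5
	// No query stats: only unified-read CPU is used.
	fmt.Println(readCPULoad(unified, grpc, 0, 0)) // 75
}
```

Apportioning by query ratio keeps write-heavy stores from having all of their gRPC CPU counted as read load.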

Check List

Tests

Unit test
Integration test

Release note

None.

Summary by CodeRabbit

New Features
- CPU-based hot-region scheduling: configurable min-hot-cpu-rate and cpu-rate-rank-step-ratio, version-aware enablement, and CPU prioritized/compatible fallbacks.
- CPU load tracking: store- and region-level CPU metrics added to hot-region stats, history entries, and API payloads.
Tests
- Added unit and integration tests covering CPU calculations, config adjustments, and scheduler behavior.
Chores
- Bumped kvproto module version used for builds and tests.
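The priority fallback mentioned in the description (cpu→byte when supported, otherwise query→byte, otherwise byte→key) could look roughly like the sketch below. The function name `readPriorities`, the two boolean flags, and the string labels are hypothetical and chosen only for illustration; the actual logic lives in the hot-region scheduler config.

```go
package main

import "fmt"

// readPriorities picks the ordered read dimensions for the hot read
// scheduler based on what the cluster supports.
func readPriorities(cpuSupported, querySupported bool) []string {
	switch {
	case cpuSupported:
		return []string{"cpu", "byte"}
	case querySupported:
		return []string{"query", "byte"}
	default:
		return []string{"byte", "key"}
	}
}

func main() {
	fmt.Println(readPriorities(true, true))   // [cpu byte]
	fmt.Println(readPriorities(false, true))  // [query byte]
	fmt.Println(readPriorities(false, false)) // [byte key]
}
```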

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot · 2026-01-21T09:33:27Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-01-21T09:33:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, binshi-bing for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS
pkg/schedule/schedulers/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-01-21T09:33:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


📝 Walkthrough

Walkthrough

Adds a CPU-based hot-region dimension: collect store/region CPU stats, propagate them through heartbeats and statistics, extend hot-region config/solver/metrics with CPU support and version gating, update storage/CLI/API outputs, add tests, and bump kvproto dependency versions.

Changes

Cohort / File(s)	Summary
Module dependency updates `client/go.mod`, `go.mod`, `tests/integrations/go.mod`, `tools/go.mod`	Bumped `github.com/pingcap/kvproto` version across modules.
Core region model `pkg/core/factory.go`, `pkg/core/region.go`	Added `CPUStatsFactory` and `cpuStats` on `RegionInfo`; populated from heartbeat, deep-cloned, and included in region loads.
CPU statistics implementation & tests `pkg/statistics/cpu.go`, `pkg/statistics/cpu_test.go`	New CPU helpers: `StoreReadCPUUsage`, `RegionReadCPUUsage`, `StoreGRPCCPUUsage` plus unit tests covering proportional gRPC/unified-read logic.
Statistics wiring & rolling windows `pkg/statistics/collector.go`, `pkg/statistics/store.go`, `pkg/statistics/store_collection.go`, `pkg/statistics/hot_peer.go`, `pkg/statistics/hot_peer_cache.go`, `pkg/statistics/hot_regions_stat.go`	Collect store read-CPU, add CPU moving averages/rolling window, add CPU dimension to hot-peer logic and hot stats.
Hot-peer cache & tests `pkg/statistics/hot_peer_cache_test.go`, `pkg/statistics/hot_cache_test.go`, `pkg/statistics/hot_peer.go`, `pkg/statistics/hot_regions_stat.go`	Tests added/adjusted for CPU rolling behavior and loads; hot-peer GetLoads/GetLoad hardened and extended for CPU dim.
Kind/constants `pkg/statistics/utils/kind.go`, `pkg/statistics/utils/constant.go`, `pkg/statistics/utils/kind_test.go`	Added `CPUPriority`/`CPUDim`, `RegionReadCPU`/`StoreReadCPU`, mapping and threshold entry; tests updated.
Hot-region scheduler config & validation `pkg/schedule/schedulers/hot_region_config.go`, `pkg/schedule/schedulers/hot_region_config_test.go`, `pkg/schedule/schedulers/hot_region.go`	Added CPU priority support, new config fields (`MinHotCPURate`, `CPURateRankStepRatio`), cpuSupport checks, adjusted `adjustPrioritiesConfig` signature, and CPU fallback tests.
Hot-region solver, metrics & tests `pkg/schedule/schedulers/hot_region_solver.go`, `pkg/schedule/schedulers/hot_region_solver_test.go`, `pkg/schedule/schedulers/metrics.go`	Integrated `CPUDim` into dim-to-step/min-rate/ranking; added read-skip-cpu counter and updated tests to include CPU dimension.
Store/cluster heartbeat wiring `server/cluster/cluster.go`, `pkg/mcs/scheduling/server/cluster.go`, `pkg/schedule/coordinator.go`	Extracted store gRPC/unified-read CPU and query metrics; computed `RegionReadCPU` per-region in heartbeat processing and appended to region loads.
Handler, storage, CLI & tests `pkg/schedule/handler/handler.go`, `pkg/storage/hot_region_storage.go`, `server/handler.go`, `tools/pd-ctl/...`, `tests/server/api/scheduler_test.go`	Added `CPUReadStats` and `FlowCPU`/`flow_cpu` to history entries, hidden CLI config entry, and updated tests/expected config keys.
Store hot-peers aggregation `pkg/statistics/store_load.go`, `pkg/statistics/store_hot_peers_infos.go`, `pkg/statistics/store_collection_test.go`	Propagated per-peer CPU into `HotPeerStatShow` and aggregate `HotPeersStat` (`StoreCPURate`, `TotalCPURate`), prediction wiring and tests adjusted.
Version gating & tests `pkg/versioninfo/versioninfo.go`, `pkg/versioninfo/versioninfo_test.go`	Added `IsHotScheduleWithCPUSupported` with min versions (8.5.6, 9.0.0-beta.1) and tests across versions.
Misc tests/adjustments `server/cluster/cluster_test.go`, `pkg/schedule/handler/handler.go`, `pkg/schedule/coordinator.go`	Small test/data adjustments to include the extra CPU load dimension and to populate CPU fields in handlers.

Sequence Diagram(s)

sequenceDiagram
participant TiKV as TiKV (store)
participant PD as PD/statistics collector
participant StoreHandler as HeartbeatHandler
participant Scheduler as HotRegionScheduler

TiKV->>PD: send store heartbeat (peers, peer stats, cpu stats)
PD->>PD: aggregate store CPU (unified-read, grpc threads)
PD->>StoreHandler: compute per-region RegionReadCPU (unified + proportional gRPC)
StoreHandler->>Scheduler: publish region loads (bytes, keys, queries, cpu)
Scheduler->>Scheduler: evaluate hot regions (use cpuSupport, thresholds, priorities)
Scheduler->>PD: emit metrics/decisions (including CPU rates)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

core: region heartbeat with bucket meta #10231 — Modifies pkg/core/region.go to add new RegionInfo fields and updates heartbeat/clone behavior (similar pattern to adding cpuStats).

Suggested reviewers

okJiang
rleungx

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 37.21% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection	⚠️ Warning	❌ Merge conflicts detected (60 files): ⚔️ `client/go.mod` (content) ⚔️ `client/go.sum` (content) ⚔️ `go.mod` (content) ⚔️ `go.sum` (content) ⚔️ `metrics/grafana/pd.json` (content) ⚔️ `pkg/core/factory.go` (content) ⚔️ `pkg/core/region.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/apis/v1/api.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/keyspace_manager.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/manager.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/manager_test.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/resource_group.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/resource_group_test.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/server.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/service_limit.go` (content) ⚔️ `pkg/mcs/resourcemanager/server/token_buckets.go` (content) ⚔️ `pkg/mcs/scheduling/server/cluster.go` (content) ⚔️ `pkg/schedule/coordinator.go` (content) ⚔️ `pkg/schedule/handler/handler.go` (content) ⚔️ `pkg/schedule/schedulers/hot_region.go` (content) ⚔️ `pkg/schedule/schedulers/hot_region_config.go` (content) ⚔️ `pkg/schedule/schedulers/hot_region_solver.go` (content) ⚔️ `pkg/schedule/schedulers/hot_region_solver_test.go` (content) ⚔️ `pkg/schedule/schedulers/hot_region_test.go` (content) ⚔️ `pkg/schedule/schedulers/metrics.go` (content) ⚔️ `pkg/statistics/collector.go` (content) ⚔️ `pkg/statistics/hot_cache_test.go` (content) ⚔️ `pkg/statistics/hot_peer.go` (content) ⚔️ `pkg/statistics/hot_peer_cache.go` (content) ⚔️ `pkg/statistics/hot_peer_cache_test.go` (content) ⚔️ `pkg/statistics/hot_regions_stat.go` (content) ⚔️ `pkg/statistics/store.go` (content) ⚔️ `pkg/statistics/store_collection.go` (content) ⚔️ `pkg/statistics/store_collection_test.go` (content) ⚔️ `pkg/statistics/store_hot_peers_infos.go` (content) ⚔️ `pkg/statistics/store_load.go` (content) ⚔️ `pkg/statistics/utils/constant.go` (content) ⚔️ `pkg/statistics/utils/kind.go` (content) ⚔️ `pkg/statistics/utils/kind_test.go` 
(content) ⚔️ `pkg/storage/hot_region_storage.go` (content) ⚔️ `pkg/tso/allocator.go` (content) ⚔️ `pkg/tso/metrics.go` (content) ⚔️ `pkg/tso/tso.go` (content) ⚔️ `pkg/versioninfo/versioninfo.go` (content) ⚔️ `server/cluster/cluster.go` (content) ⚔️ `server/cluster/cluster_test.go` (content) ⚔️ `server/grpc_service.go` (content) ⚔️ `server/handler.go` (content) ⚔️ `server/metrics.go` (content) ⚔️ `server/server.go` (content) ⚔️ `tests/cluster.go` (content) ⚔️ `tests/integrations/go.mod` (content) ⚔️ `tests/integrations/go.sum` (content) ⚔️ `tests/server/api/rule_test.go` (content) ⚔️ `tests/server/api/scheduler_test.go` (content) ⚔️ `tools/go.mod` (content) ⚔️ `tools/go.sum` (content) ⚔️ `tools/pd-ctl/pdctl/command/scheduler_command.go` (content) ⚔️ `tools/pd-ctl/tests/hot/hot_test.go` (content) ⚔️ `tools/pd-ctl/tests/scheduler/scheduler_test.go` (content) These conflicts must be resolved before merging into `master`.	Resolve conflicts locally and push changes to this branch.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'pd: report hot read cpu in heartbeat' clearly summarizes the main change: introducing CPU dimension reporting for hot read regions in heartbeat messages.
Description check	✅ Passed	The PR description includes issue number (Close `#5718`), explains what changed and how it works with technical details, and indicates tests were added. Template sections are mostly complete.
Linked Issues check	✅ Passed	The PR implements the objective from `#5718` to introduce CPU as a dimension for hot region scheduler. Code changes add CPU metrics collection, calculation, and scheduling logic to address the issue's goal of detecting hotspots by CPU consumption.
Out of Scope Changes check	✅ Passed	All changes are scoped to implementing CPU scheduling support for hot regions: dependency updates, CPU calculation utilities, hot region scheduling logic, metrics/storage, and related tests. No unrelated changes detected.


No actionable comments were generated in the recent review. 🎉

Signed-off-by: lhy1024 <admin@liudos.us>

codecov · 2026-02-10T12:21:18Z

Codecov Report

❌ Patch coverage is 89.28571% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.79%. Comparing base (a0758a7) to head (550ffd8).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10178      +/-   ##
==========================================
+ Coverage   78.76%   78.79%   +0.03%     
==========================================
  Files         522      523       +1     
  Lines       70369    70527     +158     
==========================================
+ Hits        55424    55572     +148     
- Misses      10943    10957      +14     
+ Partials     4002     3998       -4

Flag	Coverage Δ
unittests	`78.79% <89.28%> (+0.03%)`	⬆️

Signed-off-by: lhy1024 <admin@liudos.us>

okJiang · 2026-02-12T08:06:03Z

Please link an issue and add a description.

rleungx · 2026-02-14T07:04:46Z

pkg/mcs/scheduling/server/cluster.go

+	storeReadQuery := core.GetReadQueryNum(stats.QueryStats)
+	storeWriteQuery := core.GetWriteQueryNum(stats.QueryStats)
+	storeTotalQuery := storeReadQuery + storeWriteQuery
+	storeGRPCCPU := statistics.StoreGRPCCPUUsage(stats.GetCpuUsages())


Here we intentionally use gRPC CPU only. Unified-read CPU is already in peerStat.CpuStats.UnifiedRead, so using store read CPU here would double count.
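To illustrate the double-counting concern, here is a hedged sketch of a per-region read CPU calculation. It assumes the region's share of store gRPC CPU is `regionReadQuery / storeTotalQuery`; that ratio, the function name `regionReadCPU`, and the parameter names are assumptions for this example rather than the PR's exact code.

```go
package main

import "fmt"

// regionReadCPU combines the region's own unified-read CPU (already
// reported per peer) with a proportional share of the store's gRPC CPU.
// Store-level unified-read CPU is deliberately not passed in, because
// adding it would count the unified-read pool twice.
func regionReadCPU(regionUnifiedRead, storeGRPCCPU float64, regionReadQuery, storeTotalQuery uint64) float64 {
	if storeTotalQuery == 0 {
		return regionUnifiedRead
	}
	return regionUnifiedRead + storeGRPCCPU*float64(regionReadQuery)/float64(storeTotalQuery)
}

func main() {
	// The region used 0.5 of unified-read CPU and issued 100 of the
	// store's 400 queries, so it is charged a quarter of gRPC CPU.
	fmt.Println(regionReadCPU(0.5, 1.0, 100, 400)) // 0.75
}
```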

rleungx · 2026-02-14T07:10:04Z

pkg/statistics/cpu.go

@@ -0,0 +1,74 @@
+// Copyright 2025 TiKV Project Authors.


Suggested change

// Copyright 2025 TiKV Project Authors.

// Copyright 2026 TiKV Project Authors.

rleungx · 2026-02-14T07:12:00Z

pkg/statistics/cpu.go

+		return unifiedReadCPU
+	}
+	grpcCPU := float64(StoreGRPCCPUUsage(cpuUsages))
+	return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)


Is it accurate?

This is an approximation: unified-read CPU is read-only, while grpc-server CPU is shared by read/write requests, so we apportion gRPC CPU by readQuery/totalQuery.

rleungx · 2026-02-14T07:13:00Z

pkg/statistics/hot_peer_cache.go

 	rollingWindowsSize = 5
+	// It is used to moving average CPU usage,
+	// and the window size is larger than other dimensions to make the CPU usage more stable.
+	cpuRollingWindowsSize = 9


A larger window makes the CPU value more stable.
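The effect of the window size can be seen with a small rolling-median sketch. The `medianWindow` type below is a hypothetical illustration and not PD's own moving-average implementation; only the sizes 5 and 9 come from the snippet above.

```go
package main

import (
	"fmt"
	"sort"
)

// medianWindow keeps the most recent `size` samples and reports their median.
type medianWindow struct {
	size    int
	samples []float64
}

func newMedianWindow(size int) *medianWindow {
	return &medianWindow{size: size}
}

// Add appends a sample and drops the oldest one once the window is full.
func (w *medianWindow) Add(v float64) {
	w.samples = append(w.samples, v)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

// Median returns the median of the current samples, or 0 if empty.
func (w *medianWindow) Median() float64 {
	n := len(w.samples)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), w.samples...)
	sort.Float64s(sorted)
	if n%2 == 1 {
		return sorted[n/2]
	}
	return (sorted[n/2-1] + sorted[n/2]) / 2
}

func main() {
	small := newMedianWindow(5) // rollingWindowsSize
	large := newMedianWindow(9) // cpuRollingWindowsSize

	// Six steady samples followed by a short burst of three high samples.
	for _, v := range []float64{10, 10, 10, 10, 10, 10, 100, 100, 100} {
		small.Add(v)
		large.Add(v)
	}
	fmt.Println(small.Median(), large.Median()) // 100 10
}
```

With the same input, the 5-sample window already reports the burst while the 9-sample window still reports the steady value, which is the stability trade-off the comment refers to.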

rleungx · 2026-02-14T07:13:44Z

pkg/versioninfo/versioninfo.go

+)
+
+// IsHotScheduleWithCPUSupported returns whether TiKV reports CPU info for hot scheduling.
+func IsHotScheduleWithCPUSupported(clusterVersion *semver.Version) bool {


What if we want to cherry-pick this to release 8.5?

8.5.6 or 8.5.7?

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot · 2026-02-14T07:46:20Z

@lhy1024: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-next-gen-3	`cbcda2a`	link	true	`/test pull-unit-test-next-gen-3`

use hot read cpu

77f90dd

Signed-off-by: lhy1024 <admin@liudos.us>

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 21, 2026

lhy1024 added 3 commits January 21, 2026 22:10

fix version

5790ab6

Signed-off-by: lhy1024 <admin@liudos.us>

adjust sample windows

531ed27

Signed-off-by: lhy1024 <admin@liudos.us>

fix statistics

5b42858

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 force-pushed the hot-read-cpu branch from b199e08 to 5b42858 Compare February 10, 2026 09:44

lhy1024 added 3 commits February 10, 2026 18:16

add comments and tests

a61cfc8

Signed-off-by: lhy1024 <admin@liudos.us>

Merge branch 'master' of github.com:tikv/pd into hot-read-cpu

7e04125

Signed-off-by: lhy1024 <admin@liudos.us>

fix lint

35223a2

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 marked this pull request as ready for review February 10, 2026 10:36

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2026

fix tests

d9310e1

Signed-off-by: lhy1024 <admin@liudos.us>

update kvproto

550ffd8

Signed-off-by: lhy1024 <admin@liudos.us>

lhy1024 force-pushed the hot-read-cpu branch from d0d3233 to 550ffd8 Compare February 11, 2026 12:41

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Feb 14, 2026

lhy1024 requested review from okJiang and rleungx February 14, 2026 02:24

rleungx reviewed Feb 14, 2026

View reviewed changes

address comments

cbcda2a

Signed-off-by: lhy1024 <admin@liudos.us>
