pd: report hot read cpu in heartbeat #10178

Open
lhy1024 wants to merge 10 commits into tikv:master from lhy1024:hot-read-cpu

Conversation

Contributor

@lhy1024 lhy1024 commented Jan 21, 2026

What problem does this PR solve?

Issue Number: Close #5718

What is changed and how does it work?

Simple description

This PR introduces CPU as a new dimension for the hot-region scheduler. It is used only by the hot read scheduler.

From the store heartbeat's cpu_usages, PD sums the unified-read and grpc-server thread CPU by prefix. Read CPU load is computed as unifiedReadCPU + grpcCPU * readQuery/totalQuery (or just unifiedReadCPU if queries are missing). This feeds the read CPUDim in store loads and hot-peer stats. CPU uses a longer rolling median window; hotness checks use the rolling average for CPU and the last-interval average for other dims. Read priorities become cpu→byte when CPU is supported; otherwise they fall back to query→byte (or byte→key if query isn't supported).
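
For readers who want to see the read CPU formula end to end, here is a minimal, self-contained sketch. It is not the PR's code: the `threadCPU` type, the helper names, and the exact prefix strings `unified-read` and `grpc-server` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// threadCPU is a hypothetical stand-in for one entry of the store
// heartbeat's cpu_usages list: a thread name and its CPU usage.
type threadCPU struct {
	Name  string
	Usage uint64
}

// Assumed thread-name prefixes; the real values are defined in the PR.
const (
	unifiedReadPrefix = "unified-read"
	grpcServerPrefix  = "grpc-server"
)

// sumByPrefix adds up the usage of every thread whose name starts with prefix.
func sumByPrefix(usages []threadCPU, prefix string) uint64 {
	var total uint64
	for _, u := range usages {
		if strings.HasPrefix(u.Name, prefix) {
			total += u.Usage
		}
	}
	return total
}

// storeReadCPU returns unified-read CPU plus the read share of gRPC CPU.
// When no queries are reported it falls back to unified-read CPU alone.
func storeReadCPU(usages []threadCPU, readQuery, totalQuery uint64) float64 {
	unified := float64(sumByPrefix(usages, unifiedReadPrefix))
	if totalQuery == 0 {
		return unified
	}
	grpc := float64(sumByPrefix(usages, grpcServerPrefix))
	return unified + grpc*float64(readQuery)/float64(totalQuery)
}

func main() {
	usages := []threadCPU{
		{"unified-read-0", 40},
		{"unified-read-1", 20},
		{"grpc-server-0", 30},
		{"grpc-server-1", 10},
		{"raftstore-0", 50}, // ignored: matches neither prefix
	}
	fmt.Println(storeReadCPU(usages, 300, 400)) // 60 + 40*300/400 = 90
	fmt.Println(storeReadCPU(usages, 0, 0))     // queries missing: 60
}
```

In this sketch a store with no reported queries contributes only its unified-read CPU, which matches the fallback described above.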

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Summary by CodeRabbit

  • New Features

    • CPU-based hot-region scheduling: configurable min-hot-cpu-rate and cpu-rate-rank-step-ratio, version-aware enablement, and CPU prioritized/compatible fallbacks.
    • CPU load tracking: store- and region-level CPU metrics added to hot-region stats, history entries, and API payloads.
  • Tests

    • Added unit and integration tests covering CPU calculations, config adjustments, and scheduler behavior.
  • Chores

    • Bumped kvproto module version used for builds and tests.
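
The read-priority fallback mentioned in the PR description (cpu→byte, then query→byte, then byte→key) can be sketched as a small selection function. The function name, the boolean flags, and the string labels below are hypothetical; the PR's real configuration code may be structured differently.

```go
package main

import "fmt"

// readPriorities picks the two dimensions the hot read scheduler ranks by,
// in order. Both flags are assumed to come from cluster version checks.
func readPriorities(cpuSupported, querySupported bool) []string {
	switch {
	case cpuSupported:
		return []string{"cpu", "byte"}
	case querySupported:
		return []string{"query", "byte"}
	default:
		return []string{"byte", "key"}
	}
}

func main() {
	fmt.Println(readPriorities(true, true))   // [cpu byte]
	fmt.Println(readPriorities(false, true))  // [query byte]
	fmt.Println(readPriorities(false, false)) // [byte key]
}
```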

Signed-off-by: lhy1024 <admin@liudos.us>
Contributor

ti-chi-bot bot commented Jan 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note-none, do-not-merge/work-in-progress, and dco-signoff: yes labels Jan 21, 2026
Contributor

ti-chi-bot bot commented Jan 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft, binshi-bing for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai bot commented Jan 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a CPU-based hot-region dimension: collect store/region CPU stats, propagate them through heartbeats and statistics, extend hot-region config/solver/metrics with CPU support and version gating, update storage/CLI/API outputs, add tests, and bump kvproto dependency versions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Module dependency updates: client/go.mod, go.mod, tests/integrations/go.mod, tools/go.mod | Bumped github.com/pingcap/kvproto version across modules. |
| Core region model: pkg/core/factory.go, pkg/core/region.go | Added CPUStatsFactory and cpuStats on RegionInfo; populated from heartbeat, deep-cloned, and included in region loads. |
| CPU statistics implementation & tests: pkg/statistics/cpu.go, pkg/statistics/cpu_test.go | New CPU helpers: StoreReadCPUUsage, RegionReadCPUUsage, StoreGRPCCPUUsage plus unit tests covering proportional gRPC/unified-read logic. |
| Statistics wiring & rolling windows: pkg/statistics/collector.go, pkg/statistics/store.go, pkg/statistics/store_collection.go, pkg/statistics/hot_peer.go, pkg/statistics/hot_peer_cache.go, pkg/statistics/hot_regions_stat.go | Collect store read-CPU, add CPU moving averages/rolling window, add CPU dimension to hot-peer logic and hot stats. |
| Hot-peer cache & tests: pkg/statistics/hot_peer_cache_test.go, pkg/statistics/hot_cache_test.go, pkg/statistics/hot_peer.go, pkg/statistics/hot_regions_stat.go | Tests added/adjusted for CPU rolling behavior and loads; hot-peer GetLoads/GetLoad hardened and extended for CPU dim. |
| Kind/constants: pkg/statistics/utils/kind.go, pkg/statistics/utils/constant.go, pkg/statistics/utils/kind_test.go | Added CPUPriority/CPUDim, RegionReadCPU/StoreReadCPU, mapping and threshold entry; tests updated. |
| Hot-region scheduler config & validation: pkg/schedule/schedulers/hot_region_config.go, pkg/schedule/schedulers/hot_region_config_test.go, pkg/schedule/schedulers/hot_region.go | Added CPU priority support, new config fields (MinHotCPURate, CPURateRankStepRatio), cpuSupport checks, adjusted adjustPrioritiesConfig signature, and CPU fallback tests. |
| Hot-region solver, metrics & tests: pkg/schedule/schedulers/hot_region_solver.go, pkg/schedule/schedulers/hot_region_solver_test.go, pkg/schedule/schedulers/metrics.go | Integrated CPUDim into dim-to-step/min-rate/ranking; added read-skip-cpu counter and updated tests to include CPU dimension. |
| Store/cluster heartbeat wiring: server/cluster/cluster.go, pkg/mcs/scheduling/server/cluster.go, pkg/schedule/coordinator.go | Extracted store gRPC/unified-read CPU and query metrics; computed RegionReadCPU per-region in heartbeat processing and appended to region loads. |
| Handler, storage, CLI & tests: pkg/schedule/handler/handler.go, pkg/storage/hot_region_storage.go, server/handler.go, tools/pd-ctl/..., tests/server/api/scheduler_test.go | Added CPUReadStats and FlowCPU/flow_cpu to history entries, hidden CLI config entry, and updated tests/expected config keys. |
| Store hot-peers aggregation: pkg/statistics/store_load.go, pkg/statistics/store_hot_peers_infos.go, pkg/statistics/store_collection_test.go | Propagated per-peer CPU into HotPeerStatShow and aggregate HotPeersStat (StoreCPURate, TotalCPURate), prediction wiring and tests adjusted. |
| Version gating & tests: pkg/versioninfo/versioninfo.go, pkg/versioninfo/versioninfo_test.go | Added IsHotScheduleWithCPUSupported with min versions (8.5.6, 9.0.0-beta.1) and tests across versions. |
| Misc tests/adjustments: server/cluster/cluster_test.go, pkg/schedule/handler/handler.go, pkg/schedule/coordinator.go | Small test/data adjustments to include the extra CPU load dimension and to populate CPU fields in handlers. |

Sequence Diagram(s)

sequenceDiagram
participant TiKV as TiKV (store)
participant PD as PD/statistics collector
participant StoreHandler as HeartbeatHandler
participant Scheduler as HotRegionScheduler

TiKV->>PD: send store heartbeat (peers, peer stats, cpu stats)
PD->>PD: aggregate store CPU (unified-read, grpc threads)
PD->>StoreHandler: compute per-region RegionReadCPU (unified + proportional gRPC)
StoreHandler->>Scheduler: publish region loads (bytes, keys, queries, cpu)
Scheduler->>Scheduler: evaluate hot regions (use cpuSupport, thresholds, priorities)
Scheduler->>PD: emit metrics/decisions (including CPU rates)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • okJiang
  • rleungx

Poem

🐇 I hopped through heartbeats and threads so deep,
Counted CPU whispers while others slept in sleep.
I added a new dim, a small rhythmic tune,
Tests patted my head and the scheduler hummed soon.

🚥 Pre-merge checks | ✅ 4 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.21% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (60 files):

⚔️ client/go.mod (content)
⚔️ client/go.sum (content)
⚔️ go.mod (content)
⚔️ go.sum (content)
⚔️ metrics/grafana/pd.json (content)
⚔️ pkg/core/factory.go (content)
⚔️ pkg/core/region.go (content)
⚔️ pkg/mcs/resourcemanager/server/apis/v1/api.go (content)
⚔️ pkg/mcs/resourcemanager/server/keyspace_manager.go (content)
⚔️ pkg/mcs/resourcemanager/server/manager.go (content)
⚔️ pkg/mcs/resourcemanager/server/manager_test.go (content)
⚔️ pkg/mcs/resourcemanager/server/resource_group.go (content)
⚔️ pkg/mcs/resourcemanager/server/resource_group_test.go (content)
⚔️ pkg/mcs/resourcemanager/server/server.go (content)
⚔️ pkg/mcs/resourcemanager/server/service_limit.go (content)
⚔️ pkg/mcs/resourcemanager/server/token_buckets.go (content)
⚔️ pkg/mcs/scheduling/server/cluster.go (content)
⚔️ pkg/schedule/coordinator.go (content)
⚔️ pkg/schedule/handler/handler.go (content)
⚔️ pkg/schedule/schedulers/hot_region.go (content)
⚔️ pkg/schedule/schedulers/hot_region_config.go (content)
⚔️ pkg/schedule/schedulers/hot_region_solver.go (content)
⚔️ pkg/schedule/schedulers/hot_region_solver_test.go (content)
⚔️ pkg/schedule/schedulers/hot_region_test.go (content)
⚔️ pkg/schedule/schedulers/metrics.go (content)
⚔️ pkg/statistics/collector.go (content)
⚔️ pkg/statistics/hot_cache_test.go (content)
⚔️ pkg/statistics/hot_peer.go (content)
⚔️ pkg/statistics/hot_peer_cache.go (content)
⚔️ pkg/statistics/hot_peer_cache_test.go (content)
⚔️ pkg/statistics/hot_regions_stat.go (content)
⚔️ pkg/statistics/store.go (content)
⚔️ pkg/statistics/store_collection.go (content)
⚔️ pkg/statistics/store_collection_test.go (content)
⚔️ pkg/statistics/store_hot_peers_infos.go (content)
⚔️ pkg/statistics/store_load.go (content)
⚔️ pkg/statistics/utils/constant.go (content)
⚔️ pkg/statistics/utils/kind.go (content)
⚔️ pkg/statistics/utils/kind_test.go (content)
⚔️ pkg/storage/hot_region_storage.go (content)
⚔️ pkg/tso/allocator.go (content)
⚔️ pkg/tso/metrics.go (content)
⚔️ pkg/tso/tso.go (content)
⚔️ pkg/versioninfo/versioninfo.go (content)
⚔️ server/cluster/cluster.go (content)
⚔️ server/cluster/cluster_test.go (content)
⚔️ server/grpc_service.go (content)
⚔️ server/handler.go (content)
⚔️ server/metrics.go (content)
⚔️ server/server.go (content)
⚔️ tests/cluster.go (content)
⚔️ tests/integrations/go.mod (content)
⚔️ tests/integrations/go.sum (content)
⚔️ tests/server/api/rule_test.go (content)
⚔️ tests/server/api/scheduler_test.go (content)
⚔️ tools/go.mod (content)
⚔️ tools/go.sum (content)
⚔️ tools/pd-ctl/pdctl/command/scheduler_command.go (content)
⚔️ tools/pd-ctl/tests/hot/hot_test.go (content)
⚔️ tools/pd-ctl/tests/scheduler/scheduler_test.go (content)

These conflicts must be resolved before merging into master.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'pd: report hot read cpu in heartbeat' clearly summarizes the main change: introducing CPU dimension reporting for hot read regions in heartbeat messages.
Description check ✅ Passed The PR description includes issue number (Close #5718), explains what changed and how it works with technical details, and indicates tests were added. Template sections are mostly complete.
Linked Issues check ✅ Passed The PR implements the objective from #5718 to introduce CPU as a dimension for hot region scheduler. Code changes add CPU metrics collection, calculation, and scheduling logic to address the issue's goal of detecting hotspots by CPU consumption.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing CPU scheduling support for hot regions: dependency updates, CPU calculation utilities, hot region scheduling logic, metrics/storage, and related tests. No unrelated changes detected.

No actionable comments were generated in the recent review. 🎉


@ti-chi-bot ti-chi-bot bot added the size/XXL label Jan 21, 2026
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
Signed-off-by: lhy1024 <admin@liudos.us>
@lhy1024 lhy1024 marked this pull request as ready for review February 10, 2026 10:36
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress label Feb 10, 2026
Signed-off-by: lhy1024 <admin@liudos.us>

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 89.28571% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.79%. Comparing base (a0758a7) to head (550ffd8).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10178      +/-   ##
==========================================
+ Coverage   78.76%   78.79%   +0.03%     
==========================================
  Files         522      523       +1     
  Lines       70369    70527     +158     
==========================================
+ Hits        55424    55572     +148     
- Misses      10943    10957      +14     
+ Partials     4002     3998       -4     
Flag Coverage Δ
unittests 78.79% <89.28%> (+0.03%) ⬆️

Signed-off-by: lhy1024 <admin@liudos.us>
Member

okJiang commented Feb 12, 2026

please link an issue and add some descriptions

storeReadQuery := core.GetReadQueryNum(stats.QueryStats)
storeWriteQuery := core.GetWriteQueryNum(stats.QueryStats)
storeTotalQuery := storeReadQuery + storeWriteQuery
storeGRPCCPU := statistics.StoreGRPCCPUUsage(stats.GetCpuUsages())
Member

read?

Contributor Author

Here we intentionally use gRPC CPU only. Unified-read CPU is already in peerStat.CpuStats.UnifiedRead, so using store read CPU here would double count.
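
To make the double-counting concern concrete, here is a small sketch of one way a per-region read CPU value could be assembled. The helper name and, in particular, the choice to apportion store gRPC CPU by the region's read queries over the store's total queries are assumptions for illustration; the PR's `RegionReadCPUUsage` may differ.

```go
package main

import "fmt"

// regionReadCPU is a hypothetical helper: the peer's own unified-read CPU
// plus a share of the store's gRPC CPU proportional to the region's read
// queries. Passing the store's full read CPU instead of its gRPC CPU would
// count unified-read time twice.
func regionReadCPU(peerUnifiedRead, storeGRPCCPU float64, regionReadQuery, storeTotalQuery uint64) float64 {
	if storeTotalQuery == 0 {
		return peerUnifiedRead
	}
	return peerUnifiedRead + storeGRPCCPU*float64(regionReadQuery)/float64(storeTotalQuery)
}

func main() {
	// Store: 40 units of gRPC CPU and 400 queries in total.
	// Region: 10 units of unified-read CPU and 100 read queries.
	fmt.Println(regionReadCPU(10, 40, 100, 400)) // 10 + 40*100/400 = 20
}
```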

@@ -0,0 +1,74 @@
// Copyright 2025 TiKV Project Authors.
Member

Suggested change
- // Copyright 2025 TiKV Project Authors.
+ // Copyright 2026 TiKV Project Authors.

return unifiedReadCPU
}
grpcCPU := float64(StoreGRPCCPUUsage(cpuUsages))
return unifiedReadCPU + grpcCPU*float64(readQuery)/float64(totalQuery)
Member

Is it accurate?

Contributor Author

This is an approximation: unified-read CPU is read-only, while grpc-server CPU is shared by read/write requests, so we apportion gRPC CPU by readQuery/totalQuery.

rollingWindowsSize = 5
// It is used to moving average CPU usage,
// and the window size is larger than other dimensions to make the CPU usage more stable.
cpuRollingWindowsSize = 9
Member

why 9?

Contributor Author

A larger window makes the CPU value more stable.
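
The effect of the window size can be illustrated with a toy rolling median. This is only a sketch under assumptions: the `rollingMedian` type is hypothetical and is not PD's implementation. It compares the two sizes from the snippet above (5 and 9) on a steady signal followed by a short burst.

```go
package main

import (
	"fmt"
	"sort"
)

// rollingMedian keeps the most recent `size` samples and reports their median.
type rollingMedian struct {
	size    int
	samples []float64
}

func newRollingMedian(size int) *rollingMedian {
	return &rollingMedian{size: size}
}

// Add appends a sample and drops the oldest one once the window is full.
func (r *rollingMedian) Add(v float64) {
	r.samples = append(r.samples, v)
	if len(r.samples) > r.size {
		r.samples = r.samples[1:]
	}
}

// Get returns the median of the current window (the upper middle element
// when the window holds an even number of samples), or 0 if it is empty.
func (r *rollingMedian) Get() float64 {
	if len(r.samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), r.samples...)
	sort.Float64s(sorted)
	return sorted[len(sorted)/2]
}

func main() {
	small, large := newRollingMedian(5), newRollingMedian(9)
	// Six steady samples at 10, then a burst of three samples at 90.
	for _, v := range []float64{10, 10, 10, 10, 10, 10, 90, 90, 90} {
		small.Add(v)
		large.Add(v)
	}
	fmt.Println(small.Get(), large.Get()) // 90 10
}
```

After the three-sample burst, the 5-sample window already reports the spike, while the 9-sample window still reports the steady value. The trade-off is that the larger window also reacts more slowly to a sustained change.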

)

// IsHotScheduleWithCPUSupported returns whether TiKV reports CPU info for hot scheduling.
func IsHotScheduleWithCPUSupported(clusterVersion *semver.Version) bool {
Member

What if we want to cherry-pick this to release 8.5?

Contributor Author

8.5.6 or 8.5.7?

Signed-off-by: lhy1024 <admin@liudos.us>
Contributor

ti-chi-bot bot commented Feb 14, 2026

@lhy1024: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-unit-test-next-gen-3 | cbcda2a | link | true | /test pull-unit-test-next-gen-3 |

Labels

dco-signoff: yes, release-note-none, size/XXL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

scheduler: introduce read cpu dimension

3 participants