[fix](cloud) checkpoint save cloud version and tablet stats to image #60705
base: master
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
Pull request overview
This PR updates FE checkpoint/image generation for cloud mode so that the saved image includes cloud-specific runtime metadata (table/partition versions and tablet/replica stats), reducing reliance on rebuilding those values after restart.
Changes:
- Add a cloud-mode post-processing step during checkpoint generation to copy table/partition versions and replica stats from the serving env into the checkpoint env before saving the image.
- Persist additional cloud metadata by adding Gson `@SerializedName` annotations (e.g., table cached version, replica rowset/segment counts).
- Make `OlapTable.setCachedTableVersion()` callable from checkpoint code.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `fe/fe-core/src/main/java/org/apache/doris/master/Checkpoint.java` | Adds `postProcessCloudMetadata()` to copy versions and tablet stats into the checkpoint catalog before `saveImage()`. |
| `fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java` | Persists `segmentCount` and `rowsetCount` into image via `@SerializedName`. |
| `fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java` | Persists cached table version via `@SerializedName` and exposes setter for checkpoint to populate it. |
    // Cache for table version in cloud mode
    // This value is set when get the table version from meta-service, 0 means version is not cached yet
    private volatile long lastTableVersionCachedTimeMs = 0;
    @SerializedName(value = "cv")
    private volatile long cachedTableVersion = -1;
Copilot AI, Feb 12, 2026
cachedTableVersion is now serialized, but lastTableVersionCachedTimeMs is still not serialized (and will reset to 0 when loading from image). That makes the cache look immediately expired after restart, so getVisibleVersion() will still go to meta-service and the persisted cachedTableVersion likely won’t be used. Consider either persisting lastTableVersionCachedTimeMs as well, or (preferred) initializing lastTableVersionCachedTimeMs during gson post-process when a non-(-1) cached version is loaded so the cache is treated as fresh at image load time.
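The preferred option in this comment can be sketched roughly as follows. This is a minimal stand-in, not the actual `OlapTable` code: the class `CachedVersionSketch`, the injected clock, the expiry rule in `isCachedVersionExpired()`, and the hook name `afterLoadFromImage()` are all assumptions made for illustration. Only the two field names come from the snippet under review.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the cloud-mode version cache fields on OlapTable. */
class CachedVersionSketch {
    // Field names mirror the reviewed snippet; in the PR only cachedTableVersion is persisted.
    private volatile long lastTableVersionCachedTimeMs = 0;
    private volatile long cachedTableVersion;

    // Assumptions: an injectable clock and a fixed TTL, so the sketch is testable.
    private final LongSupplier clockMs;
    private final long ttlMs;

    CachedVersionSketch(long cachedTableVersion, LongSupplier clockMs, long ttlMs) {
        this.cachedTableVersion = cachedTableVersion;
        this.clockMs = clockMs;
        this.ttlMs = ttlMs;
    }

    /** Assumed expiry rule: no cached version, or the cache timestamp is older than the TTL. */
    boolean isCachedVersionExpired() {
        if (cachedTableVersion == -1) {
            return true;
        }
        return clockMs.getAsLong() - lastTableVersionCachedTimeMs > ttlMs;
    }

    /** Hypothetical post-load hook: treat a version restored from image as freshly cached. */
    void afterLoadFromImage() {
        if (cachedTableVersion != -1) {
            lastTableVersionCachedTimeMs = clockMs.getAsLong();
        }
    }
}
```

Without the hook, a restored object with `cachedTableVersion = 42` and a zero timestamp reports itself as expired right away, which is the behavior the comment describes. Stamping the load time trades a short window of possibly stale versions for fewer meta-service calls after restart; whether that trade is acceptable is a design decision for the PR author.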
    Database servingDb = servingEnv.getInternalCatalog().getDbNullable(db.getId());
    if (servingDb == null) {
        LOG.warn("serving db is null. dbId: {}, dbName: {}", db.getId(), db.getFullName());
        continue;
    }
Copilot AI, Feb 12, 2026
The checkpoint env is replayed only up to checkPointVersion, so it’s expected that some db/table/partition/index objects may not exist (or may differ) in servingEnv due to concurrent DDLs after the checkpoint snapshot point. Logging each mismatch at WARN can create noisy logs during normal operation; consider downgrading these to DEBUG or aggregating counts and logging a single WARN/INFO summary per checkpoint.
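One way to aggregate the mismatches into a single summary line is sketched below. The class `MissingObjectStats`, its `Kind` enum, and the summary wording are hypothetical names introduced for illustration; none of them exist in the PR.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-checkpoint counter for objects absent from the serving env. */
class MissingObjectStats {
    enum Kind { DB, TABLE, PARTITION, INDEX }

    private final Map<Kind, Integer> counts = new EnumMap<>(Kind.class);

    /** Call this where the loop would otherwise log a WARN and continue. */
    void recordMissing(Kind kind) {
        counts.merge(kind, 1, Integer::sum);
    }

    int count(Kind kind) {
        return counts.getOrDefault(kind, 0);
    }

    boolean isEmpty() {
        return counts.isEmpty();
    }

    /** Builds one line suitable for a single INFO/WARN log entry per checkpoint. */
    String summary() {
        StringBuilder sb = new StringBuilder("objects missing in serving env:");
        for (Kind kind : Kind.values()) {
            sb.append(' ')
              .append(kind.name().toLowerCase(Locale.ROOT))
              .append('=')
              .append(count(kind));
        }
        return sb.toString();
    }
}
```

In the checkpoint loop, each `continue` branch would call `recordMissing(...)` (optionally keeping the per-object detail at DEBUG), and the method would log `summary()` once at the end if `isEmpty()` is false.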
    env.postProcessAfterMetadataReplayed(false);
    postProcessCloudMetadata();
    latestImageFilePath = env.saveImage();
Copilot AI, Feb 12, 2026
This introduces new cloud-mode checkpoint behavior (postProcessCloudMetadata) that mutates the replayed catalog before saveImage(). There are existing tests that exercise checkpoint serialization/deserialization, but nothing here asserts that table cached versions / partition versions / replica stats are actually persisted and restored correctly in cloud mode. Please add a unit/integration test that creates an OlapTable + replicas, sets these fields, runs checkpoint image serialize/deserialize, and verifies the values survive.
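The shape of such a round-trip check might look like the sketch below. It deliberately swaps in plain `java.io` serialization so that it runs without Doris or Gson on the classpath; the real test would go through the FE image save/load path instead. The class `ReplicaStatsSketch` and the field `unpersistedTimestampMs` are hypothetical, and the `transient` field only plays the role of a field that lacks `@SerializedName`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a replica whose stats should survive an image round trip. */
class ReplicaStatsSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    final long segmentCount;   // persisted, like the fields annotated in the PR
    final long rowsetCount;    // persisted
    final transient long unpersistedTimestampMs; // stands in for a field that is not persisted

    ReplicaStatsSketch(long segmentCount, long rowsetCount, long unpersistedTimestampMs) {
        this.segmentCount = segmentCount;
        this.rowsetCount = rowsetCount;
        this.unpersistedTimestampMs = unpersistedTimestampMs;
    }

    /** Serializes to an in-memory buffer and reads the object back. */
    static ReplicaStatsSketch roundTrip(ReplicaStatsSketch in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
                out.writeObject(in);
            }
            try (ObjectInputStream oin =
                    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
                return (ReplicaStatsSketch) oin.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A test built on this shape sets known values, performs the round trip, and asserts that persisted fields keep their values while unpersisted ones fall back to their defaults. The second assertion is what would expose the `lastTableVersionCachedTimeMs` reset raised in the first review comment.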
run buildall
TPC-H: Total hot run time: 30543 ms
TPC-DS: Total hot run time: 189007 ms
ClickBench: Total hot run time: 28.85 s
FE Regression Coverage Report
Increment line coverage
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)