@mymeiyi
Contributor

@mymeiyi mymeiyi commented Feb 12, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings February 12, 2026 06:23
@mymeiyi mymeiyi requested a review from w41ter as a code owner February 12, 2026 06:23
@Thearas
Contributor

Thearas commented Feb 12, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

This PR updates FE checkpoint/image generation for cloud mode so that the saved image includes cloud-specific runtime metadata (table/partition versions and tablet/replica stats), reducing reliance on rebuilding those values after restart.

Changes:

  • Add a cloud-mode post-processing step during checkpoint generation to copy table/partition versions and replica stats from the serving env into the checkpoint env before saving the image.
  • Persist additional cloud metadata by adding Gson @SerializedName annotations (e.g., table cached version, replica rowset/segment counts).
  • Make OlapTable.setCachedTableVersion() callable from checkpoint code.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
fe/fe-core/src/main/java/org/apache/doris/master/Checkpoint.java	Adds postProcessCloudMetadata() to copy versions and tablet stats into the checkpoint catalog before saveImage().
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java	Persists segmentCount and rowsetCount into image via @SerializedName.
fe/fe-core/src/main/java/org/apache/doris/catalog/OlapTable.java	Persists cached table version via @SerializedName and exposes setter for checkpoint to populate it.

Comment on lines 240 to 244
// Cache for table version in cloud mode
// This value is set when get the table version from meta-service, 0 means version is not cached yet
private volatile long lastTableVersionCachedTimeMs = 0;
@SerializedName(value = "cv")
private volatile long cachedTableVersion = -1;

Copilot AI Feb 12, 2026

cachedTableVersion is now serialized, but lastTableVersionCachedTimeMs is still not serialized (and will reset to 0 when loading from image). That makes the cache look immediately expired after restart, so getVisibleVersion() will still go to meta-service and the persisted cachedTableVersion likely won’t be used. Consider either persisting lastTableVersionCachedTimeMs as well, or (preferred) initializing lastTableVersionCachedTimeMs during gson post-process when a non-(-1) cached version is loaded so the cache is treated as fresh at image load time.
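
The preferred option above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the actual OlapTable code: the class `CachedVersionSketch`, the hook name `postProcessAfterLoad`, and the TTL-based expiry rule in `isCacheExpired` are all assumptions made for the example. Only the two field names and the -1 sentinel come from the diff.

```java
// Hypothetical stand-in for the two cache fields on OlapTable; not the real class.
class CachedVersionSketch {
    // Persisted in the image; -1 means no version has been cached.
    long cachedTableVersion = -1;
    // Not persisted, so it is 0 right after the object is rebuilt from the image.
    long lastTableVersionCachedTimeMs = 0;

    // Hypothetical post-load hook: if a real version was loaded, stamp the cache
    // time so the cache counts as fresh at image load time.
    void postProcessAfterLoad(long nowMs) {
        if (cachedTableVersion != -1) {
            lastTableVersionCachedTimeMs = nowMs;
        }
    }

    // Assumed expiry rule: the cache is stale when empty or older than ttlMs.
    boolean isCacheExpired(long nowMs, long ttlMs) {
        return cachedTableVersion == -1 || nowMs - lastTableVersionCachedTimeMs > ttlMs;
    }
}
```

Without the hook, a loaded version with a zero timestamp is reported as expired on the first check, which is the behavior this comment describes. Whether treating an image-loaded version as fresh is safe depends on how old the image can be, which the sketch does not model.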

Comment on lines +419 to +423
Database servingDb = servingEnv.getInternalCatalog().getDbNullable(db.getId());
if (servingDb == null) {
    LOG.warn("serving db is null. dbId: {}, dbName: {}", db.getId(), db.getFullName());
    continue;
}

Copilot AI Feb 12, 2026

The checkpoint env is replayed only up to checkPointVersion, so it’s expected that some db/table/partition/index objects may not exist (or may differ) in servingEnv due to concurrent DDLs after the checkpoint snapshot point. Logging each mismatch at WARN can create noisy logs during normal operation; consider downgrading these to DEBUG or aggregating counts and logging a single WARN/INFO summary per checkpoint.
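
One way to aggregate the mismatches into a single summary could look like the sketch below. It is an illustration only: `MismatchSummarySketch`, `summarize`, and the message wording are hypothetical, and plain id collections stand in for the checkpoint and serving catalogs.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper; ids stand in for Database objects in the two envs.
class MismatchSummarySketch {
    // Counts checkpoint dbs that are absent from the serving env and returns
    // one summary line instead of emitting a WARN per missing db.
    static String summarize(List<Long> checkpointDbIds, Set<Long> servingDbIds) {
        int missing = 0;
        for (long dbId : checkpointDbIds) {
            if (!servingDbIds.contains(dbId)) {
                missing++; // expected when DDL ran after the checkpoint point
                continue;
            }
            // The real code would copy versions and stats for this db here.
        }
        return "checkpoint post-process: " + checkpointDbIds.size()
                + " dbs scanned, " + missing + " missing in serving env";
    }
}
```

The returned line would be logged once per checkpoint at INFO or WARN, with the per-object details kept at DEBUG if they are still needed.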

Comment on lines 156 to 158
env.postProcessAfterMetadataReplayed(false);
postProcessCloudMetadata();
latestImageFilePath = env.saveImage();

Copilot AI Feb 12, 2026

This introduces new cloud-mode checkpoint behavior (postProcessCloudMetadata) that mutates the replayed catalog before saveImage(). There are existing tests that exercise checkpoint serialization/deserialization, but nothing here asserts that table cached versions / partition versions / replica stats are actually persisted and restored correctly in cloud mode. Please add a unit/integration test that creates an OlapTable + replicas, sets these fields, runs checkpoint image serialize/deserialize, and verifies the values survive.
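
The shape of such a round-trip check might resemble the following sketch. To stay dependency-free it swaps the Gson-based image format for plain `java.io` data streams, so it shows the structure of the test rather than the real serialization path. `ReplicaStatsSketch` and its methods are hypothetical; only the field names `rowsetCount` and `segmentCount` come from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the replica stats that the PR persists.
class ReplicaStatsSketch {
    final long rowsetCount;
    final long segmentCount;

    ReplicaStatsSketch(long rowsetCount, long segmentCount) {
        this.rowsetCount = rowsetCount;
        this.segmentCount = segmentCount;
    }

    // Stand-in for saving the image: writes both counters to a byte array.
    byte[] write() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(rowsetCount);
            out.writeLong(segmentCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Stand-in for loading the image: reads the counters back in the same order.
    static ReplicaStatsSketch read(byte[] image) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(image))) {
            long rowsets = in.readLong();
            long segments = in.readLong();
            return new ReplicaStatsSketch(rowsets, segments);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real test would set the fields on an OlapTable and its replicas, run the checkpoint serialize and deserialize path in cloud mode, and assert on the restored objects in the same way.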

@mymeiyi
Contributor Author

mymeiyi commented Feb 12, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 30543 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f32e8e3b943cde1ee8c2d00157c6b2b77c387ac0, data reload: false

------ Round 1 ----------------------------------
q1	17604	4403	4301	4301
q2	2011	338	234	234
q3	10184	1324	760	760
q4	10192	790	309	309
q5	7503	2226	1975	1975
q6	195	178	144	144
q7	941	724	617	617
q8	9268	1426	1146	1146
q9	4852	4704	4649	4649
q10	6858	1955	1552	1552
q11	465	260	247	247
q12	402	377	225	225
q13	17778	4050	3249	3249
q14	231	241	215	215
q15	889	808	805	805
q16	698	668	645	645
q17	706	863	498	498
q18	6688	5712	6195	5712
q19	1148	1044	677	677
q20	560	523	411	411
q21	2811	1914	1936	1914
q22	342	287	258	258
Total cold run time: 102326 ms
Total hot run time: 30543 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4586	4567	4666	4567
q2	268	358	273	273
q3	2302	2859	2468	2468
q4	1526	1874	1380	1380
q5	4956	4721	4524	4524
q6	223	184	137	137
q7	2023	1907	1875	1875
q8	2570	2398	2457	2398
q9	7612	7501	7656	7501
q10	2847	3068	2708	2708
q11	540	488	423	423
q12	725	764	589	589
q13	4024	4258	3480	3480
q14	265	280	273	273
q15	820	795	770	770
q16	644	673	637	637
q17	1077	1285	1329	1285
q18	7576	7355	7382	7355
q19	837	770	806	770
q20	1966	2043	1874	1874
q21	4636	4307	4177	4177
q22	510	454	423	423
Total cold run time: 52533 ms
Total hot run time: 49887 ms

@doris-robot

TPC-DS: Total hot run time: 189007 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f32e8e3b943cde1ee8c2d00157c6b2b77c387ac0, data reload: false

query5	4400	627	520	520
query6	315	224	204	204
query7	4220	465	259	259
query8	332	250	233	233
query9	8725	2715	2740	2715
query10	523	370	325	325
query11	17287	17070	16828	16828
query12	179	129	131	129
query13	1257	447	352	352
query14	6052	3233	2917	2917
query14_1	2836	2774	2750	2750
query15	213	189	177	177
query16	968	477	449	449
query17	1084	688	595	595
query18	2439	433	340	340
query19	211	209	183	183
query20	138	128	129	128
query21	229	146	127	127
query22	4774	4877	4982	4877
query23	17444	17016	16681	16681
query23_1	16957	16851	16774	16774
query24	7233	1613	1234	1234
query24_1	1238	1240	1236	1236
query25	552	464	443	443
query26	1241	280	159	159
query27	2747	457	295	295
query28	4543	1879	1869	1869
query29	815	561	476	476
query30	321	261	215	215
query31	882	740	645	645
query32	88	82	77	77
query33	541	344	303	303
query34	927	921	571	571
query35	646	679	595	595
query36	1113	1148	972	972
query37	140	102	87	87
query38	2990	2931	2856	2856
query39	891	884	836	836
query39_1	807	914	812	812
query40	216	131	131	131
query41	65	64	67	64
query42	107	103	103	103
query43	397	399	352	352
query44	1311	714	713	713
query45	200	189	185	185
query46	882	985	599	599
query47	2132	2140	2064	2064
query48	301	311	231	231
query49	608	433	343	343
query50	687	280	209	209
query51	4190	4098	4082	4082
query52	103	105	96	96
query53	296	339	288	288
query54	299	276	259	259
query55	88	89	79	79
query56	317	305	312	305
query57	1363	1329	1261	1261
query58	285	272	265	265
query59	2613	2698	2592	2592
query60	348	333	306	306
query61	150	146	147	146
query62	622	595	539	539
query63	314	278	275	275
query64	4900	1235	952	952
query65	4623	4531	4530	4530
query66	1478	461	351	351
query67	16551	16559	16405	16405
query68	2501	1078	718	718
query69	404	310	270	270
query70	1020	965	1001	965
query71	337	313	298	298
query72	2938	2778	2591	2591
query73	532	544	320	320
query74	9605	9544	9325	9325
query75	2789	2757	2418	2418
query76	2296	1061	655	655
query77	359	372	310	310
query78	10893	11062	10415	10415
query79	1068	945	618	618
query80	1297	568	509	509
query81	543	279	249	249
query82	1267	150	114	114
query83	347	265	242	242
query84	244	119	104	104
query85	931	471	419	419
query86	415	331	299	299
query87	3140	3098	2974	2974
query88	3550	2680	2648	2648
query89	435	374	349	349
query90	1936	173	169	169
query91	167	162	131	131
query92	74	69	70	69
query93	896	854	496	496
query94	645	325	305	305
query95	591	408	324	324
query96	637	503	228	228
query97	2465	2470	2402	2402
query98	226	215	217	215
query99	1015	985	927	927
Total cold run time: 261421 ms
Total hot run time: 189007 ms

@doris-robot

ClickBench: Total hot run time: 28.85 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f32e8e3b943cde1ee8c2d00157c6b2b77c387ac0, data reload: false

query1	0.05	0.05	0.05
query2	0.09	0.04	0.04
query3	0.25	0.08	0.08
query4	1.61	0.11	0.10
query5	0.27	0.25	0.25
query6	1.15	0.68	0.67
query7	0.03	0.03	0.02
query8	0.05	0.04	0.03
query9	0.57	0.49	0.50
query10	0.56	0.54	0.55
query11	0.14	0.09	0.09
query12	0.13	0.11	0.10
query13	0.63	0.60	0.61
query14	1.06	1.05	1.05
query15	0.89	0.88	0.88
query16	0.41	0.39	0.42
query17	1.12	1.15	1.11
query18	0.22	0.21	0.21
query19	2.09	2.02	2.05
query20	0.02	0.02	0.01
query21	15.41	0.26	0.15
query22	5.13	0.06	0.05
query23	15.87	0.29	0.11
query24	1.47	0.82	0.90
query25	0.10	0.12	0.14
query26	0.14	0.14	0.13
query27	0.08	0.06	0.06
query28	4.89	1.13	0.96
query29	12.56	3.98	3.18
query30	0.27	0.14	0.12
query31	2.81	0.67	0.40
query32	3.23	0.58	0.50
query33	3.27	3.28	3.20
query34	16.35	5.39	4.68
query35	4.76	4.79	4.76
query36	0.65	0.50	0.51
query37	0.11	0.07	0.07
query38	0.07	0.04	0.04
query39	0.05	0.03	0.03
query40	0.19	0.16	0.16
query41	0.09	0.04	0.03
query42	0.04	0.02	0.02
query43	0.05	0.03	0.04
Total cold run time: 98.93 s
Total hot run time: 28.85 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 62.82% (49/78) 🎉
Increment coverage report
Complete coverage report
