[segment replication] Simplify the ckp verification for the primary shard relocation#20609
[segment replication] Simplify the ckp verification for the primary shard relocation#20609guojialiang92 wants to merge 4 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds handoff state tracking to detect relocation in progress and removes the isRetry parameter from segment replication target classes. New accessor methods expose handoff state, validation is added before replication setup, and the isRetry field and related gating logic are eliminated from replication target constructors and method signatures. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
|
❗ AI-powered Code-Diff-Analyzer found issues on commit dccf9cc.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.java`:
- Around line 849-851: The public accessor isHandoffInProgress() reads the
non-volatile field handoffInProgress without synchronization, risking stale
reads across threads; fix by making handoffInProgress volatile or by
synchronizing accessors in ReplicationTracker (e.g., mark handoffInProgress as
volatile or add synchronized to isHandoffInProgress() and any writers) so
readers see up-to-date handoff state during relocation handoff checks.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
server/src/main/java/org/opensearch/index/seqno/ReplicationTracker.javaserver/src/main/java/org/opensearch/index/shard/IndexShard.javaserver/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.javaserver/src/main/java/org/opensearch/indices/replication/MergedSegmentReplicationTarget.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicationSourceService.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicator.javaserver/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetServiceTests.java
💤 Files with no reviewable changes (1)
- server/src/main/java/org/opensearch/indices/replication/AbstractSegmentReplicationTarget.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.
Applied to files:
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationSourceService.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicator.javaserver/src/main/java/org/opensearch/index/shard/IndexShard.javaserver/src/main/java/org/opensearch/index/seqno/ReplicationTracker.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.javaserver/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.javaserver/src/main/java/org/opensearch/indices/replication/MergedSegmentReplicationTarget.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: Analyze (java)
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, windows-latest)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
🔇 Additional comments (13)
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTarget.java (2)
39-46: Constructor update looks good.
The new super call aligns with the simplified constructor signature.
130-131: Retry copy update looks good.
The retry path now uses the simplified constructor consistently.server/src/main/java/org/opensearch/indices/replication/MergedSegmentReplicationTarget.java (1)
30-37: Constructor update looks good.
The superclass delegation matches the new signature across the target hierarchy.server/src/main/java/org/opensearch/index/shard/IndexShard.java (1)
4383-4385: Clean delegation for handoff state.
Simple accessor keeps the API consistent and reduces direct tracker access elsewhere.server/src/main/java/org/opensearch/indices/replication/SegmentReplicationSourceService.java (1)
132-146: Good pre-check before replication setup.
Blocking replication while a handoff is in progress is the right safety net.server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)
112-121: LGTM!The
startReplicationmethod correctly constructsSegmentReplicationTargetwith the simplified 4-argument signature (removingisRetry), consistent with the broader refactor to eliminate retry-gating at the call site.server/src/test/java/org/opensearch/indices/replication/SegmentReplicationTargetServiceTests.java (3)
479-479: LGTM!The verification correctly reflects the updated 3-argument
startReplicationsignature after removing theisRetryparameter.
630-634: LGTM!The listener argument index correctly updated from position 3 to position 2 to match the new 3-argument
startReplication(IndexShard, ReplicationCheckpoint, SegmentReplicationListener)signature.
678-678: LGTM!Verification updated to use the 3-argument form, consistent with the API change.
server/src/main/java/org/opensearch/indices/replication/SegmentReplicationTargetService.java (4)
298-298: LGTM!The
onNewCheckpointmethod now has a cleaner signature without theisRetryparameter, simplifying the checkpoint processing flow.
335-372: LGTM!The replication listener implementation is well-structured:
- On success: updates visible checkpoint and processes any newer checkpoints received during replication
- On failure: logs the failure, handles shard failure if needed, otherwise processes latest checkpoint for retry
The removal of
isRetryfromprocessLatestReceivedCheckpoint(line 369) simplifies the retry path while maintaining the same behavior of attempting to process any newer checkpoints after a non-fatal failure.
483-509: LGTM!The
processLatestReceivedCheckpointmethod correctly delegates toonNewCheckpoint(line 497) without theisRetryflag. The thread-forking logic is preserved to avoid stack overflow on recursive checkpoint processing.
523-529: LGTM!The public
startReplicationmethod signature is now simplified to 3 parameters (IndexShard, ReplicationCheckpoint, SegmentReplicationListener), with the source obtained internally from the factory. This provides a cleaner API surface for callers.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
|
❌ Gradle check result for dccf9cc: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
❗ AI-powered Code-Diff-Analyzer found issues on commit 413289e.
The table above displays the top 10 most important findings. Pull Requests Author(s): Please update your Pull Request according to the report above. Repository Maintainer(s): You can Thanks. |
Signed-off-by: guojialiang <guojialiang.2012@bytedance.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #20609 +/- ##
============================================
- Coverage 73.37% 73.37% -0.01%
- Complexity 72197 72248 +51
============================================
Files 5798 5798
Lines 329816 329814 -2
Branches 47538 47536 -2
============================================
- Hits 242001 241985 -16
- Misses 68458 68488 +30
+ Partials 19357 19341 -16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // See: https://github.com/opensearch-project/OpenSearch/issues/19234 | ||
| boolean isRecovering = indexShard.routingEntry().initializing() || indexShard.routingEntry().relocating(); | ||
| if (indexShard.indexSettings().isSegRepLocalEnabled() | ||
| && checkpoint.isAheadOf(getMetadataCheckpoint) |
There was a problem hiding this comment.
We cannot remove checkpoint.isAheadOf(getMetadataCheckpoint). isHandoffInProgress handles the relocation race but fails against split-brain or Zombie Primaries returning outdated metadata. Without this check, the replica will accept older metadata causing state rollback (deletion of committed segments). The safety check must stay.
Description
This PR is to address the issues in #[20610].
If we can specifically identify the scenario of primary shard relocation, we can avoid continuously patching the ckp verification logic.
Fortunately, we set state handoffInProgress during the hand-off phase of peer recovery. We can leverage this state to require that when the primary shard receives request
GET_CHECKPOINT_INFO, it must be a started primary shard that is not in the hand-off process.Related Issues
Resolves #[20610]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.