Leveraging segment-global ordinal mapping for efficient terms aggregation #20683
jainankitk wants to merge 7 commits into opensearch-project:main
Conversation
Signed-off-by: Ankit Jain <jainankitk@apache.org>
Important: Review skipped. Auto incremental reviews are disabled on this repository; check the settings in the CodeRabbit UI.
📝 Walkthrough: Refactors the term-frequency collection loop in GlobalOrdinalsStringTermsAggregator.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.
Applied to files:
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: gradle-check
- GitHub Check: precommit (21, ubuntu-24.04-arm)
- GitHub Check: precommit (21, macos-15-intel)
- GitHub Check: precommit (25, macos-15)
- GitHub Check: precommit (21, macos-15)
- GitHub Check: precommit (25, ubuntu-latest)
- GitHub Check: precommit (21, windows-latest)
- GitHub Check: precommit (21, windows-2025, true)
- GitHub Check: precommit (25, windows-latest)
- GitHub Check: precommit (25, ubuntu-24.04-arm)
- GitHub Check: precommit (25, macos-15-intel)
- GitHub Check: precommit (21, ubuntu-latest)
- GitHub Check: detect-breaking-change
- GitHub Check: assemble (25, ubuntu-24.04-arm)
- GitHub Check: assemble (21, ubuntu-latest)
- GitHub Check: Analyze (java)
- GitHub Check: assemble (21, ubuntu-24.04-arm)
- GitHub Check: assemble (25, ubuntu-latest)
- GitHub Check: assemble (21, windows-latest)
- GitHub Check: assemble (25, windows-latest)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java (1)
218-218: The ordinal mapping pattern at line 218 is sound. `GlobalOrdinalsIndexFieldData` explicitly documents that segment `TermsEnum` ordinals are designed to work with the `OrdinalMap` for global ordinals retrieval. The code is restricted to `ValuesSource.Bytes.WithOrdinals` (keyword and IP fields), where both inverted index and doc values ordinals are lexicographically sorted and compatible. No verification needed; this is a supported framework pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java`:
- Around line 94-95: The duplicate static import of NO_MORE_DOCS causes
ambiguous reference errors in GlobalOrdinalsStringTermsAggregator; remove the
deprecated static import of SortedSetDocValues.NO_MORE_DOCS and keep the static
import from DocIdSetIterator.NO_MORE_DOCS so unqualified uses of NO_MORE_DOCS
(e.g., in methods referencing NO_MORE_DOCS) resolve to
DocIdSetIterator.NO_MORE_DOCS; update the import block to drop the
SortedSetDocValues static import and ensure any references to
SortedSetDocValues.NO_MORE_DOCS are either removed or explicitly qualified to
DocIdSetIterator.NO_MORE_DOCS if needed.
- Around line 211-223: The loop in GlobalOrdinalsStringTermsAggregator now
passes global ordinals into ordCountConsumer, but
LowCardinality.tryPrecomputeAggregationForLeaf still assumes it will receive
segment ordinals and calls mapping.applyAsLong(ord), causing double-mapping;
update LowCardinality.tryPrecomputeAggregationForLeaf to accept the global
ordinal directly (i.e., stop calling mapping.applyAsLong on the ord passed into
the lambda) so it forwards the global ord to incrementBucketDocCount, and ensure
any references to mapping.applyAsLong in that method are removed or gated so
they only run when the incoming ord is a segment ord.
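The double-mapping hazard described above can be illustrated with a minimal plain-Java sketch (the mapping function and consumer names here are hypothetical stand-ins for `globalOrdinalsMapping` and the `ordCountConsumer` lambda, not the actual OpenSearch code): once the caller has already translated a segment ordinal to a global ordinal, the consumer must forward it as-is rather than applying the mapping a second time.

```java
import java.util.function.LongUnaryOperator;

// Minimal illustration of the double-mapping bug: if the caller already
// converts a segment ordinal to a global ordinal, the consumer must not
// apply the mapping again.
public class DoubleMappingSketch {
    // Hypothetical segment-ord -> global-ord mapping.
    static final LongUnaryOperator MAPPING = ord -> ord + 10;

    // Buggy consumer: assumes it receives a segment ordinal.
    static long buggyConsumer(long ord) {
        return MAPPING.applyAsLong(ord); // maps again -> wrong bucket ordinal
    }

    // Fixed consumer: treats its argument as an already-global ordinal.
    static long fixedConsumer(long globalOrd) {
        return globalOrd;
    }

    public static void main(String[] args) {
        long segmentOrd = 3;
        long globalOrd = MAPPING.applyAsLong(segmentOrd); // caller maps once: 13
        System.out.println(buggyConsumer(globalOrd));     // 23 (wrong)
        System.out.println(fixedConsumer(globalOrd));     // 13 (correct)
    }
}
```

The fix is purely a contract change: whichever side owns the mapping applies it exactly once.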
❌ Gradle check result for b7a981d: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Ankit Jain <jainankitk@apache.org>
❌ Gradle check result for a480f25: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Ankit Jain <jainankitk@apache.org>
Codecov Report ❌ Patch coverage is … Additional details and impacted files:

```diff
@@             Coverage Diff              @@
##               main   #20683      +/-  ##
============================================
- Coverage     73.32%   73.17%    -0.16%
+ Complexity    72064    71912      -152
============================================
  Files          5781     5781
  Lines        329395   329392        -3
  Branches      47525    47523        -2
============================================
- Hits         241536   241025      -511
- Misses        68507    68985      +478
- Partials     19352     19382       +30
```

☔ View full report in Codecov by Sentry.
Signed-off-by: Ankit Jain <jainankitk@apache.org>
```diff
- mapping = valuesSource.globalOrdinalsMapping(ctx);
- return tryCollectFromTermFrequencies(ctx, (ord, docCount) -> incrementBucketDocCount(mapping.applyAsLong(ord), docCount));
+ return tryCollectFromTermFrequencies(
+     ctx,
+     (globalOrd, docCount) -> incrementBucketDocCount(collectionStrategy.globalOrdToBucketOrd(0, globalOrd), docCount)
+ );
```
This kind of seemed like another bug to me in the original implementation. Since we were always passing the global ordinal to the consumer, using the segment-to-global mapping seems wrong.

+1! Seems unnecessary. @sandeshkr419 could you verify?
Signed-off-by: Ankit Jain <jainankitk@apache.org>
```java
ordinalTerm = globalOrdinalTermsEnum.next();
final TermsEnum segmentTermsEnum = segmentTerms.iterator();
final LongUnaryOperator globalOrdsMapping = valuesSource.globalOrdinalsMapping(ctx);
segmentTermsEnum.next();
```
why do we need this next()?

I believe the iterator is unpositioned without invoking next().
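This matches the cursor contract of Lucene's `TermsEnum`: a freshly obtained enum is positioned before the first term, so the current term is undefined until `next()` advances it. A plain-Java analogy of that contract (a toy class, not Lucene's actual API):

```java
// Toy cursor-style enum mimicking TermsEnum's positioning contract:
// freshly created, it sits *before* the first term, so next() must be
// called once before current() returns anything meaningful.
public class CursorEnumSketch {
    private final String[] terms;
    private int pos = -1; // unpositioned

    public CursorEnumSketch(String... terms) {
        this.terms = terms;
    }

    public String next() {
        pos++;
        return pos < terms.length ? terms[pos] : null; // null == exhausted
    }

    public String current() {
        if (pos < 0) throw new IllegalStateException("enum is unpositioned; call next() first");
        return pos < terms.length ? terms[pos] : null;
    }

    public static void main(String[] args) {
        CursorEnumSketch e = new CursorEnumSketch("apple", "banana");
        e.next();                        // position on the first term
        System.out.println(e.current()); // apple
    }
}
```

Hence the single `segmentTermsEnum.next()` before the comparison loop: it positions the enum on the first segment term.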
My only concern is correctness, because this is based on the assumption that term dictionary ordinals match the term doc values ordinals. In other words, the same term must be assigned the same ordinal in both structures. We do have a cluster setting to guard this: OpenSearch/server/src/main/java/org/opensearch/search/SearchService.java, lines 445 to 451 in 38a030f. From a performance perspective, maybe we can relax this setting to the cardinality of the field in the ClickBench terms query, where we noticed the regression, and trigger a benchmark to see.
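The assumption being questioned can be made concrete with a toy sketch (illustrative data only, not OpenSearch code): if both the term dictionary and the sorted-set doc values index the same term set and both assign ordinals by sorted position, then ordinal i names the same term in both structures, which is exactly what the optimization relies on.

```java
import java.util.TreeSet;

// If two structures index the same term set and both assign ordinals in
// sorted order, index i (the ordinal) refers to the same term in both.
public class OrdinalOrderSketch {
    public static String[] ordinals(TreeSet<String> terms) {
        return terms.toArray(new String[0]); // sorted order -> array index == ordinal
    }

    public static void main(String[] args) {
        TreeSet<String> termDict = new TreeSet<>();
        TreeSet<String> docValues = new TreeSet<>();
        for (String t : new String[] {"cherry", "apple", "banana"}) {
            termDict.add(t);   // stand-in for the inverted index term dictionary
            docValues.add(t);  // stand-in for sorted-set doc values
        }
        String[] a = ordinals(termDict);
        String[] b = ordinals(docValues);
        for (int i = 0; i < a.length; i++) {
            System.out.println(i + " -> " + a[i] + " == " + b[i]);
        }
    }
}
```

The guard setting matters precisely for cases where the two term sets could diverge, in which case positional ordinals would name different terms.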
Signed-off-by: Ankit Jain <jainankitk@apache.org>
PR Code Analyzer ❗ AI-powered 'Code-Diff-Analyzer' found issues on commit a8d5956.
The table above displays the top 10 most important findings. Pull Request author(s): please update your pull request according to the report above. Repository maintainer(s): Thanks.
{"run-benchmark-test": "id_16"} |
PR Reviewer Guide 🔍 (Review updated until commit 95fe19b.) Here are some key observations to aid the review process:
PR Code Suggestions ✨ Latest suggestions up to 95fe19b
Previous suggestions: suggestions up to commit a8d5956
{"run-benchmark-test": "id_16"} |
Persistent review updated to latest commit 95fe19b |
❌ Gradle check result for 95fe19b: FAILURE. Please examine the workflow log, locate and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Addresses the terms aggregation performance regression by using the segment-to-global ordinals mapping, without reading the terms from disk during per-segment collection.
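The approach can be sketched in plain Java (the toy data and names like `SEGMENT_TO_GLOBAL` are illustrative, not the actual implementation): per-segment term frequencies are keyed by segment-local ordinals, and a single `LongUnaryOperator` (obtained from `valuesSource.globalOrdinalsMapping(ctx)` in the real code) translates each one to its global ordinal when incrementing bucket doc counts, so the term bytes never need to be read back from disk per segment.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongUnaryOperator;

// Toy sketch of segment-to-global ordinal mapping for terms aggregation.
// The mapping array stands in for Lucene's OrdinalMap.
public class OrdinalMappingSketch {
    // Hypothetical segment ordinal -> global ordinal table for one segment.
    static final long[] SEGMENT_TO_GLOBAL = {0, 2, 3}; // segment ords 0, 1, 2

    public static Map<Long, Long> aggregate(long[] segmentOrds, long[] docCounts) {
        LongUnaryOperator mapping = ord -> SEGMENT_TO_GLOBAL[(int) ord];
        Map<Long, Long> bucketCounts = new HashMap<>();
        for (int i = 0; i < segmentOrds.length; i++) {
            // Map once per term, then count against the global ordinal.
            long globalOrd = mapping.applyAsLong(segmentOrds[i]);
            bucketCounts.merge(globalOrd, docCounts[i], Long::sum);
        }
        return bucketCounts;
    }

    public static void main(String[] args) {
        // Segment terms with ordinals 0, 1, 2 and frequencies 5, 3, 2.
        System.out.println(aggregate(new long[] {0, 1, 2}, new long[] {5, 3, 2}));
    }
}
```

Counts from multiple segments then merge naturally in the global ordinal space, which is what makes the per-segment pass cheap.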
Related Issues
Resolves #20626
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.