
Leveraging segment-global ordinal mapping for efficient terms aggrega… #20683

Open

jainankitk wants to merge 7 commits into opensearch-project:main from jainankitk:fix-terms

Conversation

@jainankitk
Contributor

…tion

Description

Addresses the terms aggregation performance regression by using the segment-to-global ordinals mapping, avoiding reading terms from disk during per-segment collection.

Related Issues

Resolves #20626

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…tion

Signed-off-by: Ankit Jain <jainankitk@apache.org>
@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Refactors the term-frequency collection loop in GlobalOrdinalsStringTermsAggregator.tryCollectFromTermFrequencies to iterate only through segment terms while using a global ordinals mapping to derive ordinals, replacing the previous leap-frogging two-iterator approach that unnecessarily iterated through all global ordinals per segment.
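For illustration, here is a minimal Java sketch of the new shape of the loop, using hypothetical simplified names (the real method works against Lucene's TermsEnum and the aggregator's bucket bookkeeping, not arrays):

import java.util.function.LongPredicate;
import java.util.function.LongUnaryOperator;

final class SegmentDrivenCollectionSketch {
    // Old approach (leap-frogging): advance a global-ordinal iterator and a segment
    // iterator in lockstep, touching every global ordinal once per segment.
    // New approach (below): visit only the segment's own terms and derive each
    // global ordinal through the precomputed segment-to-global mapping.
    static void collect(
        long segmentTermCount,
        long[] docFreqPerSegmentOrd,          // hypothetical stand-in for segmentTermsEnum.docFreq()
        LongUnaryOperator globalOrdsMapping,  // segment ordinal -> global ordinal
        LongPredicate acceptedGlobalOrdinals  // include/exclude filter on global ordinals
    ) {
        for (long segmentOrd = 0; segmentOrd < segmentTermCount; segmentOrd++) {
            long globalOrd = globalOrdsMapping.applyAsLong(segmentOrd);
            if (acceptedGlobalOrdinals.test(globalOrd)) {
                incrementBucketDocCount(globalOrd, docFreqPerSegmentOrd[(int) segmentOrd]);
            }
        }
    }

    static void incrementBucketDocCount(long globalOrd, long docCount) {
        // placeholder for the aggregator's bucket update
    }
}

The cost now scales with the segment's term count rather than the index-wide global ordinal count, which is where the regression came from.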

Changes

Cohort: Ordinals Mapping Optimization
File(s): server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Summary: Replaced the leap-frogging algorithm with segment-driven iteration using valuesSource.globalOrdinalsMapping(ctx) to map segment ordinals to global ordinals. Consolidates two nested iterator loops into a single segmentTermsEnum loop that conditionally increments counts when global ordinals are accepted. Adds an import for DocIdSetIterator.NO_MORE_DOCS.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • #20623: Also modifies GlobalOrdinalsStringTermsAggregator.tryCollectFromTermFrequencies to add cardinality threshold preconditions for skipping high-cardinality segments.

Suggested labels

v3.5.0, backport 3.5

🚥 Pre-merge checks: ✅ 5 passed

  • Title check: ✅ Passed. The PR title clearly summarizes the main change: leveraging segment-global ordinal mapping for efficient terms aggregation, which directly reflects the core refactoring described in the changeset.
  • Description check: ✅ Passed. The PR description addresses the template requirements with a clear description of the change, linked issue #20626, and a completed checklist confirming testing is included.
  • Linked Issues check: ✅ Passed. The code changes directly address the linked issue #20626 by replacing the leap-frogging algorithm with segment-driven term iteration and global ordinal mapping, eliminating the performance regression caused by iterating through millions of non-matching global ordinals.
  • Out of Scope Changes check: ✅ Passed. All code changes are scoped to the specific optimization described in the linked issue: refactoring GlobalOrdinalsStringTermsAggregator.tryCollectFromTermFrequencies to use segment-global ordinal mapping instead of the leap-frogging algorithm.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 107aa53 and b7a981d.

📒 Files selected for processing (1)
  • server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2026-01-13T17:40:27.167Z
Learnt from: reta
Repo: opensearch-project/OpenSearch PR: 20411
File: server/src/main/java/org/opensearch/index/codec/CodecService.java:112-133
Timestamp: 2026-01-13T17:40:27.167Z
Learning: Avoid capturing or evaluating a supplier (e.g., this::defaultCodec) upfront when passing it to a registry during object construction. If registries may replace defaults during iteration (as in EnginePlugin.getAdditionalCodecs), pass the supplier itself and only resolve it at use time. This ensures dynamic behavior is preserved during initialization and prevents premature binding of defaults in codecs/registry setup. This pattern should apply to similar initialization paths in Java server code where registries may mutate defaults during construction.

Applied to files:

  • server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: precommit (25, windows-latest)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, windows-latest)
🔇 Additional comments (1)
server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java (1)

218-218: The ordinal mapping pattern at line 218 is sound. GlobalOrdinalsIndexFieldData explicitly documents that segment TermsEnum ordinals are designed to work with the OrdinalMap for global ordinals retrieval. The code is restricted to ValuesSource.Bytes.WithOrdinals (keyword and IP fields), where both inverted index and doc values ordinals are lexicographically sorted and compatible. No verification needed—this is a supported framework pattern.
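As a rough sketch of that supported pattern, Lucene's OrdinalMap exposes a per-segment LongValues view that performs exactly this segment-to-global translation (construction of the map itself is omitted here):

import org.apache.lucene.index.OrdinalMap;
import org.apache.lucene.util.LongValues;

final class OrdinalMapUsageSketch {
    // Translate a segment-local ordinal into its global ordinal. The mapping is
    // monotonic because both orderings sort terms by their bytes.
    static long toGlobalOrd(OrdinalMap ordinalMap, int segmentIndex, long segmentOrd) {
        LongValues segmentToGlobal = ordinalMap.getGlobalOrds(segmentIndex);
        return segmentToGlobal.get(segmentOrd);
    }
}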

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java`:
- Around line 94-95: The duplicate static import of NO_MORE_DOCS causes
ambiguous reference errors in GlobalOrdinalsStringTermsAggregator; remove the
deprecated static import of SortedSetDocValues.NO_MORE_DOCS and keep the static
import from DocIdSetIterator.NO_MORE_DOCS so unqualified uses of NO_MORE_DOCS
(e.g., in methods referencing NO_MORE_DOCS) resolve to
DocIdSetIterator.NO_MORE_DOCS; update the import block to drop the
SortedSetDocValues static import and ensure any references to
SortedSetDocValues.NO_MORE_DOCS are either removed or explicitly qualified to
DocIdSetIterator.NO_MORE_DOCS if needed.
- Around line 211-223: The loop in GlobalOrdinalsStringTermsAggregator now
passes global ordinals into ordCountConsumer, but
LowCardinality.tryPrecomputeAggregationForLeaf still assumes it will receive
segment ordinals and calls mapping.applyAsLong(ord), causing double-mapping;
update LowCardinality.tryPrecomputeAggregationForLeaf to accept the global
ordinal directly (i.e., stop calling mapping.applyAsLong on the ord passed into
the lambda) so it forwards the global ord to incrementBucketDocCount, and ensure
any references to mapping.applyAsLong in that method are removed or gated so
they only run when the incoming ord is a segment ord.
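To make the double-mapping hazard described above concrete, a hypothetical sketch: once the callback already receives a global ordinal, re-applying the segment-to-global mapping translates it a second time and targets an unrelated bucket.

import java.util.function.LongUnaryOperator;

final class DoubleMappingSketch {
    // Wrong: 'globalOrd' is already global, so mapping it again yields an arbitrary ordinal.
    static long buggy(long globalOrd, LongUnaryOperator segmentToGlobal) {
        return segmentToGlobal.applyAsLong(globalOrd);
    }

    // Right: forward the global ordinal unchanged to incrementBucketDocCount.
    static long fixed(long globalOrd) {
        return globalOrd;
    }
}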

@github-actions
Contributor

❌ Gradle check result for b7a981d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ankit Jain <jainankitk@apache.org>
@github-actions
Contributor

❌ Gradle check result for a480f25: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Ankit Jain <jainankitk@apache.org>
@github-actions
Contributor

✅ Gradle check result for 5a1f4c0: SUCCESS

@codecov

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 73.17%. Comparing base (6b557db) to head (4416b20).
⚠️ Report is 6 commits behind head on main.

Files with missing lines | Patch % | Lines
...ket/terms/GlobalOrdinalsStringTermsAggregator.java | 90.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20683      +/-   ##
============================================
- Coverage     73.32%   73.17%   -0.16%     
+ Complexity    72064    71912     -152     
============================================
  Files          5781     5781              
  Lines        329395   329392       -3     
  Branches      47525    47523       -2     
============================================
- Hits         241536   241025     -511     
- Misses        68507    68985     +478     
- Partials      19352    19382      +30     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


Signed-off-by: Ankit Jain <jainankitk@apache.org>
Comment on lines -571 to +563
- mapping = valuesSource.globalOrdinalsMapping(ctx);
- return tryCollectFromTermFrequencies(ctx, (ord, docCount) -> incrementBucketDocCount(mapping.applyAsLong(ord), docCount));
+ return tryCollectFromTermFrequencies(
+     ctx,
+     (globalOrd, docCount) -> incrementBucketDocCount(collectionStrategy.globalOrdToBucketOrd(0, globalOrd), docCount)
+ );
Contributor Author

This seemed like another bug to me in the original implementation. Since we were always passing the global ordinal to the consumer, applying the segment -> global mapping seems wrong.

Contributor

+1! seems unnecessary. @sandeshkr419 could you verify?

Signed-off-by: Ankit Jain <jainankitk@apache.org>
@github-actions
Contributor

✅ Gradle check result for 4416b20: SUCCESS

- ordinalTerm = globalOrdinalTermsEnum.next();
+ final TermsEnum segmentTermsEnum = segmentTerms.iterator();
+ final LongUnaryOperator globalOrdsMapping = valuesSource.globalOrdinalsMapping(ctx);
+ segmentTermsEnum.next();
Contributor

Why do we need this next()?

Contributor Author

I believe the iterator is unpositioned without invoking next()
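For context, this matches Lucene's TermsEnum contract: a freshly obtained iterator is unpositioned, and the first next() positions it on the first term (segment ordinal 0). A minimal sketch of that contract (the Terms argument is assumed to come from the leaf reader):

import java.io.IOException;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class TermsEnumPositioningSketch {
    static void walk(Terms terms) throws IOException {
        TermsEnum termsEnum = terms.iterator();   // unpositioned: term()/docFreq() undefined here
        for (BytesRef term = termsEnum.next();    // positions on the first term (segment ord 0)
             term != null;
             term = termsEnum.next()) {           // returns null once the enum is exhausted
            System.out.println(term.utf8ToString() + " docFreq=" + termsEnum.docFreq());
        }
    }
}

So the initial next() does not skip a term; it is what makes docFreq() valid for segment ordinal 0 on the first loop iteration.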

@bowenlan-amzn
Member

My only concern is correctness, because this is based on the assumption that term dictionary ordinals match term doc values ordinals. In other words, the same term foo gets the same ordinal, e.g. 23, in both the term dictionary order and the term doc values order.

We do have a cluster setting to guard this

public static final Setting<Long> TERMS_AGGREGATION_MAX_PRECOMPUTE_CARDINALITY = Setting.longSetting(
"search.aggregations.terms.max_precompute_cardinality",
30_000L,
0L,
Property.Dynamic,
Property.NodeScope
);

From a performance perspective, maybe we can relax this setting to the cardinality of the field in the clickbench terms query, where we noticed the regression, and trigger a benchmark to see.

https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/adf94810a0b6802a1efe92719eb2139b7a46ae13/clickbench/operations/dsl.json#L1849-L1908
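To make the assumption concrete: both orderings sort terms by their bytes, so for a keyword field whose segment terms are apple, banana, cherry, both the term dictionary and the SortedSetDocValues assign ordinals 0, 1, 2. A hypothetical spot check of this alignment (note TermsEnum.ord() is optional and not supported by every implementation):

import java.io.IOException;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class OrdinalAlignmentCheck {
    // Returns true if every term's dictionary ordinal equals its doc values ordinal.
    static boolean ordinalsAligned(Terms segmentTerms, SortedSetDocValues docValues) throws IOException {
        TermsEnum termsEnum = segmentTerms.iterator();
        for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
            if (termsEnum.ord() != docValues.lookupTerm(term)) {
                return false;
            }
        }
        return true;
    }
}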

Signed-off-by: Ankit Jain <jainankitk@apache.org>
@github-actions
Contributor

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit a8d5956.

Path | Line | Severity | Description
server/src/main/java/org/opensearch/search/SearchService.java | 447 | medium | Cardinality limit increased by 1,000,000x (30K to 30B). While aligned with the stated optimization feature, this extreme increase could enable resource-exhaustion DoS attacks if exploited. Verify additional safeguards exist to prevent memory/CPU exhaustion from massive aggregations.

The table above displays the top 10 most important findings.

Total: 1 | Critical: 0 | High: 0 | Medium: 1 | Low: 0


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@jainankitk
Contributor Author

{"run-benchmark-test": "id_16"}

@jainankitk added the skip-diff-analyzer label (Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis) on Feb 21, 2026
@github-actions
Contributor

github-actions bot commented Feb 21, 2026

PR Reviewer Guide 🔍

(Review updated until commit 95fe19b)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Missing Validation

The commented assertion at line 223 suggests uncertainty about whether segmentTermsEnum is exhausted after the loop. This should be validated or the assertion should be implemented to ensure correctness of the ordinal iteration logic.

for (long segmentOrd = 0; segmentOrd < termCount; segmentOrd++) {
    long globalOrd = globalOrdsMapping.applyAsLong(segmentOrd);
    if (acceptedGlobalOrdinals.test(globalOrd)) {
        ordCountConsumer.accept(globalOrd, segmentTermsEnum.docFreq());
    }
    segmentTermsEnum.next();
}
// assert segmentTermsEnum is exhausted?
Potential Bug

The new implementation calls segmentTermsEnum.next() before the loop (line 212) and then again at the end of each iteration (line 221). This means the first term is skipped. The old implementation correctly processed all terms by checking indexTerm != null after the initial next() call.

final TermsEnum segmentTermsEnum = segmentTerms.iterator();
final LongUnaryOperator globalOrdsMapping = valuesSource.globalOrdinalsMapping(ctx);
segmentTermsEnum.next();

// Iterate over the ordinals in the segment, look for matches in the global ordinal,
// and increment bucket count when segment ordinal is contained in global ordinals.
for (long segmentOrd = 0; segmentOrd < termCount; segmentOrd++) {
    long globalOrd = globalOrdsMapping.applyAsLong(segmentOrd);
    if (acceptedGlobalOrdinals.test(globalOrd)) {
        ordCountConsumer.accept(globalOrd, segmentTermsEnum.docFreq());
    }
    segmentTermsEnum.next();
}
Large Default Change

The default value for TERMS_AGGREGATION_MAX_PRECOMPUTE_CARDINALITY increased from 30,000 to 30 billion (1 million times larger). This dramatic change could significantly impact memory usage and performance. Ensure this is intentional and properly tested across different deployment scenarios.

30_000_000_000L,
Removed Logic

The LowCardinality class removed the mapSegmentCountsToGlobalCounts call and the mapping field management. Verify that this logic is no longer needed or has been properly relocated, as it may have been handling important state transitions between segments.

return tryCollectFromTermFrequencies(
    ctx,
    (globalOrd, docCount) -> incrementBucketDocCount(collectionStrategy.globalOrdToBucketOrd(0, globalOrd), docCount)
);

@github-actions
Contributor

github-actions bot commented Feb 21, 2026

PR Code Suggestions ✨

Latest suggestions up to 95fe19b
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix term enumeration alignment issue

The initial segmentTermsEnum.next() call before the loop may skip the first term,
causing misalignment between segmentOrd and the actual term position. Move the
next() call to the end of the loop body or verify that the first term is at ordinal
0 without requiring an initial advance.

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java [210-222]

 final TermsEnum segmentTermsEnum = segmentTerms.iterator();
 final LongUnaryOperator globalOrdsMapping = valuesSource.globalOrdinalsMapping(ctx);
-segmentTermsEnum.next();
 
 // Iterate over the ordinals in the segment, look for matches in the global ordinal,
 // and increment bucket count when segment ordinal is contained in global ordinals.
 for (long segmentOrd = 0; segmentOrd < termCount; segmentOrd++) {
+    if (segmentTermsEnum.next() == null) break;
     long globalOrd = globalOrdsMapping.applyAsLong(segmentOrd);
     if (acceptedGlobalOrdinals.test(globalOrd)) {
         ordCountConsumer.accept(globalOrd, segmentTermsEnum.docFreq());
     }
-    segmentTermsEnum.next();
 }
Suggestion importance[1-10]: 9


Why: This identifies a critical bug where the initial segmentTermsEnum.next() call before the loop causes misalignment between segmentOrd and term positions, potentially leading to incorrect aggregation results. The suggested fix properly synchronizes the enumeration with ordinal iteration.

High

Previous suggestions

Suggestions up to commit a8d5956
Category | Suggestion | Impact
Possible issue
Fix term enumeration alignment issue

The first segmentTermsEnum.next() call before the loop may skip the first term
without processing it. This could lead to misalignment between segmentOrd and the
actual term position. Remove the initial next() call and move it to the end of the
loop body to ensure all terms are processed correctly.

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java [210-222]

 final TermsEnum segmentTermsEnum = segmentTerms.iterator();
 final LongUnaryOperator globalOrdsMapping = valuesSource.globalOrdinalsMapping(ctx);
-segmentTermsEnum.next();
 
 // Iterate over the ordinals in the segment, look for matches in the global ordinal,
 // and increment bucket count when segment ordinal is contained in global ordinals.
 for (long segmentOrd = 0; segmentOrd < termCount; segmentOrd++) {
+    segmentTermsEnum.next();
     long globalOrd = globalOrdsMapping.applyAsLong(segmentOrd);
     if (acceptedGlobalOrdinals.test(globalOrd)) {
         ordCountConsumer.accept(globalOrd, segmentTermsEnum.docFreq());
     }
-    segmentTermsEnum.next();
 }
Suggestion importance[1-10]: 9


Why: This is a critical bug fix. The current code calls segmentTermsEnum.next() before the loop, which skips the first term (ordinal 0), causing misalignment between segmentOrd and the actual term position. This would lead to incorrect aggregation results. Moving the next() call inside the loop ensures proper alignment.

High
General
Add upper bound for cardinality

The new default value of 30 billion is extremely high and may cause memory issues or
performance degradation. Consider adding an upper bound validation or documenting
the memory implications of such a large cardinality threshold to prevent resource
exhaustion.

server/src/main/java/org/opensearch/search/SearchService.java [445-451]

 public static final Setting<Long> TERMS_AGGREGATION_MAX_PRECOMPUTE_CARDINALITY = Setting.longSetting(
     "search.aggregations.terms.max_precompute_cardinality",
     30_000_000_000L,
     0L,
+    Long.MAX_VALUE,
     Property.Dynamic,
     Property.NodeScope
 );
Suggestion importance[1-10]: 2


Why: The suggestion adds Long.MAX_VALUE as an upper bound, but this doesn't actually provide any meaningful constraint since it's the maximum possible value. The PR intentionally increases the threshold from 30,000 to 30 billion, and adding Long.MAX_VALUE doesn't address the suggested concern about memory implications.

Low

@jainankitk
Contributor Author

{"run-benchmark-test": "id_16"}

@github-actions
Contributor

Persistent review updated to latest commit 95fe19b

@github-actions
Contributor

❌ Gradle check result for 95fe19b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Labels

  • bug: Something isn't working
  • Search:Performance
  • skip-diff-analyzer: Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Performance regression in terms aggregation with match_all queries

3 participants