Skip to content

Fix content dropping in html_strip character filter (#20554)#20575

Open
Seungmin123 wants to merge 1 commit intoopensearch-project:mainfrom
Seungmin123:fix/html-strip-bug-20554
Open

Fix content dropping in html_strip character filter (#20554)#20575
Seungmin123 wants to merge 1 commit intoopensearch-project:mainfrom
Seungmin123:fix/html-strip-bug-20554

Conversation

@Seungmin123
Copy link

Description

This PR fixes a bug in the html_strip character filter (and ingest processor) where content was being incorrectly dropped when an HTML attribute value ended with an equals sign (=).

The root cause is a parsing issue in Lucene's HTMLStripCharFilter. To resolve this without forking the Lucene class, this change introduces a pre-processing step using PatternReplaceCharFilter. It identifies the problematic pattern (an equals sign followed by a double quote or whitespace) and escapes it to the HTML entity =. Since HTMLStripCharFilter decodes entities in its output text, this workaround is transparent to users while ensuring correct HTML stripping.

Related Issues

Resolves #20554

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ct#20554)

The html_strip character filter (based on Lucene's HTMLStripCharFilter)
had a bug where it would drop content if an HTML attribute value
ended with an equals sign (=).

This fix pre-processes the input to escape the problematic equals sign
into an HTML entity (&opensearch-project#61;) when followed by a quote or space.
Since HTMLStripCharFilter decodes entities in its output, this is
transparent to the user while preventing the parsing bug.

Signed-off-by: Seungmin123 <dltmdals147@gmail.com>
@Seungmin123 Seungmin123 requested a review from a team as a code owner February 9, 2026 09:10
@github-actions github-actions bot added bug Something isn't working Search Search query, autocomplete ...etc labels Feb 9, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 9, 2026

📝 Walkthrough

Walkthrough

A bug fix for the html_strip character filter that drops content when HTML attribute values end with an equals sign. The fix introduces a PatternReplaceCharFilter that preprocesses input by replacing equals signs followed by quotes or whitespace with the HTML entity "=" before stripping HTML.

Changes

Cohort / File(s) Summary
Documentation
CHANGELOG.md
Added entry documenting the fix for content dropping in html_strip character filter.
Production Code
modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java, modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java
Introduced PatternReplaceCharFilter preprocessing to escape equals signs before HTML stripping using EQUALS_QUOTE_PATTERN regex matching =(?=["\\s]) and replacing with &#61;.
Test Additions
modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java, modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java
Added comprehensive unit tests validating html_strip behavior with equals signs in attribute values, including edge cases (end of attribute, space after equals, HTML entity encoding).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely summarizes the main change: fixing content dropping in the html_strip character filter.
Description check ✅ Passed The PR description fully addresses the template requirements with a clear description of the bug, related issue number, and completed checklist items.
Linked Issues check ✅ Passed All code changes directly address the bug reported in #20554: the PatternReplaceCharFilter preprocessing prevents content dropping by escaping equals signs in attribute values.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the html_strip bug: CHANGELOG update, HtmlStripCharFilterFactory fix, HtmlStripProcessor fix, and corresponding test additions.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In
`@modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java`:
- Line 52: Duplicate regex constant EQUALS_QUOTE_PATTERN is declared in
HtmlStripCharFilterFactory, HtmlStripProcessor, and HtmlStripCharFilterTests;
create a single shared public static final Pattern (e.g., EQUALS_QUOTE_PATTERN)
in a new or existing utility class (e.g., HtmlStripUtils or AnalysisCommonUtils)
and replace the local definitions in HtmlStripCharFilterFactory,
HtmlStripProcessor, and HtmlStripCharFilterTests to reference that shared
constant; ensure the new constant is package-visible or public as needed, update
imports/references accordingly, and run tests to verify no behavioral change.

In
`@modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java`:
- Around line 43-58: The test helper assertHtmlStrip currently opens an
HTMLStripCharFilter Reader (wrapping PatternReplaceCharFilter with
EQUALS_QUOTE_PATTERN) but never closes it; change assertHtmlStrip to use
try-with-resources to ensure the Reader is closed after reading, and after
building the output also add negative assertions that HTML tags (for example
check output does NOT contain "<a ") are absent in addition to the existing
positive contains checks; update references inside the method to use the
try-with-resources block around new HTMLStripCharFilter(...) and perform the
negative assertion(s) on the produced output string.

In
`@modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java`:
- Around line 52-59: In testHtmlStripWithEqualsAtEndOfAttribute
(HtmlStripProcessorTests) add negative assertions to ensure tags are removed and
the "example" check isn't satisfied only by "example2": after invoking
processor.process(input) assert that the output does not contain raw tag markers
(e.g., "<a " and "</a>") to verify HTMLStripCharFilter still strips tags, and
tighten the "example" assertion so it verifies the standalone expected token
(not just a substring that could be provided by "example2") — update assertions
around the processor variable in the test method accordingly.
🧹 Nitpick comments (2)
modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java (1)

47-47: Duplicated EQUALS_QUOTE_PATTERN — consider a shared location.

This pattern is identical to the one in HtmlStripCharFilterFactory. If the regex ever needs to be refined (see earlier comment about its breadth), both must be updated in lockstep. Consider extracting it to a shared utility class or constant.

modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java (1)

21-21: Third copy of EQUALS_QUOTE_PATTERN.

This is now the third location where this pattern is defined (also in HtmlStripCharFilterFactory and HtmlStripProcessor). For the test, consider importing or referencing the production constant (e.g., make it package-visible in the factory) rather than duplicating. At minimum, if it stays duplicated, add a comment pointing to the canonical definition.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8134f12 and dab6096.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java
  • modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java
  • modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java
  • modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: gradle-check
  • GitHub Check: Mend Security Check
🔇 Additional comments (4)
modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java (1)

64-67: LGTM on the overall approach of wrapping with PatternReplaceCharFilter.

The preprocessing-then-strip strategy is a reasonable workaround for the upstream Lucene HTMLStripCharFilter parsing bug without requiring a Lucene fork. The entity &#61; round-trips correctly through HTMLStripCharFilter.

CHANGELOG.md (1)

16-16: LGTM!

Changelog entry is clear, correctly references the issue, and is in the right section.

modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java (1)

54-74: LGTM on the filter composition and resource handling.

The try-with-resources on the outer HTMLStripCharFilter will cascade-close the inner PatternReplaceCharFilter and StringReader. The early-exit shortcut on line 56 remains valid since the = workaround is only relevant inside HTML tags.

modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java (1)

23-41: Good test coverage across edge cases.

The four tests cover the key scenarios: equals at end of attribute, equals mid-value, space after equals, and pre-encoded entity. This provides solid regression coverage for the bug fix.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.


public class HtmlStripCharFilterFactory extends AbstractCharFilterFactory {
private final Set<String> escapedTags;
private static final Pattern EQUALS_QUOTE_PATTERN = Pattern.compile("=(?=[\\\"\\s])");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for EQUALS_QUOTE_PATTERN usage and duplication
echo "=== Searching for EQUALS_QUOTE_PATTERN ==="
rg -n "EQUALS_QUOTE_PATTERN" --type java

echo ""
echo "=== Check for HtmlStripProcessor.java ==="
find . -name "HtmlStripProcessor.java" -type f

echo ""
echo "=== Check for test files related to HtmlStrip ==="
find . -name "*HtmlStrip*Test*.java" -o -name "*HtmlStripCharFilter*Test.java"

echo ""
echo "=== Check for Pattern.compile.*=.* patterns in the codebase ==="
rg -n 'Pattern\.compile.*=.*\[' --type java | head -20

Repository: opensearch-project/OpenSearch

Length of output: 3978


🏁 Script executed:

#!/bin/bash
# Read the HtmlStripCharFilterFactory.java to understand context
echo "=== Reading HtmlStripCharFilterFactory.java context ==="
find . -name "HtmlStripCharFilterFactory.java" -type f | head -1 | xargs cat -n | head -80

Repository: opensearch-project/OpenSearch

Length of output: 3088


🏁 Script executed:

#!/bin/bash
# Check if HtmlStripProcessor.java exists and has the same pattern
echo "=== Checking HtmlStripProcessor.java for pattern ==="
find . -name "HtmlStripProcessor.java" -type f | xargs grep -n "EQUALS_QUOTE_PATTERN\|Pattern.compile.*=.*\[" 2>/dev/null || echo "File not found or pattern not present"

Repository: opensearch-project/OpenSearch

Length of output: 323


🏁 Script executed:

#!/bin/bash
# Read the test file to see what is being tested
echo "=== Reading HtmlStripCharFilterTests.java ==="
wc -l ./modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java

echo ""
cat -n ./modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java

Repository: opensearch-project/OpenSearch

Length of output: 3278


🏁 Script executed:

#!/bin/bash
# Check HtmlStripProcessor.java and HtmlStripProcessorTests.java
echo "=== Reading HtmlStripProcessor.java relevant sections ==="
cat -n ./modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java | head -100

echo ""
echo "=== Reading HtmlStripProcessorTests.java test cases ==="
wc -l ./modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java

cat -n ./modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java

Repository: opensearch-project/OpenSearch

Length of output: 7037


Extract duplicate EQUALS_QUOTE_PATTERN to a shared utility constant.

This pattern is currently defined identically in three places: HtmlStripCharFilterFactory.java, HtmlStripProcessor.java, and HtmlStripCharFilterTests.java. The duplication creates maintenance risk if the regex ever needs adjustment. Tests in HtmlStripCharFilterTests.java confirm the pattern works correctly with HTML attributes (e.g., href="..." with embedded = characters), so extracting it to a shared utility will ensure consistency across all uses.

🤖 Prompt for AI Agents
In
`@modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java`
at line 52, Duplicate regex constant EQUALS_QUOTE_PATTERN is declared in
HtmlStripCharFilterFactory, HtmlStripProcessor, and HtmlStripCharFilterTests;
create a single shared public static final Pattern (e.g., EQUALS_QUOTE_PATTERN)
in a new or existing utility class (e.g., HtmlStripUtils or AnalysisCommonUtils)
and replace the local definitions in HtmlStripCharFilterFactory,
HtmlStripProcessor, and HtmlStripCharFilterTests to reference that shared
constant; ensure the new constant is package-visible or public as needed, update
imports/references accordingly, and run tests to verify no behavioral change.

Comment on lines +43 to +58
private void assertHtmlStrip(String input, String... expectedSubstrings) throws IOException {
Reader reader = new HTMLStripCharFilter(
new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "&#61;", new StringReader(input)),
null
);
StringBuilder sb = new StringBuilder();
char[] buffer = new char[1024];
int len;
while ((len = reader.read(buffer)) != -1) {
sb.append(buffer, 0, len);
}
String output = sb.toString();
for (String expected : expectedSubstrings) {
assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Resource leak: Reader is never closed.

The HTMLStripCharFilter reader created on line 44 is never closed. While this is a test, it's still good practice. Use try-with-resources.

Also, same comment as the processor tests: add negative assertions verifying that HTML tags (e.g., <a ) are absent from the output.

Suggested fix
 private void assertHtmlStrip(String input, String... expectedSubstrings) throws IOException {
-    Reader reader = new HTMLStripCharFilter(
-        new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "&#61;", new StringReader(input)),
-        null
-    );
-    StringBuilder sb = new StringBuilder();
-    char[] buffer = new char[1024];
-    int len;
-    while ((len = reader.read(buffer)) != -1) {
-        sb.append(buffer, 0, len);
+    try (
+        Reader reader = new HTMLStripCharFilter(
+            new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "&#61;", new StringReader(input)),
+            null
+        )
+    ) {
+        StringBuilder sb = new StringBuilder();
+        char[] buffer = new char[1024];
+        int len;
+        while ((len = reader.read(buffer)) != -1) {
+            sb.append(buffer, 0, len);
+        }
+        String output = sb.toString();
+        for (String expected : expectedSubstrings) {
+            assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected));
+        }
+        assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a "));
     }
-    String output = sb.toString();
-    for (String expected : expectedSubstrings) {
-        assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected));
-    }
 }
🤖 Prompt for AI Agents
In
`@modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java`
around lines 43 - 58, The test helper assertHtmlStrip currently opens an
HTMLStripCharFilter Reader (wrapping PatternReplaceCharFilter with
EQUALS_QUOTE_PATTERN) but never closes it; change assertHtmlStrip to use
try-with-resources to ensure the Reader is closed after reading, and after
building the output also add negative assertions that HTML tags (for example
check output does NOT contain "<a ") are absent in addition to the existing
positive contains checks; update references inside the method to use the
try-with-resources block around new HTMLStripCharFilter(...) and perform the
negative assertion(s) on the produced output string.

Comment on lines +52 to +59
public void testHtmlStripWithEqualsAtEndOfAttribute() {
AbstractStringProcessor<String> processor = newProcessor("field", false, "target");
String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here";
String output = processor.process(input);
assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded"));
assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example"));
assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a negative assertion to verify HTML tags are fully stripped.

The test confirms content is preserved but doesn't verify that the <a …> tags themselves are removed from the output. Given that the fix mangles attribute syntax before feeding it to HTMLStripCharFilter, it's important to confirm tags are still recognized and stripped. Also, the assertion for "example" will trivially pass if "example2" is present.

Suggested additions
     assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2"));
+    assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a "));
+    assertFalse("Output should not contain closing tags. Output was: " + output, output.contains("</a>"));
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
public void testHtmlStripWithEqualsAtEndOfAttribute() {
AbstractStringProcessor<String> processor = newProcessor("field", false, "target");
String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here";
String output = processor.process(input);
assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded"));
assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example"));
assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2"));
}
public void testHtmlStripWithEqualsAtEndOfAttribute() {
AbstractStringProcessor<String> processor = newProcessor("field", false, "target");
String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here";
String output = processor.process(input);
assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded"));
assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example"));
assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2"));
assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a "));
assertFalse("Output should not contain closing tags. Output was: " + output, output.contains("</a>"));
}
🤖 Prompt for AI Agents
In
`@modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java`
around lines 52 - 59, In testHtmlStripWithEqualsAtEndOfAttribute
(HtmlStripProcessorTests) add negative assertions to ensure tags are removed and
the "example" check isn't satisfied only by "example2": after invoking
processor.process(input) assert that the output does not contain raw tag markers
(e.g., "<a " and "</a>") to verify HTMLStripCharFilter still strips tags, and
tighten the "example" assertion so it verifies the standalone expected token
(not just a substring that could be provided by "example2") — update assertions
around the processor variable in the test method accordingly.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2026

❌ Gradle check result for dab6096: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sandeshkr419
Copy link
Member

[Search Triage] @Seungmin123 Thanks for raising a PR. Did we try testing the HTML Strip Char filter in Lucene in isolation? Just to validate if this is a bug in Lucene and needs a fix in upstream.

@msfroh
Copy link
Contributor

msfroh commented Feb 11, 2026

@Seungmin123 -- can you please try creating a unit test for the Lucene HTMLStripCharFilter without any OpenSearch code?

In particular, if you clone the Lucene repo and add a failing test case to https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java, then you can open an issue against Lucene and potentially supply the fix there. If the underlying Lucene class has a bug, we should fix it there so that everyone using Lucene gets the benefit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lucene Search Search query, autocomplete ...etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] html_strip character filter issue

3 participants