Fix content dropping in html_strip character filter (#20554)#20575
Fix content dropping in html_strip character filter (#20554)#20575Seungmin123 wants to merge 1 commit intoopensearch-project:mainfrom
Conversation
…ct#20554) The html_strip character filter (based on Lucene's HTMLStripCharFilter) had a bug where it would drop content if an HTML attribute value ended with an equals sign (=). This fix pre-processes the input to escape the problematic equals sign into an HTML entity (&opensearch-project#61;) when followed by a quote or space. Since HTMLStripCharFilter decodes entities in its output, this is transparent to the user while preventing the parsing bug. Signed-off-by: Seungmin123 <dltmdals147@gmail.com>
📝 WalkthroughWalkthroughA bug fix for the html_strip character filter that drops content when HTML attribute values end with an equals sign. The fix introduces a PatternReplaceCharFilter that preprocesses input by replacing equals signs followed by quotes or whitespace with the HTML entity "=" before stripping HTML. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In
`@modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java`:
- Line 52: Duplicate regex constant EQUALS_QUOTE_PATTERN is declared in
HtmlStripCharFilterFactory, HtmlStripProcessor, and HtmlStripCharFilterTests;
create a single shared public static final Pattern (e.g., EQUALS_QUOTE_PATTERN)
in a new or existing utility class (e.g., HtmlStripUtils or AnalysisCommonUtils)
and replace the local definitions in HtmlStripCharFilterFactory,
HtmlStripProcessor, and HtmlStripCharFilterTests to reference that shared
constant; ensure the new constant is package-visible or public as needed, update
imports/references accordingly, and run tests to verify no behavioral change.
In
`@modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java`:
- Around line 43-58: The test helper assertHtmlStrip currently opens an
HTMLStripCharFilter Reader (wrapping PatternReplaceCharFilter with
EQUALS_QUOTE_PATTERN) but never closes it; change assertHtmlStrip to use
try-with-resources to ensure the Reader is closed after reading, and after
building the output also add negative assertions that HTML tags (for example
check output does NOT contain "<a ") are absent in addition to the existing
positive contains checks; update references inside the method to use the
try-with-resources block around new HTMLStripCharFilter(...) and perform the
negative assertion(s) on the produced output string.
In
`@modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java`:
- Around line 52-59: In testHtmlStripWithEqualsAtEndOfAttribute
(HtmlStripProcessorTests) add negative assertions to ensure tags are removed and
the "example" check isn't satisfied only by "example2": after invoking
processor.process(input) assert that the output does not contain raw tag markers
(e.g., "<a " and "</a>") to verify HTMLStripCharFilter still strips tags, and
tighten the "example" assertion so it verifies the standalone expected token
(not just a substring that could be provided by "example2") — update assertions
around the processor variable in the test method accordingly.
🧹 Nitpick comments (2)
modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java (1)
47-47: DuplicatedEQUALS_QUOTE_PATTERN— consider a shared location.This pattern is identical to the one in
HtmlStripCharFilterFactory. If the regex ever needs to be refined (see earlier comment about its breadth), both must be updated in lockstep. Consider extracting it to a shared utility class or constant.modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java (1)
21-21: Third copy ofEQUALS_QUOTE_PATTERN.This is now the third location where this pattern is defined (also in
HtmlStripCharFilterFactoryandHtmlStripProcessor). For the test, consider importing or referencing the production constant (e.g., make it package-visible in the factory) rather than duplicating. At minimum, if it stays duplicated, add a comment pointing to the canonical definition.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
CHANGELOG.mdmodules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.javamodules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.javamodules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.javamodules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: gradle-check
- GitHub Check: Mend Security Check
🔇 Additional comments (4)
modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java (1)
64-67: LGTM on the overall approach of wrapping withPatternReplaceCharFilter.The preprocessing-then-strip strategy is a reasonable workaround for the upstream Lucene
HTMLStripCharFilterparsing bug without requiring a Lucene fork. The entity=round-trips correctly throughHTMLStripCharFilter.CHANGELOG.md (1)
16-16: LGTM!Changelog entry is clear, correctly references the issue, and is in the right section.
modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java (1)
54-74: LGTM on the filter composition and resource handling.The
try-with-resourceson the outerHTMLStripCharFilterwill cascade-close the innerPatternReplaceCharFilterandStringReader. The early-exit shortcut on line 56 remains valid since the=workaround is only relevant inside HTML tags.modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java (1)
23-41: Good test coverage across edge cases.The four tests cover the key scenarios: equals at end of attribute, equals mid-value, space after equals, and pre-encoded entity. This provides solid regression coverage for the bug fix.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
|
|
||
| public class HtmlStripCharFilterFactory extends AbstractCharFilterFactory { | ||
| private final Set<String> escapedTags; | ||
| private static final Pattern EQUALS_QUOTE_PATTERN = Pattern.compile("=(?=[\\\"\\s])"); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for EQUALS_QUOTE_PATTERN usage and duplication
echo "=== Searching for EQUALS_QUOTE_PATTERN ==="
rg -n "EQUALS_QUOTE_PATTERN" --type java
echo ""
echo "=== Check for HtmlStripProcessor.java ==="
find . -name "HtmlStripProcessor.java" -type f
echo ""
echo "=== Check for test files related to HtmlStrip ==="
find . -name "*HtmlStrip*Test*.java" -o -name "*HtmlStripCharFilter*Test.java"
echo ""
echo "=== Check for Pattern.compile.*=.* patterns in the codebase ==="
rg -n 'Pattern\.compile.*=.*\[' --type java | head -20Repository: opensearch-project/OpenSearch
Length of output: 3978
🏁 Script executed:
#!/bin/bash
# Read the HtmlStripCharFilterFactory.java to understand context
echo "=== Reading HtmlStripCharFilterFactory.java context ==="
find . -name "HtmlStripCharFilterFactory.java" -type f | head -1 | xargs cat -n | head -80Repository: opensearch-project/OpenSearch
Length of output: 3088
🏁 Script executed:
#!/bin/bash
# Check if HtmlStripProcessor.java exists and has the same pattern
echo "=== Checking HtmlStripProcessor.java for pattern ==="
find . -name "HtmlStripProcessor.java" -type f | xargs grep -n "EQUALS_QUOTE_PATTERN\|Pattern.compile.*=.*\[" 2>/dev/null || echo "File not found or pattern not present"Repository: opensearch-project/OpenSearch
Length of output: 323
🏁 Script executed:
#!/bin/bash
# Read the test file to see what is being tested
echo "=== Reading HtmlStripCharFilterTests.java ==="
wc -l ./modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java
echo ""
cat -n ./modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.javaRepository: opensearch-project/OpenSearch
Length of output: 3278
🏁 Script executed:
#!/bin/bash
# Check HtmlStripProcessor.java and HtmlStripProcessorTests.java
echo "=== Reading HtmlStripProcessor.java relevant sections ==="
cat -n ./modules/ingest-common/src/main/java/org/opensearch/ingest/common/HtmlStripProcessor.java | head -100
echo ""
echo "=== Reading HtmlStripProcessorTests.java test cases ==="
wc -l ./modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java
cat -n ./modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.javaRepository: opensearch-project/OpenSearch
Length of output: 7037
Extract duplicate EQUALS_QUOTE_PATTERN to a shared utility constant.
This pattern is currently defined identically in three places: HtmlStripCharFilterFactory.java, HtmlStripProcessor.java, and HtmlStripCharFilterTests.java. The duplication creates maintenance risk if the regex ever needs adjustment. Tests in HtmlStripCharFilterTests.java confirm the pattern works correctly with HTML attributes (e.g., href="..." with embedded = characters), so extracting it to a shared utility will ensure consistency across all uses.
🤖 Prompt for AI Agents
In
`@modules/analysis-common/src/main/java/org/opensearch/analysis/common/HtmlStripCharFilterFactory.java`
at line 52, Duplicate regex constant EQUALS_QUOTE_PATTERN is declared in
HtmlStripCharFilterFactory, HtmlStripProcessor, and HtmlStripCharFilterTests;
create a single shared public static final Pattern (e.g., EQUALS_QUOTE_PATTERN)
in a new or existing utility class (e.g., HtmlStripUtils or AnalysisCommonUtils)
and replace the local definitions in HtmlStripCharFilterFactory,
HtmlStripProcessor, and HtmlStripCharFilterTests to reference that shared
constant; ensure the new constant is package-visible or public as needed, update
imports/references accordingly, and run tests to verify no behavioral change.
| private void assertHtmlStrip(String input, String... expectedSubstrings) throws IOException { | ||
| Reader reader = new HTMLStripCharFilter( | ||
| new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "=", new StringReader(input)), | ||
| null | ||
| ); | ||
| StringBuilder sb = new StringBuilder(); | ||
| char[] buffer = new char[1024]; | ||
| int len; | ||
| while ((len = reader.read(buffer)) != -1) { | ||
| sb.append(buffer, 0, len); | ||
| } | ||
| String output = sb.toString(); | ||
| for (String expected : expectedSubstrings) { | ||
| assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected)); | ||
| } | ||
| } |
There was a problem hiding this comment.
Resource leak: Reader is never closed.
The HTMLStripCharFilter reader created on line 44 is never closed. While this is a test, it's still good practice. Use try-with-resources.
Also, same comment as the processor tests: add negative assertions verifying that HTML tags (e.g., <a ) are absent from the output.
Suggested fix
private void assertHtmlStrip(String input, String... expectedSubstrings) throws IOException {
- Reader reader = new HTMLStripCharFilter(
- new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "=", new StringReader(input)),
- null
- );
- StringBuilder sb = new StringBuilder();
- char[] buffer = new char[1024];
- int len;
- while ((len = reader.read(buffer)) != -1) {
- sb.append(buffer, 0, len);
+ try (
+ Reader reader = new HTMLStripCharFilter(
+ new PatternReplaceCharFilter(EQUALS_QUOTE_PATTERN, "=", new StringReader(input)),
+ null
+ )
+ ) {
+ StringBuilder sb = new StringBuilder();
+ char[] buffer = new char[1024];
+ int len;
+ while ((len = reader.read(buffer)) != -1) {
+ sb.append(buffer, 0, len);
+ }
+ String output = sb.toString();
+ for (String expected : expectedSubstrings) {
+ assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected));
+ }
+ assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a "));
}
- String output = sb.toString();
- for (String expected : expectedSubstrings) {
- assertTrue("Output should contain '" + expected + "'. Output was: " + output, output.contains(expected));
- }
}🤖 Prompt for AI Agents
In
`@modules/analysis-common/src/test/java/org/opensearch/analysis/common/HtmlStripCharFilterTests.java`
around lines 43 - 58, The test helper assertHtmlStrip currently opens an
HTMLStripCharFilter Reader (wrapping PatternReplaceCharFilter with
EQUALS_QUOTE_PATTERN) but never closes it; change assertHtmlStrip to use
try-with-resources to ensure the Reader is closed after reading, and after
building the output also add negative assertions that HTML tags (for example
check output does NOT contain "<a ") are absent in addition to the existing
positive contains checks; update references inside the method to use the
try-with-resources block around new HTMLStripCharFilter(...) and perform the
negative assertion(s) on the produced output string.
| public void testHtmlStripWithEqualsAtEndOfAttribute() { | ||
| AbstractStringProcessor<String> processor = newProcessor("field", false, "target"); | ||
| String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here"; | ||
| String output = processor.process(input); | ||
| assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded")); | ||
| assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example")); | ||
| assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2")); | ||
| } |
There was a problem hiding this comment.
Add a negative assertion to verify HTML tags are fully stripped.
The test confirms content is preserved but doesn't verify that the <a …> tags themselves are removed from the output. Given that the fix mangles attribute syntax before feeding it to HTMLStripCharFilter, it's important to confirm tags are still recognized and stripped. Also, the assertion for "example" will trivially pass if "example2" is present.
Suggested additions
assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2"));
+ assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a "));
+ assertFalse("Output should not contain closing tags. Output was: " + output, output.contains("</a>"));
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| public void testHtmlStripWithEqualsAtEndOfAttribute() { | |
| AbstractStringProcessor<String> processor = newProcessor("field", false, "target"); | |
| String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here"; | |
| String output = processor.process(input); | |
| assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded")); | |
| assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example")); | |
| assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2")); | |
| } | |
| public void testHtmlStripWithEqualsAtEndOfAttribute() { | |
| AbstractStringProcessor<String> processor = newProcessor("field", false, "target"); | |
| String input = "<a href=\"https://www.example.com/?test=\">example</a> this gets discarded<a href=\"https://www.example.com/\">example2</a> continues here"; | |
| String output = processor.process(input); | |
| assertTrue("Output should contain 'this gets discarded'. Output was: " + output, output.contains("this gets discarded")); | |
| assertTrue("Output should contain 'example'. Output was: " + output, output.contains("example")); | |
| assertTrue("Output should contain 'example2'. Output was: " + output, output.contains("example2")); | |
| assertFalse("Output should not contain HTML tags. Output was: " + output, output.contains("<a ")); | |
| assertFalse("Output should not contain closing tags. Output was: " + output, output.contains("</a>")); | |
| } |
🤖 Prompt for AI Agents
In
`@modules/ingest-common/src/test/java/org/opensearch/ingest/common/HtmlStripProcessorTests.java`
around lines 52 - 59, In testHtmlStripWithEqualsAtEndOfAttribute
(HtmlStripProcessorTests) add negative assertions to ensure tags are removed and
the "example" check isn't satisfied only by "example2": after invoking
processor.process(input) assert that the output does not contain raw tag markers
(e.g., "<a " and "</a>") to verify HTMLStripCharFilter still strips tags, and
tighten the "example" assertion so it verifies the standalone expected token
(not just a substring that could be provided by "example2") — update assertions
around the processor variable in the test method accordingly.
|
❌ Gradle check result for dab6096: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
[Search Triage] @Seungmin123 Thanks for raising a PR. Did we try testing the HTML Strip Char filter in Lucene in isolation? Just to validate if this is a bug in Lucene and needs a fix in upstream. |
|
@Seungmin123 -- can you please try creating a unit test for the Lucene In particular, if you clone the Lucene repo and add a failing test case to https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java, then you can open an issue against Lucene and potentially supply the fix there. If the underlying Lucene class has a bug, we should fix it there so that everyone using Lucene gets the benefit. |
Description
This PR fixes a bug in the
html_stripcharacter filter (and ingest processor) where content was being incorrectly dropped when an HTML attribute value ended with an equals sign (=).The root cause is a parsing issue in Lucene's
HTMLStripCharFilter. To resolve this without forking the Lucene class, this change introduces a pre-processing step usingPatternReplaceCharFilter. It identifies the problematic pattern (an equals sign followed by a double quote or whitespace) and escapes it to the HTML entity=. SinceHTMLStripCharFilterdecodes entities in its output text, this workaround is transparent to users while ensuring correct HTML stripping.Related Issues
Resolves #20554
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.