Skip to content

Conversation

@edsiper
Copy link

@edsiper edsiper commented Aug 19, 2025

Add YYJSON_READ_REPLACE_INVALID_UNICODE and YYJSON_READ_ALLOW_INVALID_SURROGATE flags to enable graceful handling of malformed Unicode in JSON strings.

YYJSON_READ_REPLACE_INVALID_UNICODE replaces all invalid Unicode sequences with U+FFFD replacement character, ensuring output is always valid UTF-8. Handles invalid \uXXXX escapes, unpaired surrogates, invalid UTF-8 bytes, and control characters.

YYJSON_READ_ALLOW_INVALID_SURROGATE permits unpaired surrogate code units in \uXXXX escapes, encoding them as UTF-8.

Key changes

  • Modified read_uni_esc() to accept flags parameter
  • Added replacement logic throughout string parsing paths
  • Replacement flag implicitly enables unicode tolerance flags
  • Added comprehensive test coverage
  • Flags not supported in incremental parsing mode

why ?

  • This enables robust parsing of real-world JSON with encoding errors while maintaining strict standards compliance by default.
  • While we would love to live in strict parsing mode, we have to deal with malformed data, dealing with invalid bytes is times expensive in the caller than handling this directly inside yyjson

Add YYJSON_READ_REPLACE_INVALID_UNICODE and YYJSON_READ_ALLOW_INVALID_SURROGATE flags to enable
graceful handling of malformed Unicode in JSON strings.

YYJSON_READ_REPLACE_INVALID_UNICODE replaces all invalid Unicode sequences with U+FFFD replacement
character, ensuring output is always valid UTF-8. Handles invalid \uXXXX escapes, unpaired surrogates,
invalid UTF-8 bytes, and control characters.

YYJSON_READ_ALLOW_INVALID_SURROGATE permits unpaired surrogate code units in \uXXXX escapes,
encoding them as UTF-8.

Key changes:

- Modified read_uni_esc() to accept flags parameter
- Added replacement logic throughout string parsing paths
- Replacement flag implicitly enables unicode tolerance flags
- Added comprehensive test coverage
- Flags not supported in incremental parsing mode

This enables robust parsing of real-world JSON with encoding
errors while maintaining strict standards compliance by default.

Signed-off-by: Eduardo Silva <[email protected]>
@codecov
Copy link

codecov bot commented Aug 20, 2025

Codecov Report

❌ Patch coverage is 70.93596% with 59 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.79%. Comparing base (706ae61) to head (825aa37).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
src/yyjson.c 70.79% 59 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #227      +/-   ##
==========================================
- Coverage   98.50%   97.79%   -0.72%     
==========================================
  Files           2        2              
  Lines        9040     9255     +215     
==========================================
+ Hits         8905     9051     +146     
- Misses        135      204      +69     
Flag Coverage Δ
unittests 97.79% <70.93%> (-0.72%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ibireme
Copy link
Owner

ibireme commented Aug 20, 2025

Thanks for the PR!

Replacing invalid chars with U+FFFD makes sense — I already support do that on the serialization side.

The main issue is during parsing: the unescaping process is performed in-place, so the unescaped character must not be longer than the original escape sequence. Since U+FFFD takes 3 bytes in UTF-8, it cannot safely replace invalid sequences that are only 1–2 bytes.

For example: "abcdefg", `d` is escaped
    src         
     │          
┌────▼─────────┐
│"abd\u0064efg"│
└────▲─────────┘
     │          
    dst         
          src   
           │    
┌──────────▼───┐
│"abcdu0064efg"│
└─────▲────────┘
      │         
     dst        
             src
              │ 
┌─────────────▼┐
│"abcdefg 4efg"│
└────────▲─────┘
         │      
        dst     

I'd suggest introduce a new string subtype (e.g. YYJSON_SUBTYPE_UNIERR) to indicate strings containing invalid Unicode. This would be similar to the existing YYJSON_SUBTYPE_NOESC, but it would still require the caller to detect and handle such cases.

edsiper added a commit to fluent/fluent-bit that referenced this pull request Aug 20, 2025
Signed-off-by: Eduardo Silva <[email protected]>
@edsiper
Copy link
Author

edsiper commented Sep 18, 2025

@ibireme thanks for your feedback. I have updated the patch with the suggestions (majority of the changes were done with Codex)

edsiper added a commit to fluent/fluent-bit that referenced this pull request Sep 18, 2025
edsiper added a commit to fluent/fluent-bit that referenced this pull request Sep 19, 2025
@ibireme
Copy link
Owner

ibireme commented Sep 21, 2025

I thought a bit more about this and I think we should split bad strings into three categories:

  1. Invalid UTF-8 bytes
    Since yyjson decodes in-place, we can't always replace them with U+FFFD. So REPLACE_INVALID_UNICODE doesn't really help here.

  2. Invalid single-character escapes
    Right now we just error on things like \Q.
    For reference, JSON5 just drops the \ and outputs Q.

  3. Invalid \uXXXX sequences
    These are trickier since the error could come from many places.
    The current PR sometimes gives surprising results, e.g. \u😀 -> \uŸ˜€, \uA_CD -> \uACD.

The original goal was to turn malformed strings into valid UTF-8 internally. But with the current in-place parsing design, callers still need to check for invalid strings using SUBTYPE_UNIERR.

So my suggestion is keep using the existing READ_ALLOW_INVALID_UNICODE, just mark invalid strings with SUBTYPE_UNIERR.
Add a new flag READ_ALLOW_INVALID_ESCAPE to handle all bad escapes (either keep them as-is or just drop the \).

What do you think?

SagiROosto pushed a commit to AnyVisionltd/fluent-bit that referenced this pull request Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants