-
Notifications
You must be signed in to change notification settings - Fork 303
Add Unicode replacement and surrogate tolerance flags #227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Add YYJSON_READ_REPLACE_INVALID_UNICODE and YYJSON_READ_ALLOW_INVALID_SURROGATE flags to enable graceful handling of malformed Unicode in JSON strings. YYJSON_READ_REPLACE_INVALID_UNICODE replaces all invalid Unicode sequences with U+FFFD replacement character, ensuring output is always valid UTF-8. Handles invalid \uXXXX escapes, unpaired surrogates, invalid UTF-8 bytes, and control characters. YYJSON_READ_ALLOW_INVALID_SURROGATE permits unpaired surrogate code units in \uXXXX escapes, encoding them as UTF-8. Key changes: - Modified read_uni_esc() to accept flags parameter - Added replacement logic throughout string parsing paths - Replacement flag implicitly enables unicode tolerance flags - Added comprehensive test coverage - Flags not supported in incremental parsing mode This enables robust parsing of real-world JSON with encoding errors while maintaining strict standards compliance by default. Signed-off-by: Eduardo Silva <[email protected]>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #227 +/- ##
==========================================
- Coverage 98.50% 97.79% -0.72%
==========================================
Files 2 2
Lines 9040 9255 +215
==========================================
+ Hits 8905 9051 +146
- Misses 135 204 +69
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks for the PR! Replacing invalid chars with U+FFFD makes sense — I already support do that on the serialization side. The main issue is during parsing: the unescaping process is performed in-place, so the unescaped character must not be longer than the original escape sequence. Since U+FFFD takes 3 bytes in UTF-8, it cannot safely replace invalid sequences that are only 1–2 bytes. I'd suggest introduce a new string subtype (e.g. YYJSON_SUBTYPE_UNIERR) to indicate strings containing invalid Unicode. This would be similar to the existing |
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: Eduardo Silva <[email protected]>
|
@ibireme thanks for your feedback. I have updated the patch with the suggestions (majority of the changes were done with Codex) |
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <[email protected]>
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <[email protected]>
Signed-off-by: Eduardo Silva <[email protected]>
|
I thought a bit more about this and I think we should split bad strings into three categories:
The original goal was to turn malformed strings into valid UTF-8 internally. But with the current in-place parsing design, callers still need to check for invalid strings using So my suggestion is keep using the existing What do you think? |
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <[email protected]>
Add
YYJSON_READ_REPLACE_INVALID_UNICODEandYYJSON_READ_ALLOW_INVALID_SURROGATEflags to enable graceful handling of malformed Unicode in JSON strings.YYJSON_READ_REPLACE_INVALID_UNICODEreplaces all invalid Unicode sequences withU+FFFDreplacement character, ensuring output is always valid UTF-8. Handles invalid \uXXXX escapes, unpaired surrogates, invalid UTF-8 bytes, and control characters.YYJSON_READ_ALLOW_INVALID_SURROGATEpermits unpaired surrogate code units in \uXXXX escapes, encoding them as UTF-8.Key changes
why ?