snappy: fix intermittent decompression failures for raw snappy data#11550
thewillyhuman wants to merge 2 commits into fluent:master
Conversation
The framed/streaming snappy format starts with a stream identifier frame whose first byte is 0xFF. The previous check only tested this single byte, but raw/block snappy data can also have 0xFF as its first byte when the varint-encoded uncompressed length happens to satisfy (size % 128 == 127 && size >= 128). This caused intermittent decompression failures for Prometheus Remote Write payloads (which use raw snappy): roughly 1 in 128 requests would be misrouted into the framed parser and rejected.

Validate the full 10-byte stream identifier header (0xFF + 3-byte length 0x000006 + "sNaPpY") instead of just the first byte.

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
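The full-header check described above amounts to a straight 10-byte comparison. A minimal sketch, using illustrative names rather than the actual flb_snappy.c symbols:

```c
#include <stddef.h>
#include <string.h>

/* The framed-snappy stream identifier is exactly 10 bytes:
 * 0xFF (frame type), 0x06 0x00 0x00 (little-endian 24-bit length = 6),
 * then the 6 ASCII bytes "sNaPpY". Comparing all 10 bytes avoids
 * misclassifying raw snappy whose length varint begins with 0xFF. */
int is_snappy_stream_identifier(const unsigned char *buf, size_t len)
{
    static const unsigned char header[10] = {
        0xFF, 0x06, 0x00, 0x00, 's', 'N', 'a', 'P', 'p', 'Y'
    };

    if (len < sizeof(header)) {
        return 0;
    }
    return memcmp(buf, header, sizeof(header)) == 0;
}
```

A raw snappy payload of uncompressed size 255 starts with the varint bytes 0xFF 0x01, so it fails this comparison at the second byte instead of being routed to the framed parser.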
📝 Walkthrough
The pull request enhances frame detection in the snappy compression module by adding a string-header include and replacing the single-byte frame check with a full 10-byte header validation. An explicit variable reset is also removed, relying on the initialization at declaration instead.
Actionable comments posted: 1
In `src/flb_snappy.c`, around lines 157-163: the conditional that falls back to the raw path can call flb_snappy_uncompress(...) before the output pointers are checked. Validate out_data and out_len ahead of that fallback branch (the same conditional that checks FLB_SNAPPY_STREAM_IDENTIFIER_STRING and the 0xFF/0x06/0x00/0x00 header bytes) and return -1 if either is NULL, rather than invoking flb_snappy_uncompress with NULL outputs.
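The suggested ordering can be sketched as below. The names are hypothetical and the decode routine is a stand-in stub, not the real decompression path:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real framed/raw decompression routines: it simply
 * copies the input into a freshly allocated output buffer. */
static int decode_stub(const char *in, size_t in_len,
                       char **out_data, size_t *out_len)
{
    *out_data = malloc(in_len);
    if (*out_data == NULL) {
        return -1;
    }
    memcpy(*out_data, in, in_len);
    *out_len = in_len;
    return 0;
}

/* Validate the output pointers first, so neither the framed path nor
 * the raw-snappy fallback can ever run with NULL outputs. */
int uncompress_any(const char *in, size_t in_len,
                   char **out_data, size_t *out_len)
{
    if (out_data == NULL || out_len == NULL) {
        return -1;  /* bail out before any decoding is attempted */
    }
    /* header detection and dispatch would happen here */
    return decode_stub(in, in_len, out_data, out_len);
}
```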
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e9cb70b02d
@@ -282,7 +302,6 @@ int flb_snappy_uncompress_framed_data(char *in_data, size_t in_len,
    }

    aggregated_data_buffer = NULL;
Prevent leaking output buffer when framed parse fails
Removing the aggregated_data_length = 0 reset fixes multi-chunk success cases, but it also means aggregated_data_length can stay non-zero when a later frame sets result != 0 (for example, checksum mismatch after one valid chunk). In that case the non-fast path still allocates aggregated_data_buffer and returns an error, and current callers treat nonzero returns as failure without freeing *out_data, so malformed multi-frame snappy payloads can now leak memory per request. Please gate allocation on result == 0 (or clear the length on error) before setting *out_data.
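One way to apply the gating the comment asks for, sketched with illustrative names rather than the actual flb_snappy.c code: allocate and publish the output only when the parse loop finished cleanly, and clear the outputs on error so a partially accumulated length is never visible to callers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Finalize the multi-chunk path: only allocate the aggregation buffer
 * when every frame parsed successfully (result == 0); on error, clear
 * the outputs so a stale length cannot leak to the caller. */
int finish_aggregation(int result, size_t aggregated_data_length,
                       char **out_data, size_t *out_len)
{
    *out_data = NULL;
    *out_len = 0;

    if (result != 0) {
        return -1;  /* e.g. checksum mismatch after one valid chunk */
    }

    *out_data = malloc(aggregated_data_length);
    if (*out_data == NULL) {
        return -1;
    }
    /* ...copy the decoded chunks into *out_data here... */
    *out_len = aggregated_data_length;
    return 0;
}
```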
I have updated the commit to take this case into account. This is a completely fair point; we had discussed it internally, and this review confirmed it.
e9cb70b to eb3ded1
After the frame parsing loop accumulates the total decompressed length in aggregated_data_length, the variable was incorrectly reset to zero before being used for the output buffer allocation. This caused the multi-chunk code path to skip allocation entirely, producing a NULL output buffer and zero length with a success return code, silently dropping all decompressed data.

Remove the erroneous reset so the accumulated length is preserved and used for the aggregation buffer allocation. Also gate the allocation on result == 0 and free any allocated buffer on error, preventing a memory leak when a multi-chunk stream fails mid-way (e.g. checksum mismatch after one valid chunk).

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
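The effect of the stray reset can be reduced to a few lines. This is an illustrative reproduction of the control flow, not the actual flb_snappy.c code:

```c
#include <stddef.h>

/* Accumulate per-chunk decompressed lengths the way the parsing loop
 * does, then optionally apply the erroneous reset. The return value is
 * what would be passed to the aggregation-buffer allocation. */
size_t aggregated_alloc_size(const size_t *chunk_lens, size_t n,
                             int buggy_reset)
{
    size_t aggregated_data_length = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        aggregated_data_length += chunk_lens[i];
    }

    if (buggy_reset) {
        aggregated_data_length = 0;  /* the line removed by this PR */
    }
    return aggregated_data_length;   /* size used for the allocation */
}
```

With the reset in place the multi-chunk allocation size is always zero, which is exactly the silent-drop behavior described above.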
eb3ded1 to d1b8f03
@edsiper, @cosmo0920 Is there anything extra I must do before the PR gets its first review? It has been almost a week with no interaction.
Summary
Problem
We observed that when sending metrics to Fluent Bit via the Prometheus Remote Write input plugin, some requests would randomly fail with:
All metrics were confirmed as being sent correctly to Fluent Bit, yet not all of them made it through. The failures were intermittent and appeared random — the same configuration would work most of the time but occasionally drop data.
Root Cause
Bug 1: Raw snappy misidentified as framed (the intermittent failure)
Prometheus Remote Write uses raw/block snappy compression (as per the spec). Fluent Bit's `flb_snappy_uncompress_framed_data()` tries to detect whether incoming data is raw or framed by checking if the first byte equals `0xFF`, the snappy stream identifier frame type.

The problem is that raw snappy data also starts with a varint-encoded uncompressed length, and a varint byte of `0xFF` naturally occurs whenever the uncompressed payload size satisfies `size >= 128 && size % 128 == 127` (e.g. sizes 255, 383, 511, 639, ...). When this happens (~1 in 128 payloads, depending on data size), the raw data is incorrectly routed into the framed parser, which then fails trying to match the `"sNaPpY"` stream identifier against compressed binary data.

This explains why the issue is intermittent: it depends entirely on the serialized protobuf payload size, which varies with the number and shape of metrics in each batch.
Bug 2: Multi-chunk framed data silently dropped
In the framed snappy code path, after the parsing loop accumulates the total decompressed length across all chunks, this code runs:

The `aggregated_data_length = 0` zeroes out the total before it is used for the output buffer allocation. For multi-chunk payloads, no buffer is allocated and all decompressed chunk data is discarded, while the function still returns success (0). This is a silent data loss bug.

Fix
Commit 1 [7800ae2]: Full stream identifier validation
Replace the single first-byte check with validation of the complete 10-byte snappy framed stream identifier header:

- `0xFF` (stream identifier frame type)
- `0x06 0x00 0x00` (little-endian 24-bit frame length, 6)
- `sNaPpY` (6-byte identifier string)

This eliminates false positives from raw snappy data whose varint happens to start with `0xFF`.

Commit 2 [d1b8f03]: Remove erroneous length reset
Remove the `aggregated_data_length = 0` line so the accumulated length from the parsing loop is preserved and used for buffer allocation in the multi-chunk path.

Test plan
- ctest (100% pass rate)
- v4.1.0