
snappy: fix intermittent decompression failures for raw snappy data #11550

Open
thewillyhuman wants to merge 2 commits into fluent:master from thewillyhuman:gfacundo/fix-snappy

Conversation

@thewillyhuman

@thewillyhuman thewillyhuman commented Mar 13, 2026

Summary

  • Fix raw/block snappy data being misidentified as framed/streaming format, causing intermittent decompression failures
  • Fix multi-chunk framed snappy payloads being silently dropped due to an accumulated length reset

Problem

We observed that when sending metrics to Fluent Bit via the Prometheus Remote Write input plugin, some requests would randomly fail with:

[error] [http snappy] decompression failed

All metrics were confirmed as being sent correctly to Fluent Bit, yet not all of them made it through. The failures were intermittent and appeared random — the same configuration would work most of the time but occasionally drop data.

Root Cause

Bug 1: Raw snappy misidentified as framed (the intermittent failure)

Prometheus Remote Write uses raw/block snappy compression (as per the spec). Fluent Bit's flb_snappy_uncompress_framed_data() tries to detect whether incoming data is raw or framed by checking if the first byte equals 0xFF — the snappy stream identifier frame type.

The problem is that raw snappy data also starts with a varint-encoded uncompressed length, and a varint byte of 0xFF naturally occurs whenever the uncompressed payload size satisfies size >= 128 && size % 128 == 127 (e.g. sizes 255, 383, 511, 639, ...).

When this happens (~1 in 128 payloads depending on data size), the raw data is incorrectly routed into the framed parser, which then fails trying to match the "sNaPpY" stream identifier against compressed binary data.

This explains why the issue is intermittent — it depends entirely on the serialized protobuf payload size, which varies with the number and shape of metrics in each batch.
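The varint collision described above can be sketched in a few lines of C. This is illustrative only, not Fluent Bit code; `varint_first_byte` is a hypothetical helper:

```c
#include <stdint.h>

/* Illustrative helper (not part of Fluent Bit): compute the first byte of
 * the varint encoding of an unsigned length. For len >= 128 the byte
 * carries the low 7 bits plus the continuation bit 0x80, so it equals
 * 0xFF exactly when len % 128 == 127 -- colliding with the framed snappy
 * stream identifier byte. */
static uint8_t varint_first_byte(uint32_t len)
{
    if (len < 128) {
        return (uint8_t) len;               /* single-byte varint */
    }
    return (uint8_t) ((len & 0x7F) | 0x80); /* continuation bit set */
}
```

For example, uncompressed sizes 255, 383, and 511 all produce a first byte of `0xFF`, which the old single-byte check misread as the start of a framed stream.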

Bug 2: Multi-chunk framed data silently dropped

In the framed snappy code path, after the parsing loop accumulates the total decompressed length across all chunks, this code runs:

aggregated_data_buffer = NULL;
aggregated_data_length = 0;   // <-- bug: resets the accumulated length

The aggregated_data_length = 0 zeroes out the total before it is used for the output buffer allocation. For multi-chunk payloads, no buffer is allocated and all decompressed chunk data is discarded — while the function still returns success (0). This is a silent data loss bug.

Fix

Commit 1 [7800ae2]: Full stream identifier validation

Replace the single first-byte check with validation of the complete 10-byte snappy framed stream identifier header:

| Offset | Value | Meaning |
| --- | --- | --- |
| 0 | `0xFF` | Stream identifier chunk type |
| 1-3 | `0x06 0x00 0x00` | Length = 6 (little-endian 24-bit) |
| 4-9 | `sNaPpY` | Magic string |

This eliminates false positives from raw snappy data whose varint happens to start with 0xFF.
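A minimal sketch of this validation approach (illustrative names, not the exact Fluent Bit identifiers; the actual implementation lives in `src/flb_snappy.c`):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the full-header check described above: classify input as
 * framed snappy only if the entire 10-byte stream identifier matches. */
static int looks_like_framed_snappy(const uint8_t *data, size_t len)
{
    static const uint8_t header[10] = {
        0xFF,             /* stream identifier chunk type */
        0x06, 0x00, 0x00, /* 24-bit little-endian length = 6 */
        's', 'N', 'a', 'P', 'p', 'Y'
    };

    return len >= sizeof(header) &&
           memcmp(data, header, sizeof(header)) == 0;
}
```

Raw snappy data whose varint happens to start with `0xFF` will almost surely fail the full comparison and correctly fall through to the raw decompression path.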

Commit 2 [d1b8f03]: Remove erroneous length reset

Remove the aggregated_data_length = 0 line so the accumulated length from the parsing loop is preserved and used for buffer allocation in the multi-chunk path.

Test plan

  • Fluent Bit builds with no errors
  • All 80 internal tests pass (ctest — 100% pass rate)
  • Both patches apply cleanly on top of v4.1.0
  • Verified with Prometheus Remote Write workload that previously triggered ~1-in-128 failures

Summary by CodeRabbit

Bug Fixes

  • Enhanced detection of framed data format to ensure accurate identification and correct processing of compressed data streams, preventing potential misinterpretation of input formats.

The framed/streaming snappy format starts with a stream identifier
frame whose first byte is 0xFF. The previous check only tested this
single byte, but raw/block snappy data can also have 0xFF as its
first byte when the varint-encoded uncompressed length happens to
satisfy (size % 128 == 127 && size >= 128).

This caused intermittent decompression failures for Prometheus
Remote Write payloads (which use raw snappy) — roughly 1 in 128
requests would be misrouted into the framed parser and rejected.

Validate the full 10-byte stream identifier header (0xFF + 3-byte
length 0x000006 + "sNaPpY") instead of just the first byte.

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

The pull request enhances frame detection in the snappy compression module by adding a `string.h` inclusion and replacing a single-byte frame check with a robust 10-byte header validation. An explicit variable reset is also removed, relying on the initialization at declaration instead.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Snappy Frame Detection Enhancement** `src/flb_snappy.c` | Adds `string.h` inclusion; replaces the simple frame-type check with robust framed-data detection that validates a full 10-byte header instead of a single byte; removes the explicit reset of `aggregated_data_length`, relying on initialization. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A frame so fine, ten bytes of grace,
No single check could set the pace,
Now snappy streams are read just right,
With headers verified, pixel bright!
String.h joins the party here,
Making data flows crystal clear! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main fix: addressing intermittent decompression failures for raw snappy data by improving framed data detection. |
| Description check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


Inline comments:
In `@src/flb_snappy.c`:
- Around line 157-163: The conditional that falls back to the raw path can call
flb_snappy_uncompress(...) before checking output pointers, so move the
validation of out_data and out_len to precede that fallback branch and return -1
if either is NULL; specifically, ensure before calling flb_snappy_uncompress (in
the same conditional that checks FLB_SNAPPY_STREAM_IDENTIFIER_STRING and the
0xFF/0x06/0x00/0x00 header bytes) you validate out_data and out_len and bail out
with -1 rather than invoking flb_snappy_uncompress with NULL outputs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ad137c9-e314-4556-9b2e-80eb348e351e

📥 Commits

Reviewing files that changed from the base of the PR and between acad02a and e9cb70b.

📒 Files selected for processing (1)
  • src/flb_snappy.c


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9cb70b02d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@@ -282,7 +302,6 @@ int flb_snappy_uncompress_framed_data(char *in_data, size_t in_len,
}

aggregated_data_buffer = NULL;


P2 Badge Prevent leaking output buffer when framed parse fails

Removing the aggregated_data_length = 0 reset fixes multi-chunk success cases, but it also means aggregated_data_length can stay non-zero when a later frame sets result != 0 (for example, checksum mismatch after one valid chunk). In that case the non-fast path still allocates aggregated_data_buffer and returns an error, and current callers treat nonzero returns as failure without freeing *out_data, so malformed multi-frame snappy payloads can now leak memory per request. Please gate allocation on result == 0 (or clear the length on error) before setting *out_data.
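The cleanup Codex suggests can be sketched in isolation. This is a simplified, hypothetical helper (`assemble_output` is not a real Fluent Bit function; the real logic sits in `flb_snappy_uncompress_framed_data`):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of the suggested error handling: allocate the output
 * buffer only when the frame loop succeeded, and hand the caller NULL on
 * failure, so a malformed multi-frame payload never leaks an allocation. */
static int assemble_output(int parse_result, const char *aggregated,
                           size_t aggregated_len,
                           char **out_data, size_t *out_len)
{
    char *buf;

    *out_data = NULL;
    *out_len = 0;

    if (parse_result != 0 || aggregated_len == 0) {
        return -1; /* nothing allocated, nothing to leak */
    }

    buf = malloc(aggregated_len);
    if (buf == NULL) {
        return -1;
    }
    memcpy(buf, aggregated, aggregated_len);

    *out_data = buf;
    *out_len = aggregated_len;
    return 0;
}
```

Because callers treat a nonzero return as failure without freeing `*out_data`, gating the allocation on success (or clearing the output on error) is what prevents the per-request leak.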


Author


I have updated the commit to take this case into account. This is a completely fair point; we had discussed it internally, and this review confirmed it.

@thewillyhuman
Author

thewillyhuman commented Mar 13, 2026

The following image shows the metrics passing through Fluent Bit before and after the patch. It is clear that before the patch, metrics were being silently dropped because of the described error.

Screenshot 2026-03-13 at 16 53 43

After the frame parsing loop accumulates the total decompressed
length in aggregated_data_length, the variable was incorrectly
reset to zero before being used for the output buffer allocation.
This caused the multi-chunk code path to skip allocation entirely,
producing a NULL output buffer and zero length with a success
return code — silently dropping all decompressed data.

Remove the erroneous reset so the accumulated length is preserved
and used for the aggregation buffer allocation. Also gate the
allocation on result == 0 and free any allocated buffer on error,
preventing a memory leak when a multi-chunk stream fails mid-way
(e.g. checksum mismatch after one valid chunk).

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@thewillyhuman
Author

@edsiper, @cosmo0920 Is there anything extra I need to do before this PR gets its first review? It has been almost a week with no interaction.
