
snappy: fix intermittent decompression failures for raw snappy data #11550

Open
thewillyhuman wants to merge 2 commits into fluent:master from thewillyhuman:gfacundo/fix-snappy

Conversation

@thewillyhuman

@thewillyhuman thewillyhuman commented Mar 13, 2026

Summary

  • Fix raw/block snappy data being misidentified as framed/streaming format, causing intermittent decompression failures
  • Fix multi-chunk framed snappy payloads being silently dropped due to an accumulated length reset

Problem

We observed that when sending metrics to Fluent Bit via the Prometheus Remote Write input plugin, some requests would randomly fail with:

[error] [http snappy] decompression failed

All metrics were confirmed as being sent correctly to Fluent Bit, yet not all of them made it through. The failures were intermittent and appeared random — the same configuration would work most of the time but occasionally drop data.

Root Cause

Bug 1: Raw snappy misidentified as framed (the intermittent failure)

Prometheus Remote Write uses raw/block snappy compression (as per the spec). Fluent Bit's flb_snappy_uncompress_framed_data() tries to detect whether incoming data is raw or framed by checking if the first byte equals 0xFF — the snappy stream identifier frame type.

The problem is that raw snappy data also starts with a varint-encoded uncompressed length, and a varint byte of 0xFF naturally occurs whenever the uncompressed payload size satisfies size >= 128 && size % 128 == 127 (e.g. sizes 255, 383, 511, 639, ...).

When this happens (~1 in 128 payloads depending on data size), the raw data is incorrectly routed into the framed parser, which then fails trying to match the "sNaPpY" stream identifier against compressed binary data.

This explains why the issue is intermittent — it depends entirely on the serialized protobuf payload size, which varies with the number and shape of metrics in each batch.
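The varint collision described above can be sketched in a few lines of C. This is illustrative only, not Fluent Bit code; `varint_first_byte` is a hypothetical helper:

```c
#include <stdint.h>

/* Illustrative helper (not part of Fluent Bit): compute the first byte of
 * the varint encoding of an unsigned length. For len >= 128 the byte
 * carries the low 7 bits plus the continuation bit 0x80, so it equals
 * 0xFF exactly when len % 128 == 127 -- colliding with the framed snappy
 * stream identifier byte. */
static uint8_t varint_first_byte(uint32_t len)
{
    if (len < 128) {
        return (uint8_t) len;               /* single-byte varint */
    }
    return (uint8_t) ((len & 0x7F) | 0x80); /* continuation bit set */
}
```

For example, uncompressed sizes 255, 383, and 511 all produce a first byte of `0xFF`, which the old single-byte check misread as the start of a framed stream.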

Bug 2: Multi-chunk framed data silently dropped

In the framed snappy code path, after the parsing loop accumulates the total decompressed length across all chunks, this code runs:

aggregated_data_buffer = NULL;
aggregated_data_length = 0;   // <-- bug: resets the accumulated length

The aggregated_data_length = 0 zeroes out the total before it is used for the output buffer allocation. For multi-chunk payloads, no buffer is allocated and all decompressed chunk data is discarded — while the function still returns success (0). This is a silent data loss bug.

Fix

Commit 1 [7800ae2]: Full stream identifier validation

Replace the single first-byte check with validation of the complete 10-byte snappy framed stream identifier header:

| Offset | Value | Meaning |
| --- | --- | --- |
| 0 | `0xFF` | Stream identifier chunk type |
| 1-3 | `0x06 0x00 0x00` | Length = 6 (little-endian 24-bit) |
| 4-9 | `sNaPpY` | Magic string |

This eliminates false positives from raw snappy data whose varint happens to start with 0xFF.
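A minimal sketch of this validation approach (illustrative names, not the exact Fluent Bit identifiers; the actual implementation lives in `src/flb_snappy.c`):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the full-header check described above: classify input as
 * framed snappy only if the entire 10-byte stream identifier matches. */
static int looks_like_framed_snappy(const uint8_t *data, size_t len)
{
    static const uint8_t header[10] = {
        0xFF,             /* stream identifier chunk type */
        0x06, 0x00, 0x00, /* 24-bit little-endian length = 6 */
        's', 'N', 'a', 'P', 'p', 'Y'
    };

    return len >= sizeof(header) &&
           memcmp(data, header, sizeof(header)) == 0;
}
```

Raw snappy data whose varint happens to start with `0xFF` will almost surely fail the full comparison and correctly fall through to the raw decompression path.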

Commit 2 [d1b8f03]: Remove erroneous length reset

Remove the aggregated_data_length = 0 line so the accumulated length from the parsing loop is preserved and used for buffer allocation in the multi-chunk path.

Test plan

  • Fluent Bit builds with no errors
  • All 80 internal tests pass (ctest — 100% pass rate)
  • Both patches apply cleanly on top of v4.1.0
  • Verified with Prometheus Remote Write workload that previously triggered ~1-in-128 failures

Summary by CodeRabbit

Bug Fixes

  • Enhanced detection of framed data format to ensure accurate identification and correct processing of compressed data streams, preventing potential misinterpretation of input formats.

The framed/streaming snappy format starts with a stream identifier
frame whose first byte is 0xFF. The previous check only tested this
single byte, but raw/block snappy data can also have 0xFF as its
first byte when the varint-encoded uncompressed length happens to
satisfy (size % 128 == 127 && size >= 128).

This caused intermittent decompression failures for Prometheus
Remote Write payloads (which use raw snappy) — roughly 1 in 128
requests would be misrouted into the framed parser and rejected.

Validate the full 10-byte stream identifier header (0xFF + 3-byte
length 0x000006 + "sNaPpY") instead of just the first byte.

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

The pull request enhances frame detection in the snappy compression module by adding a `string.h` inclusion and replacing a single-byte frame check with a robust 10-byte header validation. An explicit variable reset is also removed, relying on the initialization at declaration instead.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Snappy Frame Detection Enhancement** `src/flb_snappy.c` | Adds `string.h` inclusion; replaces the simple frame-type check with robust framed-data detection that validates a full 10-byte header instead of a single byte; removes the explicit reset of `aggregated_data_length`, relying on initialization. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A frame so fine, ten bytes of grace,
No single check could set the pace,
Now snappy streams are read just right,
With headers verified, pixel bright!
String.h joins the party here,
Making data flows crystal clear! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main fix: addressing intermittent decompression failures for raw snappy data by improving framed data detection. |
| Description check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


Inline comments:
In `@src/flb_snappy.c`:
- Around line 157-163: The conditional that falls back to the raw path can call
flb_snappy_uncompress(...) before checking output pointers, so move the
validation of out_data and out_len to precede that fallback branch and return -1
if either is NULL; specifically, ensure before calling flb_snappy_uncompress (in
the same conditional that checks FLB_SNAPPY_STREAM_IDENTIFIER_STRING and the
0xFF/0x06/0x00/0x00 header bytes) you validate out_data and out_len and bail out
with -1 rather than invoking flb_snappy_uncompress with NULL outputs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0ad137c9-e314-4556-9b2e-80eb348e351e

📥 Commits

Reviewing files that changed from the base of the PR and between acad02a and e9cb70b.

📒 Files selected for processing (1)
  • src/flb_snappy.c


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9cb70b02d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@@ -282,7 +302,6 @@ int flb_snappy_uncompress_framed_data(char *in_data, size_t in_len,
}

aggregated_data_buffer = NULL;


P2 Badge Prevent leaking output buffer when framed parse fails

Removing the aggregated_data_length = 0 reset fixes multi-chunk success cases, but it also means aggregated_data_length can stay non-zero when a later frame sets result != 0 (for example, checksum mismatch after one valid chunk). In that case the non-fast path still allocates aggregated_data_buffer and returns an error, and current callers treat nonzero returns as failure without freeing *out_data, so malformed multi-frame snappy payloads can now leak memory per request. Please gate allocation on result == 0 (or clear the length on error) before setting *out_data.
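The cleanup Codex suggests can be sketched in isolation. This is a simplified, hypothetical helper (`assemble_output` is not a real Fluent Bit function; the real logic sits in `flb_snappy_uncompress_framed_data`):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of the suggested error handling: allocate the output
 * buffer only when the frame loop succeeded, and hand the caller NULL on
 * failure, so a malformed multi-frame payload never leaks an allocation. */
static int assemble_output(int parse_result, const char *aggregated,
                           size_t aggregated_len,
                           char **out_data, size_t *out_len)
{
    char *buf;

    *out_data = NULL;
    *out_len = 0;

    if (parse_result != 0 || aggregated_len == 0) {
        return -1; /* nothing allocated, nothing to leak */
    }

    buf = malloc(aggregated_len);
    if (buf == NULL) {
        return -1;
    }
    memcpy(buf, aggregated, aggregated_len);

    *out_data = buf;
    *out_len = aggregated_len;
    return 0;
}
```

Because callers treat a nonzero return as failure without freeing `*out_data`, gating the allocation on success (or clearing the output on error) is what prevents the per-request leak.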


Author


I have updated the commit to take this case into account. This is a completely fair point; we had discussed it internally, and this review confirmed it.

@thewillyhuman
Author

thewillyhuman commented Mar 13, 2026

The following image shows the metrics passing through Fluent Bit before and after the patch. It is clear that before the patch, metrics were being silently dropped because of the described error.

Screenshot 2026-03-13 at 16 53 43

After the frame parsing loop accumulates the total decompressed
length in aggregated_data_length, the variable was incorrectly
reset to zero before being used for the output buffer allocation.
This caused the multi-chunk code path to skip allocation entirely,
producing a NULL output buffer and zero length with a success
return code — silently dropping all decompressed data.

Remove the erroneous reset so the accumulated length is preserved
and used for the aggregation buffer allocation. Also gate the
allocation on result == 0 and free any allocated buffer on error,
preventing a memory leak when a multi-chunk stream fails mid-way
(e.g. checksum mismatch after one valid chunk).

Signed-off-by: Guillermo Facundo Colunga <guillermo.facundo.colunga@cern.ch>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@thewillyhuman
Author

@edsiper, @cosmo0920 Is there anything extra I need to do before this PR gets its first review? It has been almost a week with no interaction.
