Skip to content

Move input coroutine management variables to TLS#11576

Open
runderwo wants to merge 1 commit intofluent:masterfrom
runderwo:master
Open

Move input coroutine management variables to TLS#11576
runderwo wants to merge 1 commit intofluent:masterfrom
runderwo:master

Conversation

@runderwo
Copy link
Contributor

@runderwo runderwo commented Mar 17, 2026

In c98a68e, coroutine state was moved to thread-local storage. However, the variables introduced to manage thread-local coroutine state were not themselves thread-local, which is a problem because flb_input_coro_collect() is not guarded against re-entrancy. This results in a race when multiple threads wake up from polling and run, because they then both read (e.g. in input_pre_cb_collect()) and overwrite (e.g. in flb_input_coro_collect()) the TLS key used to look up the stored coroutine state (libco_in_param_key), without synchronization.

There are almost certainly additional bugs here but this at least mitigates the race condition. The error handling logic in input_pre_cb_collect() was removed and replaced with an assert because the recovery path made no sense, and it's not clear there is in fact a sensible recovery path given the code structure.

Most likely this is related to #11488 and may be the root cause.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Refactor
    • Per-thread data now uses true thread-local storage for more reliable multi-threaded behavior.
  • Bug Fixes
    • Improved error detection in input handling so misconfigurations are caught earlier; callers must ensure per-thread parameters are initialized.

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

Two global variables (libco_in_param_key, libco_in_param) were changed to use C thread-local storage (__thread); lazy allocation inside input_pre_cb_collect was removed and replaced with an assertion that per-thread params must be pre-initialized by callers.

Changes

Cohort / File(s) Summary
Thread-local storage declarations
include/fluent-bit/flb_input.h, src/flb_input.c
Added __thread qualifier to libco_in_param_key and libco_in_param, converting their storage to C TLS.
Control flow / error handling
src/flb_input.c (input_pre_cb_collect)
Removed lazy allocation/fallback for per-thread params; now asserts (flb_bug) if params are NULL, requiring callers to populate TLS beforehand.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

backport to v4.0.x, backport to v4.2.x

Suggested reviewers

  • edsiper
  • cosmo0920

Poem

🐇 I hop through threads with a tiny cheer,
Each burrow keeps its params snug and near.
No lazy keys rattling in midnight code,
__thread tucks each secret down its road.
Hooray — each thread has its own cozy load!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: moving input coroutine management variables to thread-local storage (TLS), which directly corresponds to the declared changes adding __thread annotations to libco_in_param_key and libco_in_param.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24bbbd6465

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +581 to +582
extern pthread_key_t __thread libco_in_param_key;
extern struct flb_libco_in_params __thread libco_in_param;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Guard libco TLS declarations behind FLB_HAVE_C_TLS

CMakeLists.txt explicitly supports builds where compiler TLS is unavailable (or forced off with FLB_POSIX_TLS), but these declarations now unconditionally require __thread. That breaks the non-C-TLS build path by making flb_input.h depend on compiler TLS syntax in all configurations, so affected targets fail at compile time before the pthread-key fallback can be used. Please switch this to the existing FLB_TLS_* abstraction or gate it with #ifdef FLB_HAVE_C_TLS.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmo0920 Open to suggestions here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the existing TLS abstraction or conditionally compile the declarations like other sources do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If conditional compilation is required, I think that means this approach is broken since the write conflict would still exist on any system without TLS. Should we protect the key variable with a mutex instead?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/fluent-bit/flb_input.h`:
- Around line 617-618: Change input_params_set() to return an int error code (0
on success, non-zero on failure) and propagate allocation/pthread_setspecific
errors from flb_calloc and pthread_setspecific instead of silently swallowing
them; update callers (notably flb_input_coro_collect()) to check the return
value and on failure unwind/cleanup coroutine creation and any allocated
resources. Remove the unsafe flb_bug(params == NULL) assertion in
input_pre_cb_collect() and replace it with proper error handling (return an
error/fatal log if params is NULL after successful bootstrap) so coroutine
resume won’t abort; also stop treating pthread_key_t as __thread—ensure
pthread_key_create() is done once globally (not per-coroutine) and use
pthread_setspecific/pthread_getspecific for per-coroutine storage. Ensure
references to input_params_set, flb_calloc, pthread_setspecific,
flb_input_coro_collect, input_pre_cb_collect, pthread_key_t, and
pthread_key_create are updated accordingly.

In `@src/flb_input.c`:
- Around line 55-56: The pthread_key_t was incorrectly declared __thread which
gives each thread its own key handle and causes pthread_key_create() (called in
the collect path) to leak process-wide TSD keys; remove the __thread qualifier
from libco_in_param_key so there is a single process-wide key, and change the
initialization logic so pthread_key_create(libco_in_param_key, ...) is performed
exactly once (e.g., guarded by a static/atomic "initialized" flag in the module
init or in the function that currently creates the key) or alternatively remove
the pthread_key_t usage entirely and use __thread libco_in_param directly;
update any code paths that assume per-collect key creation (the code invoking
pthread_key_create and the flb_bug(params == NULL) error path) to reflect the
single-key or thread-local-only approach.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b3ef4a8-cfcc-4caa-8f2a-dff5889ac801

📥 Commits

Reviewing files that changed from the base of the PR and between 962855c and 24bbbd6.

📒 Files selected for processing (2)
  • include/fluent-bit/flb_input.h
  • src/flb_input.c

Comment on lines +617 to +618
flb_bug(params == NULL);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

nl -ba include/fluent-bit/flb_input.h | sed -n '585,670p'

Repository: fluent/fluent-bit

Length of output: 103


🏁 Script executed:

#!/bin/bash
# Read lines 585-670 from flb_input.h
cat -n include/fluent-bit/flb_input.h | sed -n '585,670p'

Repository: fluent/fluent-bit

Length of output: 3099


🏁 Script executed:

#!/bin/bash
# Check the extern declarations and __thread usage
head -100 include/fluent-bit/flb_input.h | cat -n

Repository: fluent/fluent-bit

Length of output: 4123


🏁 Script executed:

#!/bin/bash
# Search for libco_in_param_key declaration
rg "libco_in_param_key" include/fluent-bit/flb_input.h -B 2 -A 2

Repository: fluent/fluent-bit

Length of output: 908


🏁 Script executed:

#!/bin/bash
# Search for libco_in_param_key definition in the source files
rg "libco_in_param_key" src/ --type c -B 1 -A 1 | head -50

Repository: fluent/fluent-bit

Length of output: 204


Eliminate the unsafe assertion by propagating allocation/setup errors.

The assertion at line 617 assumes params is set, but input_params_set() returns void and silently fails if flb_calloc() or pthread_setspecific() fails. When the coroutine resumes and calls input_pre_cb_collect(), the assertion aborts instead of failing cleanly. Additionally, the __thread declaration of pthread_key_t is incorrect—pthread keys are global identifiers, not thread-local variables—and pthread_key_create() should not be called per-coroutine.

Required changes:

  1. Make input_params_set() return an error code (e.g., int with 0 for success)
  2. Check the return value in flb_input_coro_collect() and unwind coroutine creation on failure
  3. Replace the assertion in input_pre_cb_collect() with proper error handling or a fatal log if params is NULL (after confirming bootstrap succeeded)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/fluent-bit/flb_input.h` around lines 617 - 618, Change
input_params_set() to return an int error code (0 on success, non-zero on
failure) and propagate allocation/pthread_setspecific errors from flb_calloc and
pthread_setspecific instead of silently swallowing them; update callers (notably
flb_input_coro_collect()) to check the return value and on failure
unwind/cleanup coroutine creation and any allocated resources. Remove the unsafe
flb_bug(params == NULL) assertion in input_pre_cb_collect() and replace it with
proper error handling (return an error/fatal log if params is NULL after
successful bootstrap) so coroutine resume won’t abort; also stop treating
pthread_key_t as __thread—ensure pthread_key_create() is done once globally (not
per-coroutine) and use pthread_setspecific/pthread_getspecific for per-coroutine
storage. Ensure references to input_params_set, flb_calloc, pthread_setspecific,
flb_input_coro_collect, input_pre_cb_collect, pthread_key_t, and
pthread_key_create are updated accordingly.

Comment on lines +55 to +56
struct flb_libco_in_params __thread libco_in_param;
pthread_key_t __thread libco_in_param_key;
Copy link

@coderabbitai coderabbitai bot Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C3 '\blibco_in_param_key\b|pthread_key_create\s*\(|pthread_key_delete\s*\(|pthread_once\s*\(|pthread_(get|set)specific\s*\(' include/fluent-bit/flb_input.h src/flb_input.c

Repository: fluent/fluent-bit

Length of output: 2526


Do not make the pthread_key_t itself __thread.

pthread_key_t already identifies a thread-specific slot. With this change, each worker owns a different key handle, while Line 652 in include/fluent-bit/flb_input.h still calls pthread_key_create() on every collect. That leaks one process-wide TSD key per collect and never deletes it, so long-running inputs can exhaust PTHREAD_KEYS_MAX and then fall into the unchecked setup / flb_bug(params == NULL) path at line 617. Keep one process-wide key initialized once, or remove the key entirely and use __thread libco_in_param directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/flb_input.c` around lines 55 - 56, The pthread_key_t was incorrectly
declared __thread which gives each thread its own key handle and causes
pthread_key_create() (called in the collect path) to leak process-wide TSD keys;
remove the __thread qualifier from libco_in_param_key so there is a single
process-wide key, and change the initialization logic so
pthread_key_create(libco_in_param_key, ...) is performed exactly once (e.g.,
guarded by a static/atomic "initialized" flag in the module init or in the
function that currently creates the key) or alternatively remove the
pthread_key_t usage entirely and use __thread libco_in_param directly; update
any code paths that assume per-collect key creation (the code invoking
pthread_key_create and the flb_bug(params == NULL) error path) to reflect the
single-key or thread-local-only approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmo0920 Also looking for suggestions here. Removing the key entirely and moving the coroutine params to TLS seems to be the most sensible path, since as the review mentioned, the key is leaked anyway (I observed this during debugging).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

In c98a68e, coroutine state was moved to thread-local storage.
However, the variables introduced to manage thread-local coroutine
state were not themselves thread-local, which is a problem because
flb_input_coro_collect() is not guarded against re-entrancy.  This
results in a race when multiple threads wake up from polling and
run, because they then both read (e.g. in input_pre_cb_collect())
and overwrite (e.g. in flb_input_coro_collect()) the TLS key used
to look up the stored coroutine state (libco_in_param_key), without
synchronization.

There are almost certainly additional bugs here but this at least
mitigates the race condition.  The error handling logic in
input_pre_cb_collect() was removed and replaced with an assert
because the recovery path made no sense, and it's not clear there
is in fact a sensible recovery path given the code structure.

Most likely this is related to fluent#11488 and may be the root cause.

Signed-off-by: Ryan Underwood <ryan.underwood@mongodb.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@include/fluent-bit/flb_input.h`:
- Around line 581-582: The pthread TLS usage is mixed up: do not declare
pthread_key_t __thread; pick one approach and reconcile libco_in_param_key and
libco_in_param. Either (A) remove pthread_key_t entirely and make the params
struct truly thread-local by declaring libco_in_param as __thread and replace
pthread_getspecific/pthread_setspecific usage with direct access to
libco_in_param, or (B) keep libco_in_param_key as a single global pthread_key_t
(remove __thread from libco_in_param_key and libco_in_param), move the
pthread_key_create call out of flb_input_coro_collect to a one-time
initialization path, and ensure all accesses use
pthread_getspecific/pthread_setspecific with libco_in_param_key; also remove the
unused libco_in_param declaration when using option B.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f8e9437-bfb6-44b7-b362-aa18ff502522

📥 Commits

Reviewing files that changed from the base of the PR and between cd45b64 and c12e049.

📒 Files selected for processing (2)
  • include/fluent-bit/flb_input.h
  • src/flb_input.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/flb_input.c

Comment on lines +581 to +582
extern pthread_key_t __thread libco_in_param_key;
extern struct flb_libco_in_params __thread libco_in_param;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the definition in the source file matches the header declaration
rg -n '__thread.*libco_in_param' src/flb_input.c -B1 -A1

# Check how pthread_key_create is called - should be once, not per-coroutine
ast-grep --pattern 'pthread_key_create($_, $_)'

Repository: fluent/fluent-bit

Length of output: 3133


🏁 Script executed:

# Find the function containing line 652 in flb_input.h
head -700 include/fluent-bit/flb_input.h | tail -60 | cat -n

# Check if this is in a per-coroutine or initialization function
rg -n 'flb_input_coro_collect|pthread_key_create' include/fluent-bit/flb_input.h -B5 -A5

# Check how libco_in_param is actually accessed in the code
rg -n 'libco_in_param' src/flb_input.c | head -30

Repository: fluent/fluent-bit

Length of output: 3082


🏁 Script executed:

# Check how libco_in_param_key is used - getspecific/setspecific or direct access
rg -n 'pthread_getspecific|pthread_setspecific|libco_in_param' src/flb_input.c -B2 -A2 | head -60

# Also check include files
rg -n 'libco_in_param' include/fluent-bit/flb_input.h -B2 -A2

Repository: fluent/fluent-bit

Length of output: 1538


Critical: pthread_key_t must not be __thread—this breaks the TLS mechanism.

A pthread_key_t is a global identifier/handle shared by all threads to access their respective thread-local values via pthread_getspecific/pthread_setspecific. Making it __thread means each thread gets its own key variable. Combined with pthread_key_create at line 652 being called inside flb_input_coro_collect() (not at initialization), each thread creates a different key every time a coroutine is collected—completely breaking the shared-key model.

Choose one TLS approach:

  1. Use __thread directly on the params struct (eliminate pthread_key_t entirely), OR
  2. Keep pthread_key_t global (not __thread) and call pthread_key_create exactly once at initialization (not per-coroutine)

Additionally, libco_in_param at line 582 is declared with __thread but never used—the code uses pthread_getspecific(libco_in_param_key) instead. This unused declaration suggests the two TLS approaches are being conflated and should be reconciled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/fluent-bit/flb_input.h` around lines 581 - 582, The pthread TLS usage
is mixed up: do not declare pthread_key_t __thread; pick one approach and
reconcile libco_in_param_key and libco_in_param. Either (A) remove pthread_key_t
entirely and make the params struct truly thread-local by declaring
libco_in_param as __thread and replace pthread_getspecific/pthread_setspecific
usage with direct access to libco_in_param, or (B) keep libco_in_param_key as a
single global pthread_key_t (remove __thread from libco_in_param_key and
libco_in_param), move the pthread_key_create call out of flb_input_coro_collect
to a one-time initialization path, and ensure all accesses use
pthread_getspecific/pthread_setspecific with libco_in_param_key; also remove the
unused libco_in_param declaration when using option B.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants