Move input coroutine management variables to TLS#11576
Move input coroutine management variables to TLS#11576runderwo wants to merge 1 commit intofluent:masterfrom
Conversation
📝 WalkthroughWalkthroughTwo global variables ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24bbbd6465
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| extern pthread_key_t __thread libco_in_param_key; | ||
| extern struct flb_libco_in_params __thread libco_in_param; |
There was a problem hiding this comment.
Guard libco TLS declarations behind FLB_HAVE_C_TLS
CMakeLists.txt explicitly supports builds where compiler TLS is unavailable (or forced off with FLB_POSIX_TLS), but these declarations now unconditionally require __thread. That breaks the non-C-TLS build path by making flb_input.h depend on compiler TLS syntax in all configurations, so affected targets fail at compile time before the pthread-key fallback can be used. Please switch this to the existing FLB_TLS_* abstraction or gate it with #ifdef FLB_HAVE_C_TLS.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
We should use the existing TLS abstraction or conditionally compile the declarations like other sources do.
There was a problem hiding this comment.
If conditional compilation is required, I think that means this approach is broken since the write conflict would still exist on any system without TLS. Should we protect the key variable with a mutex instead?
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/fluent-bit/flb_input.h`:
- Around line 617-618: Change input_params_set() to return an int error code (0
on success, non-zero on failure) and propagate allocation/pthread_setspecific
errors from flb_calloc and pthread_setspecific instead of silently swallowing
them; update callers (notably flb_input_coro_collect()) to check the return
value and on failure unwind/cleanup coroutine creation and any allocated
resources. Remove the unsafe flb_bug(params == NULL) assertion in
input_pre_cb_collect() and replace it with proper error handling (return an
error/fatal log if params is NULL after successful bootstrap) so coroutine
resume won’t abort; also stop treating pthread_key_t as __thread—ensure
pthread_key_create() is done once globally (not per-coroutine) and use
pthread_setspecific/pthread_getspecific for per-coroutine storage. Ensure
references to input_params_set, flb_calloc, pthread_setspecific,
flb_input_coro_collect, input_pre_cb_collect, pthread_key_t, and
pthread_key_create are updated accordingly.
In `@src/flb_input.c`:
- Around line 55-56: The pthread_key_t was incorrectly declared __thread which
gives each thread its own key handle and causes pthread_key_create() (called in
the collect path) to leak process-wide TSD keys; remove the __thread qualifier
from libco_in_param_key so there is a single process-wide key, and change the
initialization logic so pthread_key_create(libco_in_param_key, ...) is performed
exactly once (e.g., guarded by a static/atomic "initialized" flag in the module
init or in the function that currently creates the key) or alternatively remove
the pthread_key_t usage entirely and use __thread libco_in_param directly;
update any code paths that assume per-collect key creation (the code invoking
pthread_key_create and the flb_bug(params == NULL) error path) to reflect the
single-key or thread-local-only approach.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7b3ef4a8-cfcc-4caa-8f2a-dff5889ac801
📒 Files selected for processing (2)
include/fluent-bit/flb_input.hsrc/flb_input.c
| flb_bug(params == NULL); | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nl -ba include/fluent-bit/flb_input.h | sed -n '585,670p'Repository: fluent/fluent-bit
Length of output: 103
🏁 Script executed:
#!/bin/bash
# Read lines 585-670 from flb_input.h
cat -n include/fluent-bit/flb_input.h | sed -n '585,670p'Repository: fluent/fluent-bit
Length of output: 3099
🏁 Script executed:
#!/bin/bash
# Check the extern declarations and __thread usage
head -100 include/fluent-bit/flb_input.h | cat -nRepository: fluent/fluent-bit
Length of output: 4123
🏁 Script executed:
#!/bin/bash
# Search for libco_in_param_key declaration
rg "libco_in_param_key" include/fluent-bit/flb_input.h -B 2 -A 2Repository: fluent/fluent-bit
Length of output: 908
🏁 Script executed:
#!/bin/bash
# Search for libco_in_param_key definition in the source files
rg "libco_in_param_key" src/ --type c -B 1 -A 1 | head -50Repository: fluent/fluent-bit
Length of output: 204
Eliminate the unsafe assertion by propagating allocation/setup errors.
The assertion at line 617 assumes params is set, but input_params_set() returns void and silently fails if flb_calloc() or pthread_setspecific() fails. When the coroutine resumes and calls input_pre_cb_collect(), the assertion aborts instead of failing cleanly. Additionally, the __thread declaration of pthread_key_t is incorrect—pthread keys are global identifiers, not thread-local variables—and pthread_key_create() should not be called per-coroutine.
Required changes:
- Make
input_params_set()return an error code (e.g.,intwith 0 for success) - Check the return value in
flb_input_coro_collect()and unwind coroutine creation on failure - Replace the assertion in
input_pre_cb_collect()with proper error handling or a fatal log if params is NULL (after confirming bootstrap succeeded)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/fluent-bit/flb_input.h` around lines 617 - 618, Change
input_params_set() to return an int error code (0 on success, non-zero on
failure) and propagate allocation/pthread_setspecific errors from flb_calloc and
pthread_setspecific instead of silently swallowing them; update callers (notably
flb_input_coro_collect()) to check the return value and on failure
unwind/cleanup coroutine creation and any allocated resources. Remove the unsafe
flb_bug(params == NULL) assertion in input_pre_cb_collect() and replace it with
proper error handling (return an error/fatal log if params is NULL after
successful bootstrap) so coroutine resume won’t abort; also stop treating
pthread_key_t as __thread—ensure pthread_key_create() is done once globally (not
per-coroutine) and use pthread_setspecific/pthread_getspecific for per-coroutine
storage. Ensure references to input_params_set, flb_calloc, pthread_setspecific,
flb_input_coro_collect, input_pre_cb_collect, pthread_key_t, and
pthread_key_create are updated accordingly.
| struct flb_libco_in_params __thread libco_in_param; | ||
| pthread_key_t __thread libco_in_param_key; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C3 '\blibco_in_param_key\b|pthread_key_create\s*\(|pthread_key_delete\s*\(|pthread_once\s*\(|pthread_(get|set)specific\s*\(' include/fluent-bit/flb_input.h src/flb_input.cRepository: fluent/fluent-bit
Length of output: 2526
Do not make the pthread_key_t itself __thread.
pthread_key_t already identifies a thread-specific slot. With this change, each worker owns a different key handle, while Line 652 in include/fluent-bit/flb_input.h still calls pthread_key_create() on every collect. That leaks one process-wide TSD key per collect and never deletes it, so long-running inputs can exhaust PTHREAD_KEYS_MAX and then fall into the unchecked setup / flb_bug(params == NULL) path at line 617. Keep one process-wide key initialized once, or remove the key entirely and use __thread libco_in_param directly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/flb_input.c` around lines 55 - 56, The pthread_key_t was incorrectly
declared __thread which gives each thread its own key handle and causes
pthread_key_create() (called in the collect path) to leak process-wide TSD keys;
remove the __thread qualifier from libco_in_param_key so there is a single
process-wide key, and change the initialization logic so
pthread_key_create(libco_in_param_key, ...) is performed exactly once (e.g.,
guarded by a static/atomic "initialized" flag in the module init or in the
function that currently creates the key) or alternatively remove the
pthread_key_t usage entirely and use __thread libco_in_param directly; update
any code paths that assume per-collect key creation (the code invoking
pthread_key_create and the flb_bug(params == NULL) error path) to reflect the
single-key or thread-local-only approach.
There was a problem hiding this comment.
@cosmo0920 Also looking for suggestions here. Removing the key entirely and moving the coroutine params to TLS seems to be the most sensible path, since as the review mentioned, the key is leaked anyway (I observed this during debugging).
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
In c98a68e, coroutine state was moved to thread-local storage. However, the variables introduced to manage thread-local coroutine state were not themselves thread-local, which is a problem because flb_input_coro_collect() is not guarded against re-entrancy. This results in a race when multiple threads wake up from polling and run, because they then both read (e.g. in input_pre_cb_collect()) and overwrite (e.g. in flb_input_coro_collect()) the TLS key used to look up the stored coroutine state (libco_in_param_key), without synchronization. There are almost certainly additional bugs here but this at least mitigates the race condition. The error handling logic in input_pre_cb_collect() was removed and replaced with an assert because the recovery path made no sense, and it's not clear there is in fact a sensible recovery path given the code structure. Most likely this is related to fluent#11488 and may be the root cause. Signed-off-by: Ryan Underwood <ryan.underwood@mongodb.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@include/fluent-bit/flb_input.h`:
- Around line 581-582: The pthread TLS usage is mixed up: do not declare
pthread_key_t __thread; pick one approach and reconcile libco_in_param_key and
libco_in_param. Either (A) remove pthread_key_t entirely and make the params
struct truly thread-local by declaring libco_in_param as __thread and replace
pthread_getspecific/pthread_setspecific usage with direct access to
libco_in_param, or (B) keep libco_in_param_key as a single global pthread_key_t
(remove __thread from libco_in_param_key and libco_in_param), move the
pthread_key_create call out of flb_input_coro_collect to a one-time
initialization path, and ensure all accesses use
pthread_getspecific/pthread_setspecific with libco_in_param_key; also remove the
unused libco_in_param declaration when using option B.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f8e9437-bfb6-44b7-b362-aa18ff502522
📒 Files selected for processing (2)
include/fluent-bit/flb_input.hsrc/flb_input.c
🚧 Files skipped from review as they are similar to previous changes (1)
- src/flb_input.c
| extern pthread_key_t __thread libco_in_param_key; | ||
| extern struct flb_libco_in_params __thread libco_in_param; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the definition in the source file matches the header declaration
rg -n '__thread.*libco_in_param' src/flb_input.c -B1 -A1
# Check how pthread_key_create is called - should be once, not per-coroutine
ast-grep --pattern 'pthread_key_create($_, $_)'Repository: fluent/fluent-bit
Length of output: 3133
🏁 Script executed:
# Find the function containing line 652 in flb_input.h
head -700 include/fluent-bit/flb_input.h | tail -60 | cat -n
# Check if this is in a per-coroutine or initialization function
rg -n 'flb_input_coro_collect|pthread_key_create' include/fluent-bit/flb_input.h -B5 -A5
# Check how libco_in_param is actually accessed in the code
rg -n 'libco_in_param' src/flb_input.c | head -30Repository: fluent/fluent-bit
Length of output: 3082
🏁 Script executed:
# Check how libco_in_param_key is used - getspecific/setspecific or direct access
rg -n 'pthread_getspecific|pthread_setspecific|libco_in_param' src/flb_input.c -B2 -A2 | head -60
# Also check include files
rg -n 'libco_in_param' include/fluent-bit/flb_input.h -B2 -A2Repository: fluent/fluent-bit
Length of output: 1538
Critical: pthread_key_t must not be __thread—this breaks the TLS mechanism.
A pthread_key_t is a global identifier/handle shared by all threads to access their respective thread-local values via pthread_getspecific/pthread_setspecific. Making it __thread means each thread gets its own key variable. Combined with pthread_key_create at line 652 being called inside flb_input_coro_collect() (not at initialization), each thread creates a different key every time a coroutine is collected—completely breaking the shared-key model.
Choose one TLS approach:
- Use
__threaddirectly on the params struct (eliminatepthread_key_tentirely), OR - Keep
pthread_key_tglobal (not__thread) and callpthread_key_createexactly once at initialization (not per-coroutine)
Additionally, libco_in_param at line 582 is declared with __thread but never used—the code uses pthread_getspecific(libco_in_param_key) instead. This unused declaration suggests the two TLS approaches are being conflated and should be reconciled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@include/fluent-bit/flb_input.h` around lines 581 - 582, The pthread TLS usage
is mixed up: do not declare pthread_key_t __thread; pick one approach and
reconcile libco_in_param_key and libco_in_param. Either (A) remove pthread_key_t
entirely and make the params struct truly thread-local by declaring
libco_in_param as __thread and replace pthread_getspecific/pthread_setspecific
usage with direct access to libco_in_param, or (B) keep libco_in_param_key as a
single global pthread_key_t (remove __thread from libco_in_param_key and
libco_in_param), move the pthread_key_create call out of flb_input_coro_collect
to a one-time initialization path, and ensure all accesses use
pthread_getspecific/pthread_setspecific with libco_in_param_key; also remove the
unused libco_in_param declaration when using option B.
In c98a68e, coroutine state was moved to thread-local storage. However, the variables introduced to manage thread-local coroutine state were not themselves thread-local, which is a problem because
flb_input_coro_collect()is not guarded against re-entrancy. This results in a race when multiple threads wake up from polling and run, because they then both read (e.g. ininput_pre_cb_collect()) and overwrite (e.g. inflb_input_coro_collect()) the TLS key used to look up the stored coroutine state (libco_in_param_key), without synchronization.There are almost certainly additional bugs here but this at least mitigates the race condition. The error handling logic in
input_pre_cb_collect()was removed and replaced with an assert because the recovery path made no sense, and it's not clear there is in fact a sensible recovery path given the code structure.Most likely this is related to #11488 and may be the root cause.
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit