Fix infinite retry loops in flb_tls_net_read/write#11547
Fix infinite retry loops in flb_tls_net_read/write#11547spstack wants to merge 1 commit intofluent:masterfrom
Conversation
📝 WalkthroughWalkthroughUnifies WANT_READ/WANT_WRITE handling in TLS non-blocking I/O, adds a default IO timeout macro, computes timeout timestamps upfront, checks for expired timeouts to avoid busy loops, and inserts short sleeps on readiness waits in Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant TLS as flb_tls (read/write)
participant Sock as Socket/Network
participant Clock as Timer
App->>TLS: request read/write
TLS->>Clock: compute timeout_timestamp (now + io_timeout)
loop until done or timeout
TLS->>Sock: attempt read/write
Sock-->>TLS: WANT_READ / WANT_WRITE / bytes / error
alt WANT_READ or WANT_WRITE
TLS->>Clock: check now vs timeout_timestamp
alt now >= timeout_timestamp
TLS-->>App: timeout/error (return -1, update out_len if write)
else
TLS->>TLS: sleep 5ms, then retry
end
else bytes
TLS-->>App: return bytes read/written
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/tls/flb_tls.c (1)
347-357: Tight CPU spin remains whenio_timeout > 0.The fix correctly prevents infinite loops when
io_timeout == 0, but when a timeout is configured and not yet expired, thegoto retry_readat line 356 still creates a tight busy-wait loop until either:
- The TLS operation succeeds, or
- The timeout expires
Compare this to the handshake retry logic at line 699 which uses
flb_time_msleep(500)to avoid CPU spin, and the async variant at line 392 which properly yields viaflb_coro_yield().Consider adding a small sleep before retrying to reduce CPU consumption:
♻️ Suggested improvement to add sleep before retry
if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) { /* If no timeout is configured OR timeout expired, return immediately * to let the event loop wait for the socket to be ready. * Without this check, we loop forever at 100% CPU when timeout is 0. */ if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) { return ret; } + flb_time_msleep(10); goto retry_read; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tls/flb_tls.c` around lines 347 - 357, The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins when io_timeout > 0 because it immediately goto retry_read; modify the branch that currently checks timeout_timestamp/current_timestamp so that, if the timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield() when in async mode (same context used by the async variant) or call flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then goto retry_read; reference the retry_read label, FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp, flb_coro_yield and flb_time_msleep when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tls/flb_tls.c`:
- Around line 461-471: flb_tls_net_write currently returns
FLB_TLS_WANT_WRITE/READ directly which sync-mode callers (flb_io_net_write and
plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.
---
Nitpick comments:
In `@src/tls/flb_tls.c`:
- Around line 347-357: The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins
when io_timeout > 0 because it immediately goto retry_read; modify the branch
that currently checks timeout_timestamp/current_timestamp so that, if the
timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield()
when in async mode (same context used by the async variant) or call
flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then
goto retry_read; reference the retry_read label,
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp,
flb_coro_yield and flb_time_msleep when making the change.
62b036d to
aef9239
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62b036d41e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/tls/flb_tls.c (2)
347-356:⚠️ Potential issue | 🟠 MajorFinite
io_timeoutstill busy-spins until the deadline expires.Line 356 and Line 471 jump straight back into
SSL_read/SSL_writewhile the deadline is still in the future. That fixes the unbounded loop forio_timeout == 0, but a positive timeout can still pin a core for the full timeout window. Please wait for socket readiness between retries instead of polling the clock in a tight loop.Also applies to: 461-471
347-353:⚠️ Potential issue | 🟠 Major
io_timeout == 0now changes blocking I/O into an immediate WANT return.
src/flb_network.c:151-154documentsio_timeout == 0as infinite wait, but Line 352 and Line 466 now returnFLB_TLS_WANT_*on the first retryable condition without any readiness wait. On the write side,src/flb_io.c:675-715forwards that value unchanged, so sync callers that only treat-1as failure can leak a WANT code instead of either blocking or entering normal error handling. Please keep the blocking contract here by waiting for readiness, or normalize the sync return path before it reachesflb_io_net_*.#!/bin/bash set -euo pipefail echo "=== io_timeout default semantics ===" sed -n '145,160p' src/flb_network.c | nl -ba echo echo "=== flb_io_net_write passthrough ===" sed -n '675,715p' src/flb_io.c | nl -ba echo echo "=== flb_io_net_read / flb_io_net_write definitions ===" rg -n -C3 --type=c '\bint\s+flb_io_net_(read|write)\s*\(' src echo echo "=== direct call sites of flb_io_net_write ===" rg -n -C4 --type=c '\bflb_io_net_write\s*\(' src plugins echo echo "=== nearby handling in known sync output plugins ===" for f in plugins/out_websocket/websocket.c plugins/out_tcp/tcp.c plugins/out_syslog/syslog.c; do if [ -f "$f" ]; then echo "--- $f ---" rg -n -C4 'flb_io_net_write\s*\(|==\s*-1|ret\s*<\s*0' "$f" fi done echo echo "=== direct call sites of flb_io_net_read ===" rg -n -C4 --type=c '\bflb_io_net_read\s*\(' src pluginsAlso applies to: 461-468
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tls/flb_tls.c` around lines 347 - 353, The code returns FLB_TLS_WANT_READ/WRITE immediately when timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite wait) contract; change the condition so we only return WANT when a timeout is configured and has actually expired (i.e., replace the current check that uses "timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT codes are returned), ensuring blocking callers see normal blocking behavior instead of a stray WANT code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/tls/flb_tls.c`:
- Around line 347-353: The code returns FLB_TLS_WANT_READ/WRITE immediately when
timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite
wait) contract; change the condition so we only return WANT when a timeout is
configured and has actually expired (i.e., replace the current check that uses
"timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check
like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for
both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use
the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT
codes are returned), ensuring blocking callers see normal blocking behavior
instead of a stray WANT code.
aef9239 to
913e5de
Compare
|
Ok, after some additional thinking/testing, I'm changing tactics here and proposing a different change to address this. There are 2 changes to try to address the problem:
I think this should address the underlying issue even when no |
This set of changes addresses an issue where `flb_tls_net_read|write` functions can hang and consume 100% CPU. The issue occurs when a TLS connection is lost, and the underlying openssl implementation repeatedly returns `SSL_ERROR_WANT_READ|WRITE`. If no `io_timeout` is configured, then the thread will enter a tight infinite loop retrying the read/write indefinitely until the process is restarted. This can be addressed by setting net.io_timeout config setting to something other than the default, but this set of changes attempts to address the case where no default is specified. The solution here is to simply default to a high value for the timeout if the setting is zero. This does not modify the net.io_timeout value, and only applies to this set of functions. Reasoning is that there should not be a case where the user would want to spin forever here. This change also adds a small delay in between retries so that even for the timeout case, it doesn't load the CPU unnecessarily while waiting for the next bit of data. Signed-off-by: Scott Stack <scottstack14@gmail.com>
913e5de to
9ff39e6
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 913e5de2df
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| else { | ||
| timeout_timestamp = 0; | ||
| timeout_value = FLB_TLS_DEFAULT_IO_TIMEOUT_S; | ||
| } |
There was a problem hiding this comment.
Preserve io_timeout=0 semantics in TLS sync paths
This fallback changes net.io_timeout=0 from “infinite” to a hard 60-second deadline (FLB_TLS_DEFAULT_IO_TIMEOUT_S), so blocking TLS reads/writes can now fail after 60s even when operators intentionally disabled I/O timeouts (the global default is still 0/infinite in src/flb_network.c:153). In environments with long-lived or bursty TLS connections, repeated WANT_* during idle/backpressure will hit this forced deadline and drop otherwise healthy sessions; the same regression is also present in flb_tls_net_write’s else branch.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I suppose the question here is whether this is a realistic scenario that a user would want to block forever, and if it should be the default. The other option is to have io_timeout default to something non-zero I suppose.
Or do nothing and have users specify io_timeout to fix this.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
include/fluent-bit/tls/flb_tls.h (1)
50-51: LGTM! Clear and well-placed default timeout constant.The macro follows the existing
FLB_TLS_*naming convention and provides a sensible 60-second default to prevent indefinite spinning whenio_timeoutis unset (zero). This aligns well with the PR objective of fixing the 100% CPU hang.💡 Optional: Add a brief comment for context
`#define` FLB_TLS_SERVER_MODE 1 +/* Default I/O timeout (seconds) used when net.io_timeout is not configured */ `#define` FLB_TLS_DEFAULT_IO_TIMEOUT_S 60🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@include/fluent-bit/tls/flb_tls.h` around lines 50 - 51, Add a short explanatory comment immediately above the FLB_TLS_DEFAULT_IO_TIMEOUT_S macro documenting that this 60-second default is used when io_timeout is unset (zero) to prevent indefinite looping/100% CPU hangs in TLS I/O; reference the FLB_TLS_DEFAULT_IO_TIMEOUT_S macro name so maintainers understand its purpose and relation to the io_timeout behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@include/fluent-bit/tls/flb_tls.h`:
- Around line 50-51: Add a short explanatory comment immediately above the
FLB_TLS_DEFAULT_IO_TIMEOUT_S macro documenting that this 60-second default is
used when io_timeout is unset (zero) to prevent indefinite looping/100% CPU
hangs in TLS I/O; reference the FLB_TLS_DEFAULT_IO_TIMEOUT_S macro name so
maintainers understand its purpose and relation to the io_timeout behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6589e3e6-6b3c-4e3c-89b6-4cdceda8da82
📒 Files selected for processing (2)
include/fluent-bit/tls/flb_tls.hsrc/tls/flb_tls.c
🚧 Files skipped from review as they are similar to previous changes (1)
- src/tls/flb_tls.c
Attempts to fix 100% CPU hang in flb_tls_net_read and flb_tls_net_write when SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout is configured.
This seems to happen when connections are closed unexpectedly. It causes a condition where open SSL just repeatedly returns
WANT_READ, but there's no new data being received as we've lost the other side, so it just hangs forever and consumes 100% CPU.If an
io_timeoutis configured, then everything should be ok as it will eventually give up and return, but the problem arises when the default value of0is used forio_timeoutas it will spin forever in a tight loop.Proposing changing this to immediately return if there is no io_timeout configured.
Intended to address this issue, but may address others instances where CPU was 100% consumed
#11551
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
This is the relevant part of the config that reproduced the issue
[2026/03/17 21:57:22.3944407] [ warn] [tls] Read timed out after 60 secondsIf this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit