Skip to content

Fix infinite retry loops in flb_tls_net_read/write#11547

Open
spstack wants to merge 1 commit intofluent:masterfrom
spstack:fix-tls-infinite-loop-hang
Open

Fix infinite retry loops in flb_tls_net_read/write#11547
spstack wants to merge 1 commit intofluent:masterfrom
spstack:fix-tls-infinite-loop-hang

Conversation

@spstack
Copy link

@spstack spstack commented Mar 12, 2026

Attempts to fix 100% CPU hang in flb_tls_net_read and flb_tls_net_write when SSL_read/SSL_write returns WANT_READ/WANT_WRITE and no io_timeout is configured.

This seems to happen when connections are closed unexpectedly. It causes a condition where open SSL just repeatedly returns WANT_READ, but there's no new data being received as we've lost the other side, so it just hangs forever and consumes 100% CPU.

If an io_timeout is configured, then everything should be ok as it will eventually give up and return, but the problem arises when the default value of 0 is used for io_timeout as it will spin forever in a tight loop.

Proposing changing this to immediately return if there is no io_timeout configured.


Intended to address this issue, but may address others instances where CPU was 100% consumed
#11551


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change

This is the relevant part of the config that reproduced the issue

pipeline:
  inputs:
    - name: forward
      listen: 0.0.0.0
      port: 24224
      buffer_chunk_size: 1M
      buffer_max_size: 100M
      threaded: True
      storage.type: filesystem
      net.keepalive: on
  • Debug log output from testing the change
    • Before the change, there is no log, but the application just hangs and consumes 100% CPU
    • After the change I'm able to see this log when the issue reproduces indicating I've caught it: [2026/03/17 21:57:22.3944407] [ warn] [tls] Read timed out after 60 seconds
    • I think this proves that the issue is fixed with this change as otherwise, this would have spun forever
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes
    • Improved TLS I/O handling to prevent excessive CPU usage when timeouts are unset or expired.
    • Unified read/write readiness handling to reduce unnecessary retries and spin loops.
    • Ensure progress is reported on interrupted writes (outgoing bytes updated) when operations return due to timeout.
  • Chores
    • Introduced a default TLS I/O timeout (60s) to stabilize read/write behavior.

@coderabbitai
Copy link

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

Unifies WANT_READ/WANT_WRITE handling in TLS non-blocking I/O, adds a default IO timeout macro, computes timeout timestamps upfront, checks for expired timeouts to avoid busy loops, and inserts short sleeps on readiness waits in flb_tls_net_read and flb_tls_net_write.

Changes

Cohort / File(s) Summary
TLS core I/O
src/tls/flb_tls.c
Unified handling for FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE in read/write loops; compute timeout_timestamp early from io_timeout (with default); check current time to return on expired/zero timeouts; add short sleep (5 ms) between readiness retries; update out_len on write timeout.
Public header - default timeout
include/fluent-bit/tls/flb_tls.h
Add macro FLB_TLS_DEFAULT_IO_TIMEOUT_S = 60 after FLB_TLS_SERVER_MODE to provide a default IO timeout value.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant TLS as flb_tls (read/write)
    participant Sock as Socket/Network
    participant Clock as Timer

    App->>TLS: request read/write
    TLS->>Clock: compute timeout_timestamp (now + io_timeout)
    loop until done or timeout
        TLS->>Sock: attempt read/write
        Sock-->>TLS: WANT_READ / WANT_WRITE / bytes / error
        alt WANT_READ or WANT_WRITE
            TLS->>Clock: check now vs timeout_timestamp
            alt now >= timeout_timestamp
                TLS-->>App: timeout/error (return -1, update out_len if write)
            else
                TLS->>TLS: sleep 5ms, then retry
            end
        else bytes
            TLS-->>App: return bytes read/written
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

backport to v4.2.x

Suggested reviewers

  • edsiper
  • cosmo0920

Poem

🐰 I hopped through sockets in the night,
Timers set early to keep things right,
Reads and writes now share one trail,
A gentle nap between each fail,
I nibble bugs away — and sleep tight 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main change: fixing infinite retry loops in TLS read/write functions, which is the core issue addressed by the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/tls/flb_tls.c (1)

347-357: Tight CPU spin remains when io_timeout > 0.

The fix correctly prevents infinite loops when io_timeout == 0, but when a timeout is configured and not yet expired, the goto retry_read at line 356 still creates a tight busy-wait loop until either:

  1. The TLS operation succeeds, or
  2. The timeout expires

Compare this to the handshake retry logic at line 699 which uses flb_time_msleep(500) to avoid CPU spin, and the async variant at line 392 which properly yields via flb_coro_yield().

Consider adding a small sleep before retrying to reduce CPU consumption:

♻️ Suggested improvement to add sleep before retry
     if (ret == FLB_TLS_WANT_READ || ret == FLB_TLS_WANT_WRITE) {
         /* If no timeout is configured OR timeout expired, return immediately
          * to let the event loop wait for the socket to be ready.
          * Without this check, we loop forever at 100% CPU when timeout is 0.
          */
         if (timeout_timestamp == 0 || timeout_timestamp <= current_timestamp) {
             return ret;
         }
 
+        flb_time_msleep(10);
         goto retry_read;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tls/flb_tls.c` around lines 347 - 357, The
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins when io_timeout > 0 because
it immediately goto retry_read; modify the branch that currently checks
timeout_timestamp/current_timestamp so that, if the timeout hasn't expired, it
yields/sleeps before retrying: call flb_coro_yield() when in async mode (same
context used by the async variant) or call flb_time_msleep(50) (or a small
configurable msleep) for synchronous code, then goto retry_read; reference the
retry_read label, FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE,
timeout_timestamp/current_timestamp, flb_coro_yield and flb_time_msleep when
making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tls/flb_tls.c`:
- Around line 461-471: flb_tls_net_write currently returns
FLB_TLS_WANT_WRITE/READ directly which sync-mode callers (flb_io_net_write and
plugins/out_websocket/websocket.c, plugins/out_tcp/tcp.c,
plugins/out_syslog/syslog.c) treat only -1 as error; change behavior so WANT
codes are translated to a standard -1 return with errno set to
EAGAIN/EWOULDBLOCK when the timeout has expired or timeout_timestamp == 0,
ensuring callers' error handling runs; additionally, remove the tight busy-loop
on WANT (the goto retry_write) and instead wait for the socket to become
writable/readable using the existing IO wait primitive (e.g., flb_io_wait or a
select/poll wrapper) until timeout before retrying, then either retry or return
-1/EAGAIN on timeout.

---

Nitpick comments:
In `@src/tls/flb_tls.c`:
- Around line 347-357: The FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE handling spins
when io_timeout > 0 because it immediately goto retry_read; modify the branch
that currently checks timeout_timestamp/current_timestamp so that, if the
timeout hasn't expired, it yields/sleeps before retrying: call flb_coro_yield()
when in async mode (same context used by the async variant) or call
flb_time_msleep(50) (or a small configurable msleep) for synchronous code, then
goto retry_read; reference the retry_read label,
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE, timeout_timestamp/current_timestamp,
flb_coro_yield and flb_time_msleep when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 524eb97e-3b3c-4d74-b785-d073426424f7

📥 Commits

Reviewing files that changed from the base of the PR and between 405fed0 and 62b036d.

📒 Files selected for processing (1)
  • src/tls/flb_tls.c

@spstack spstack force-pushed the fix-tls-infinite-loop-hang branch from 62b036d to aef9239 Compare March 12, 2026 17:07
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62b036d41e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
src/tls/flb_tls.c (2)

347-356: ⚠️ Potential issue | 🟠 Major

Finite io_timeout still busy-spins until the deadline expires.

Line 356 and Line 471 jump straight back into SSL_read/SSL_write while the deadline is still in the future. That fixes the unbounded loop for io_timeout == 0, but a positive timeout can still pin a core for the full timeout window. Please wait for socket readiness between retries instead of polling the clock in a tight loop.

Also applies to: 461-471


347-353: ⚠️ Potential issue | 🟠 Major

io_timeout == 0 now changes blocking I/O into an immediate WANT return.

src/flb_network.c:151-154 documents io_timeout == 0 as infinite wait, but Line 352 and Line 466 now return FLB_TLS_WANT_* on the first retryable condition without any readiness wait. On the write side, src/flb_io.c:675-715 forwards that value unchanged, so sync callers that only treat -1 as failure can leak a WANT code instead of either blocking or entering normal error handling. Please keep the blocking contract here by waiting for readiness, or normalize the sync return path before it reaches flb_io_net_*.

#!/bin/bash
set -euo pipefail

echo "=== io_timeout default semantics ==="
sed -n '145,160p' src/flb_network.c | nl -ba

echo
echo "=== flb_io_net_write passthrough ==="
sed -n '675,715p' src/flb_io.c | nl -ba

echo
echo "=== flb_io_net_read / flb_io_net_write definitions ==="
rg -n -C3 --type=c '\bint\s+flb_io_net_(read|write)\s*\(' src

echo
echo "=== direct call sites of flb_io_net_write ==="
rg -n -C4 --type=c '\bflb_io_net_write\s*\(' src plugins

echo
echo "=== nearby handling in known sync output plugins ==="
for f in plugins/out_websocket/websocket.c plugins/out_tcp/tcp.c plugins/out_syslog/syslog.c; do
  if [ -f "$f" ]; then
    echo "--- $f ---"
    rg -n -C4 'flb_io_net_write\s*\(|==\s*-1|ret\s*<\s*0' "$f"
  fi
done

echo
echo "=== direct call sites of flb_io_net_read ==="
rg -n -C4 --type=c '\bflb_io_net_read\s*\(' src plugins

Also applies to: 461-468

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tls/flb_tls.c` around lines 347 - 353, The code returns
FLB_TLS_WANT_READ/WRITE immediately when timeout_timestamp == 0, which breaks
the documented io_timeout==0 (infinite wait) contract; change the condition so
we only return WANT when a timeout is configured and has actually expired (i.e.,
replace the current check that uses "timeout_timestamp == 0 || timeout_timestamp
<= current_timestamp" with a check like "timeout_timestamp != 0 &&
timeout_timestamp <= current_timestamp") for both retryable branches that return
FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use the
timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT codes
are returned), ensuring blocking callers see normal blocking behavior instead of
a stray WANT code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/tls/flb_tls.c`:
- Around line 347-353: The code returns FLB_TLS_WANT_READ/WRITE immediately when
timeout_timestamp == 0, which breaks the documented io_timeout==0 (infinite
wait) contract; change the condition so we only return WANT when a timeout is
configured and has actually expired (i.e., replace the current check that uses
"timeout_timestamp == 0 || timeout_timestamp <= current_timestamp" with a check
like "timeout_timestamp != 0 && timeout_timestamp <= current_timestamp") for
both retryable branches that return FLB_TLS_WANT_READ/FLB_TLS_WANT_WRITE (use
the timeout_timestamp/current_timestamp checks in flb_tls.c where those WANT
codes are returned), ensuring blocking callers see normal blocking behavior
instead of a stray WANT code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6406de29-357d-472d-8731-a76662d94f3b

📥 Commits

Reviewing files that changed from the base of the PR and between 62b036d and aef9239.

📒 Files selected for processing (1)
  • src/tls/flb_tls.c

@spstack
Copy link
Author

spstack commented Mar 17, 2026

Ok, after some additional thinking/testing, I'm changing tactics here and proposing a different change to address this. There are 2 changes to try to address the problem:

  • Add small delay when tls read returns `` so that CPU is not held while waiting for new data. I don't love adding delays, but it seems appropriate here as other locations in the code do it, and I don't see a better alternative if the calling function expects actual data
  • Add a default timeout if none is specified in net.io_timeout only for this set of functions. This ensures that we never get into a deadlock situation where we're waiting forever.
  • Also added similar timeout code to the flb_tls_net_write function

I think this should address the underlying issue even when no net.io_timeout is specified.

@spstack spstack marked this pull request as ready for review March 17, 2026 22:16
This set of changes addresses an issue where `flb_tls_net_read|write`
functions can hang and consume 100% CPU.

The issue occurs when a TLS connection is lost, and the underlying
openssl implementation repeatedly returns `SSL_ERROR_WANT_READ|WRITE`.

If no `io_timeout` is configured, then the thread will enter a tight infinite
loop retrying the read/write indefinitely until the process is restarted.

This can be addressed by setting net.io_timeout config setting to something
other than the default, but this set of changes attempts to address the
case where no default is specified.

The solution here is to simply default to a high value for the timeout
if the setting is zero. This does not modify the net.io_timeout value,
and only applies to this set of functions. Reasoning is that there
should not be a case where the user would want to spin forever here.

This change also adds a small delay in between retries so that even
for the timeout case, it doesn't load the CPU unnecessarily while
waiting for the next bit of data.

Signed-off-by: Scott Stack <scottstack14@gmail.com>
@spstack spstack force-pushed the fix-tls-infinite-loop-hang branch from 913e5de to 9ff39e6 Compare March 17, 2026 22:18
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 913e5de2df

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 340 to 342
else {
timeout_timestamp = 0;
timeout_value = FLB_TLS_DEFAULT_IO_TIMEOUT_S;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve io_timeout=0 semantics in TLS sync paths

This fallback changes net.io_timeout=0 from “infinite” to a hard 60-second deadline (FLB_TLS_DEFAULT_IO_TIMEOUT_S), so blocking TLS reads/writes can now fail after 60s even when operators intentionally disabled I/O timeouts (the global default is still 0/infinite in src/flb_network.c:153). In environments with long-lived or bursty TLS connections, repeated WANT_* during idle/backpressure will hit this forced deadline and drop otherwise healthy sessions; the same regression is also present in flb_tls_net_write’s else branch.

Useful? React with 👍 / 👎.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the question here is whether this is a realistic scenario that a user would want to block forever, and if it should be the default. The other option is to have io_timeout default to something non-zero I suppose.

Or do nothing and have users specify io_timeout to fix this.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
include/fluent-bit/tls/flb_tls.h (1)

50-51: LGTM! Clear and well-placed default timeout constant.

The macro follows the existing FLB_TLS_* naming convention and provides a sensible 60-second default to prevent indefinite spinning when io_timeout is unset (zero). This aligns well with the PR objective of fixing the 100% CPU hang.

💡 Optional: Add a brief comment for context
 `#define` FLB_TLS_SERVER_MODE 1

+/* Default I/O timeout (seconds) used when net.io_timeout is not configured */
 `#define` FLB_TLS_DEFAULT_IO_TIMEOUT_S 60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@include/fluent-bit/tls/flb_tls.h` around lines 50 - 51, Add a short
explanatory comment immediately above the FLB_TLS_DEFAULT_IO_TIMEOUT_S macro
documenting that this 60-second default is used when io_timeout is unset (zero)
to prevent indefinite looping/100% CPU hangs in TLS I/O; reference the
FLB_TLS_DEFAULT_IO_TIMEOUT_S macro name so maintainers understand its purpose
and relation to the io_timeout behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@include/fluent-bit/tls/flb_tls.h`:
- Around line 50-51: Add a short explanatory comment immediately above the
FLB_TLS_DEFAULT_IO_TIMEOUT_S macro documenting that this 60-second default is
used when io_timeout is unset (zero) to prevent indefinite looping/100% CPU
hangs in TLS I/O; reference the FLB_TLS_DEFAULT_IO_TIMEOUT_S macro name so
maintainers understand its purpose and relation to the io_timeout behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6589e3e6-6b3c-4e3c-89b6-4cdceda8da82

📥 Commits

Reviewing files that changed from the base of the PR and between aef9239 and 9ff39e6.

📒 Files selected for processing (2)
  • include/fluent-bit/tls/flb_tls.h
  • src/tls/flb_tls.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/tls/flb_tls.c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant