[INFRA-9517] handle auth failures with immediate reconnection #105

jsadn · 2025-10-15T14:55:46Z

issue: all of our nsq consumers intermittently log E_AUTH_FAILED, on average about once a week. after thorough investigation, the root cause appears to be a discrepancy between the NSQ auth session state and the TCP connection state when using nsq with auth.

in this PR we:

catch this error
close the connection
trigger reconnection immediately (right now reconnection only happens every minute on a scheduled task)
don't log the error

  What Happens:

  Hour 1: Connection established
    Client: "Here's my TCP handshake" ✓
    Client: "AUTH token123" ✓
    NSQ: "OK, you're authenticated" → Stores in memory

  Hour 2-10: Everything works
    Client: "NOP" (heartbeat) ✓
    NSQ: "OK, TCP connection alive" ✓
    Client: "SUB topic channel" ✓
    NSQ: Checks memory: "Is this connection authed?" ✓ → Allow

  Hour 11: NSQ internal cleanup runs
    NSQ: "Clean up auth sessions older than 10 hours"
    NSQ: Removes auth state from memory
    TCP connection: Still open! Still sending heartbeats!

  Hour 11+:
    Client: "NOP" (heartbeat) ✓
    NSQ: "OK, TCP connection alive" ✓  ← NSQ doesn't check auth for heartbeats

    Client: "SUB topic channel"
    NSQ: Checks memory: "Is this connection authed?" ✗
    NSQ: "E_AUTH_FAILED"
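
The timeline above can be reproduced with a toy model. This is only a sketch of the hypothesis in this description, not NSQ's actual implementation: `ToyAuthServer`, its per-connection map, and the hour-based TTL are all invented for illustration. It shows heartbeats succeeding while `SUB` fails once the auth state is gone.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the suspected failure mode; hypothetical, not NSQ code. */
class ToyAuthServer {
    // Hypothetical: auth state per connection id, storing the hour it was granted.
    private final Map<String, Long> authedAtHour = new HashMap<>();
    private final long sessionTtlHours;

    ToyAuthServer(long sessionTtlHours) {
        this.sessionTtlHours = sessionTtlHours;
    }

    /** Records that the connection authenticated at the given hour. */
    String auth(String connId, long nowHour) {
        authedAtHour.put(connId, nowHour);
        return "OK";
    }

    /** Heartbeats only show that the TCP connection is alive; auth state is not consulted. */
    String nop(String connId) {
        return "OK";
    }

    /** SUB consults auth state, so it fails once the session has aged out. */
    String sub(String connId, long nowHour) {
        Long grantedAt = authedAtHour.get(connId);
        if (grantedAt == null || nowHour - grantedAt >= sessionTtlHours) {
            authedAtHour.remove(connId); // the "cleanup" step in the timeline
            return "E_AUTH_FAILED";
        }
        return "OK";
    }
}
```

With a TTL of 10 and `AUTH` at hour 1, `sub` returns `OK` at hour 10 and `E_AUTH_FAILED` at hour 11, while `nop` keeps returning `OK` throughout.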

blakesmith · 2025-10-15T15:05:02Z

@jsadn Can you outline the issue this PR is addressing in more detail? I don't understand how to read your paste dump in the description.

jsadn · 2025-10-15T15:41:47Z

@blakesmith i found the verbose example helpful, but i updated the description to summarize it better.

blakesmith

Main things to discuss:

1. Exception design. I'm not a fan of using subclasses to communicate error state, but that could be up for debate. Curious to hear what others think here. It's important to get this right since this forms an error interface that needs stability, and we should have a pattern not just for auth failures but for other failure types, so that callers can handle the different error states bubbling up from the client.
2. Pluggable reconnect behavior, perhaps not just in cases of auth failure. Would like to think about this more generally. It might be better to have a pluggable DisconnectBehavior interface on the client that could also handle other errors and drive either retryable behavior or behavior where we should abort.
3. Building on point 2, what about the case of legitimately bad credentials? Will the current implementation just bombard and DoS the auth server?
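
To make point 2 concrete, here is one possible shape for such a hook. It is a sketch only: `DisconnectBehavior` is the name floated above, but the method signature and both implementations are assumptions, and nothing like this exists in the client today.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical hook; the signature is an assumption for discussion. */
interface DisconnectBehavior {
    /**
     * @param errorCode           error reported for the dropped connection, e.g. "E_AUTH_FAILED"
     * @param consecutiveFailures how many times in a row this host has failed (1 for the first)
     * @return delay before the next reconnect attempt, or empty to stop retrying early
     */
    Optional<Duration> onDisconnect(String errorCode, int consecutiveFailures);
}

/** Keeps today's behavior: never requests an early reconnect. */
class BubbleUpBehavior implements DisconnectBehavior {
    @Override
    public Optional<Duration> onDisconnect(String errorCode, int consecutiveFailures) {
        return Optional.empty();
    }
}

/** Reconnects at once for the first few auth failures, then gives up. */
class ImmediateThenAbortBehavior implements DisconnectBehavior {
    private final int maxAttempts;

    ImmediateThenAbortBehavior(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Optional<Duration> onDisconnect(String errorCode, int consecutiveFailures) {
        boolean isAuthFailure = "E_AUTH_FAILED".equals(errorCode);
        if (!isAuthFailure || consecutiveFailures > maxAttempts) {
            return Optional.empty(); // not our case, or too many failures in a row
        }
        return Optional.of(Duration.ZERO);
    }
}
```

Returning an empty `Optional` for "abort" keeps the interface to a single method, so callers could also supply a lambda.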

src/main/java/com/sproutsocial/nsq/AuthFailedException.java

blakesmith · 2025-10-15T16:30:03Z

src/main/java/com/sproutsocial/nsq/SubConnection.java

+    @Override
+    protected void handleAuthFailure() {
+        // Trigger immediate reconnection when auth session expires
+        subscription.getSubscriber().immediateCheckConnections(topic);


We need to make reconnect behavior pluggable. A few different ways a caller might want to handle authentication failures:

Keep existing behavior, let auth failures bubble out of the client.

Exponential / linear retry backoff.

For the case of actual bad credentials, won't an "immediate reconnect" bombard the auth server with infinite retries as fast as it can? At a minimum, we should probably give up after a certain number of authentication failures, since this would happen in the case of legitimate bad credentials.
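
As a rough illustration of the backoff and give-up options, the delay calculation could look like the following. `BoundedBackoff` and its parameters are hypothetical names, not part of this PR.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper: exponential backoff with a cap and a failure limit. */
class BoundedBackoff {
    private final Duration baseDelay;
    private final Duration maxDelay;
    private final int maxFailures;

    BoundedBackoff(Duration baseDelay, Duration maxDelay, int maxFailures) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.maxFailures = maxFailures;
    }

    /**
     * @param failures consecutive authentication failures so far (1 for the first)
     * @return how long to wait before retrying, or empty once the limit is exceeded
     */
    Optional<Duration> delayFor(int failures) {
        if (failures < 1) {
            throw new IllegalArgumentException("failures must be >= 1");
        }
        if (failures > maxFailures) {
            return Optional.empty(); // give up, e.g. the credentials really are bad
        }
        long capMillis = maxDelay.toMillis();
        long delayMillis = baseDelay.toMillis();
        // Double once per previous failure, and stop doubling once the cap is reached.
        for (int i = 1; i < failures && delayMillis < capMillis; i++) {
            delayMillis *= 2;
        }
        return Optional.of(Duration.ofMillis(Math.min(delayMillis, capMillis)));
    }
}
```

With a 100 ms base, a 1 s cap, and a limit of 5, the delays are 100, 200, 400, 800, and 1000 ms, and the sixth failure returns empty.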

i considered this early on, but it is not an issue because checkConnections():

attempts to create a new connection

encounters the exception via read()

never adds the connection to Subscription.connectionMap

the while (isReading) loop is not running at that point, so there is no infinite loop.
instead, we will see logger.error("error connecting to:{}", activeHost, e); every minute - this is the current behavior.

jsadn · 2025-10-15T18:20:38Z

@blakesmith

explained why there is no infinite loop when we are outside of the reading loop, and added a test to verify
refactored from subclasses to an error codes enum
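
For readers following the exception-design thread, an error-codes enum might look roughly like this. It is a sketch, not the code in this PR: `NsqErrorCode`, `CodedNsqException`, and `fromServerMessage` are hypothetical names, and only `E_AUTH_FAILED` is taken from this discussion.

```java
/** Hypothetical error codes; only E_AUTH_FAILED comes from this thread. */
enum NsqErrorCode {
    AUTH_FAILED("E_AUTH_FAILED"),
    UNKNOWN(null);

    private final String wirePrefix;

    NsqErrorCode(String wirePrefix) {
        this.wirePrefix = wirePrefix;
    }

    /** Maps a server error message to a code by prefix, falling back to UNKNOWN. */
    static NsqErrorCode fromServerMessage(String message) {
        if (message == null) {
            return UNKNOWN;
        }
        for (NsqErrorCode code : values()) {
            if (code.wirePrefix != null && message.startsWith(code.wirePrefix)) {
                return code;
            }
        }
        return UNKNOWN;
    }
}

/** Hypothetical single exception type that carries a code instead of using subclasses. */
class CodedNsqException extends RuntimeException {
    private final NsqErrorCode code;

    CodedNsqException(NsqErrorCode code, String message) {
        super(message);
        this.code = code;
    }

    NsqErrorCode getCode() {
        return code;
    }
}
```

Callers can then branch on `getCode()` rather than on the exception class, so new failure types can be added without introducing a new class for each.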

i would like to push back on needing to generalize reconnection behavior, because the ripple effect is large and the established pattern is deeply rooted. Subscriber reconnection behavior is a "periodic lookup" via lookupIntervalSecs, which calls checkConnections. by default this happens every 60 seconds, which i assume most users keep. so most usages, with no configuration, do "check connections every minute and reconnect if needed". we are addressing a bug by saying "check connections every minute and reconnect if needed, but in the event of a transient authentication failure do it immediately instead of waiting for the minute interval".
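
The "periodic check plus one immediate check" pattern described here can be sketched with a single-threaded scheduler. This is an illustration only: `PeriodicChecker` is a hypothetical class, and the real Subscriber may schedule its checks differently.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: periodic connection checks with an on-demand extra check. */
class PeriodicChecker implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable safeCheck;

    PeriodicChecker(Runnable checkConnections, long lookupIntervalSecs) {
        // An exception escaping a fixed-rate task cancels all later runs, so catch it here.
        this.safeCheck = () -> {
            try {
                checkConnections.run();
            } catch (RuntimeException e) {
                System.err.println("connection check failed: " + e);
            }
        };
        scheduler.scheduleAtFixedRate(
                safeCheck, lookupIntervalSecs, lookupIntervalSecs, TimeUnit.SECONDS);
    }

    /** Queues one extra check now, on the same thread, so it never overlaps a periodic check. */
    void immediateCheck() {
        scheduler.execute(safeCheck);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Because the extra check runs on the same single thread as the periodic one, the periodic schedule is unchanged; an auth failure only adds one earlier run.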

jsadn added 2 commits October 15, 2025 09:40

handle auth failures with immediate reconnection

e10192d

handle auth failures with immediate reconnection

f76d6c4

jsadn requested review from blakesmith, david-huber and micahben October 15, 2025 14:55

blakesmith reviewed Oct 15, 2025

jsadn added 2 commits October 15, 2025 13:10

factor into error codes

90c78c4

major version

3d48d0e
