Skip to content

Conversation

@jsadn
Copy link
Contributor

@jsadn jsadn commented Oct 15, 2025

issue: all of our nsq consumers log error E_AUTH_FAILED periodically and intermittently on average once a week. after thorough investigation the root cause appears to be discrepancy in state between NSQ auth session and TCP connection when using nsq with auth.

in this PR we:

  1. catch this error
  2. close connection.
  3. trigger reconnection (this happens every minute on a scheduled task right now)
  4. dont log error
  What Happens:

  Hour 1: Connection established
    Client: "Here's my TCP handshake" ✓
    Client: "AUTH token123" ✓
    NSQ: "OK, you're authenticated" → Stores in memory

  Hour 2-10: Everything works
    Client: "NOP" (heartbeat) ✓
    NSQ: "OK, TCP connection alive" ✓
    Client: "SUB topic channel" ✓
    NSQ: Checks memory: "Is this connection authed?" ✓ → Allow

  Hour 11: NSQ internal cleanup runs
    NSQ: "Clean up auth sessions older than 10 hours"
    NSQ: Removes auth state from memory
    TCP connection: Still open! Still sending heartbeats!

  Hour 11+:
    Client: "NOP" (heartbeat) ✓
    NSQ: "OK, TCP connection alive" ✓  ← NSQ doesn't check auth for heartbeats

    Client: "SUB topic channel"
    NSQ: Checks memory: "Is this connection authed?" ✗
    NSQ: "E_AUTH_FAILED"

@blakesmith
Copy link
Contributor

@jsadn Can you outline the issue this PR is addressing in more detail? I don't understand how to read your paste dump in the description.

@jsadn
Copy link
Contributor Author

jsadn commented Oct 15, 2025

@blakesmith i found the verbose example helpful but updated to summarize better.

Copy link
Contributor

@blakesmith blakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main things to discuss:

  1. Exception design. I'm not a fan of using subclasses to communicate error state, but that could be up for debate. Curious to hear what others think here. It's important to get this right since this forms an error interface that needs stability, and we should have a pattern not just for auth failures, but other failure types that allows callers to handle different error states bubbling up from the client.
  2. Pluggable reconnect behavior, perhaps not just in cases of auth failure. Would like to think about this more generally. It might be better to have a pluggable DisconnectBehavior interface to the client, that could also handle other errors that occur and drive retryable behavior, or behavior where we should abort.
  3. Building on point 2, what about the case of legitimate bad credentials. Will that just bombard and DoS the auth server with this current implementation?

@Override
protected void handleAuthFailure() {
// Trigger immediate reconnection when auth session expires
subscription.getSubscriber().immediateCheckConnections(topic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make reconnect behavior pluggable. A few different ways a caller might want to handle authentication failures:

  1. Keep existing behavior, let auth failures bubble out of the client.
  2. Exponential / linear retry backoff.

For the case of actual bad credentials, won't an "immediate reconnect" bombard the auth server with infinite retries as fast as it can? At a minimum, we should probably give up after a certain number of authentication failures, since this would happen in the case of legitimate bad credentials.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i considered this early on but is not an issue because checkConnections() :

  1. attempts to create a new connection
  2. via read() will encounter the exception
  3. never adds the connection to Subscription.connectionMap
  4. while (isReading) is not running so there is not infinite loop.
    instead, we will see logger.error("error connecting to:{}", activeHost, e); every minute - this is current behavior.

@jsadn
Copy link
Contributor Author

jsadn commented Oct 15, 2025

@blakesmith

  • explained why when out of a reading loop there is no infinite loop and added test to verify
  • refactored from subclasses to error codes enum

i would like to push back on needing to generalize reconnection behavior because the ripple effect is large and the established pattern is deeply rooted. Subscriber reconnection behavior is a "periodic lookup" via lookupIntervalSecs which calls checkConnections. by default this happens every 60 seconds which i assume most users use. So most usages, with no configuration do "check connections every minute and reconnect if needed". we are addressing a bug by saying "check connections every minute and reconnect if needed, but in the event of a transient authentication failure do it immediately instead of waiting for the minute interval".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants