-
Notifications
You must be signed in to change notification settings - Fork 29
[INFRA-9517] handle auth failures with immediate reconnection #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@jsadn Can you outline the issue this PR is addressing in more detail? I don't understand how to read your paste dump in the description. |
|
@blakesmith i found the verbose example helpful but updated to summarize better. |
blakesmith
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main things to discuss:
- Exception design. I'm not a fan of using subclasses to communicate error state, but that could be up for debate. Curious to hear what others think here. It's important to get this right since this forms an error interface that needs stability, and we should have a pattern not just for auth failures, but other failure types that allows callers to handle different error states bubbling up from the client.
- Pluggable reconnect behavior, perhaps not just in cases of auth failure. Would like to think about this more generally. It might be better to have a pluggable
DisconnectBehaviorinterface to the client, that could also handle other errors that occur and drive retryable behavior, or behavior where we should abort. - Building on point 2, what about the case of legitimate bad credentials. Will that just bombard and DoS the auth server with this current implementation?
| @Override | ||
| protected void handleAuthFailure() { | ||
| // Trigger immediate reconnection when auth session expires | ||
| subscription.getSubscriber().immediateCheckConnections(topic); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make reconnect behavior pluggable. A few different ways a caller might want to handle authentication failures:
- Keep existing behavior, let auth failures bubble out of the client.
- Exponential / linear retry backoff.
For the case of actual bad credentials, won't an "immediate reconnect" bombard the auth server with infinite retries as fast as it can? At a minimum, we should probably give up after a certain number of authentication failures, since this would happen in the case of legitimate bad credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i considered this early on but is not an issue because checkConnections() :
- attempts to create a new connection
- via
read()will encounter the exception - never adds the connection to
Subscription.connectionMap while (isReading)is not running so there is not infinite loop.
instead, we will seelogger.error("error connecting to:{}", activeHost, e);every minute - this is current behavior.
i would like to push back on needing to generalize reconnection behavior because the ripple effect is large and the established pattern is deeply rooted. |
issue: all of our nsq consumers log
errorE_AUTH_FAILEDperiodically and intermittently on average once a week. after thorough investigation the root cause appears to be discrepancy in state between NSQ auth session and TCP connection when using nsq with auth.in this PR we: