Open
Description
If a subscription is registered while Registries are down, the subscription will fail and not be retried.
Cause of failure - Registry Outage:
If the Broker nodes are not directly connected to the targets, the registrations fail: the node is aware of the targets, but there is no channel for the back request. When the node later becomes reachable, no subscription event is triggered, because the "LearnedFrom" value is updated but no message is sent to the TopologyTracker.
Cause of failure - Topology Instability:
If Registry nodes are unavailable and then become available, registrations can fail because the path changes while the registration command is in flight.
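To make the in-flight failure concrete, here is a minimal sketch of a retry loop that looks the path up again before every attempt, so a route that changed mid-flight is picked up on the next try. It is not DRP code: `register_with_retry`, `resolve_path`, `send_register`, and `PathChangedError` are hypothetical names used only for illustration, and the backoff values are arbitrary.

```python
import time
from typing import Callable, List, Optional


class PathChangedError(Exception):
    """Hypothetical error: the route changed while the command was in flight."""


def register_with_retry(
    resolve_path: Callable[[], Optional[List[str]]],
    send_register: Callable[[List[str]], str],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a registration, re-resolving the path before every attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        path = resolve_path()  # look the route up again on each attempt
        if path is not None:
            try:
                return send_register(path)
            except PathChangedError as err:
                last_error = err  # route moved mid-flight; try again
        # Exponential backoff between attempts: base_delay, 2x, 4x, ...
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(
        f"registration failed after {attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. A real implementation would also need to decide whether a registration that fails after the last attempt should be queued until the topology changes again.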
Steps to replicate:
- Run the startTestEnvironment script
- Log into Broker1 (zone1) and Broker2 (zone2)
- Terminate the Registry nodes
- On each Broker node, open the DRP shell and execute "watch -g TestStream"
- Restart the Registry nodes
Ideas for remediation:
- Update subscription topology monitoring to detect whether a node is reachable. If a node becomes reachable, re-evaluate subscriptions.
- On loss of the control plane, remove from the topology tables any nodes that are neither connected nor reachable.
- Implement RPC retries
- Implement a distance-vector algorithm
- Add a delay to subscriptions so they wait for stability after reconnecting to the control plane
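The first idea can be sketched as follows. This is an illustrative model, not the DRP implementation: `ReachabilityMonitor` and its methods are hypothetical, and it assumes a node counts as reachable when it is directly connected or when the node it was learned from is directly connected. The point is that every topology update compares reachability before and after, and fires a callback on each unreachable-to-reachable transition so subscriptions can be re-evaluated.

```python
from typing import Callable, Dict, Optional, Set


class ReachabilityMonitor:
    """Fires a callback when a known node goes from unreachable to reachable."""

    def __init__(self, on_reachable: Callable[[str], None]) -> None:
        # node ID -> ID of the node it was learned from (None if learned directly)
        self._learned_from: Dict[str, Optional[str]] = {}
        self._connected: Set[str] = set()
        self._on_reachable = on_reachable

    def _is_reachable(self, node_id: str) -> bool:
        # Assumption: reachable means directly connected, or learned from
        # a node that is directly connected.
        if node_id in self._connected:
            return True
        relay = self._learned_from.get(node_id)
        return relay is not None and relay in self._connected

    def update(self, node_id: str, learned_from: Optional[str] = None,
               connected: bool = False) -> None:
        """Record a topology change and report newly reachable nodes."""
        before = {n for n in self._learned_from if self._is_reachable(n)}
        self._learned_from[node_id] = learned_from
        if connected:
            self._connected.add(node_id)
        else:
            self._connected.discard(node_id)
        # One update can change reachability for several nodes, so check all.
        for known in sorted(self._learned_from):
            if known not in before and self._is_reachable(known):
                self._on_reachable(known)
```

A short usage example, with made-up node names:

```python
events = []
monitor = ReachabilityMonitor(events.append)
monitor.update("Provider1", learned_from="Registry1")  # known, not yet reachable
monitor.update("Registry1", connected=True)            # relay comes back
print(events)  # ['Provider1', 'Registry1']
```

In a real fix the callback would re-run subscription evaluation for the node, possibly after the settle delay suggested in the last item, rather than append to a list.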
Labels
bug