Subscription failure during Registry outage or topology instability #245

@adhdtech

If a subscription is registered while the Registries are down, the subscription fails and is not retried.

Cause of failure - Registry Outage:
If the Broker nodes are not directly connected to the targets, the registrations fail: the Broker node is aware of the targets, but there is no channel for the back request. When a target later becomes reachable, no subscription event is triggered, because its "LearnedFrom" value is updated but no message is sent to the TopologyTracker.
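To make the missing step concrete, here is a minimal sketch of a topology table that notifies listeners when a node goes from unreachable to reachable, so failed subscriptions can be re-evaluated. This is not DRP's code: `TopologyTable`, `SubscriptionManager`, the `reachable` flag, and the node name `Provider1` are all hypothetical, and how reachability is actually determined is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class NodeEntry:
    node_id: str
    learned_from: Optional[str] = None
    reachable: bool = False  # assumption: the caller decides what "reachable" means


class TopologyTable:
    """Hypothetical stand-in for a topology table; not DRP's real API."""

    def __init__(self) -> None:
        self.entries: Dict[str, NodeEntry] = {}
        self._listeners: List[Callable[[str], None]] = []

    def on_became_reachable(self, listener: Callable[[str], None]) -> None:
        self._listeners.append(listener)

    def update(self, node_id: str, learned_from: str, reachable: bool) -> None:
        entry = self.entries.setdefault(node_id, NodeEntry(node_id))
        was_reachable = entry.reachable
        entry.learned_from = learned_from
        entry.reachable = reachable
        # The step described as missing above: announce the transition from
        # unreachable to reachable instead of only updating the record.
        if reachable and not was_reachable:
            for listener in self._listeners:
                listener(node_id)


class SubscriptionManager:
    """Hypothetical holder for subscriptions that failed to register."""

    def __init__(self, table: TopologyTable) -> None:
        self.pending: Dict[str, List[str]] = {}
        self.registered: List[Tuple[str, str]] = []
        table.on_became_reachable(self._retry_for)

    def mark_failed(self, node_id: str, stream: str) -> None:
        self.pending.setdefault(node_id, []).append(stream)

    def _retry_for(self, node_id: str) -> None:
        # Re-evaluate every subscription that previously failed for this node.
        for stream in self.pending.pop(node_id, []):
            self.registered.append((node_id, stream))


table = TopologyTable()
subs = SubscriptionManager(table)
table.update("Provider1", "Registry1", reachable=False)
subs.mark_failed("Provider1", "TestStream")
table.update("Provider1", "Broker1", reachable=True)  # triggers the retry
```

The listener fires only on the transition, so repeated updates for an already reachable node do not re-register anything.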

Cause of failure - Topology Instability:
If Registry nodes become unavailable and then come back, registrations can fail because the path changes while the registration command is in flight.

Steps to replicate:

  • Run the startTestEnvironment script
  • Log into Broker1 (zone1) and Broker2 (zone2)
  • Terminate the Registry nodes
  • On each Broker node, open a DRP shell and execute "watch -g TestStream"
  • Restart Registry nodes

Ideas for remediation:

  • Update subscription topology monitoring to detect whether a node is reachable. If a node becomes reachable, re-evaluate subscriptions.
  • On loss of the control plane, remove from the topology tables any Nodes that are neither connected nor reachable.
  • Implement RPC retries
  • Implement a distance vector algorithm
  • Add a delay to subscriptions so they wait for stability after reconnecting to the control plane
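As an illustration of the RPC retry idea, the following is a minimal sketch of retrying a registration call with exponential backoff. It is an assumption about how retries could look, not DRP's implementation: `retry_with_backoff`, its parameters, and the use of `ConnectionError` to signal a missing path are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds or `attempts` tries have failed.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    Only ConnectionError is retried (assumption: that is how a failed
    registration surfaces); the last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Hypothetical registration that fails twice while the path is unstable.
tries = []


def register_subscription() -> str:
    tries.append(1)
    if len(tries) < 3:
        raise ConnectionError("no path to target")
    return "subscribed"


result = retry_with_backoff(register_subscription, attempts=4, base_delay=0.01)
```

The growing delay also gives the topology some time to settle after the control plane reconnects, which overlaps with the last idea in the list; retries alone would not cover a target that stays unreachable for longer than the total backoff.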

Metadata

Labels: bug