LWT routing: Tablet path loses natural token-ring order (Paxos leader not prioritized) #781
Description
Problem
When tablets are enabled, LWT queries lose their natural token-ring replica ordering, meaning the Paxos leader is not prioritized. This adds an extra network hop and increases latency for every LWT query routed through the tablet path.
Root Cause
In TokenAwarePolicy.make_query_plan() (cassandra/policies.py:509-513), the tablet code path constructs the replica list from the child policy's query plan order rather than the natural token-ring order:

    if tablet is not None:
        replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
        child_plan = child.make_query_plan(keyspace, query)
        replicas = [host for host in child_plan if host.host_id in replicas_mapped]

The child policy's make_query_plan() yields hosts in round-robin order (starting from a rotating position). Filtering this by tablet membership preserves the round-robin order, not the token-ring order. The Paxos leader (first natural replica) can end up at any position in the list.
Even though LWT queries skip the shuffle(replicas) call at lines 517-518, the ordering is already wrong by that point because it came from the child policy, not from the token map.
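The effect can be reproduced without the driver. The following is a minimal sketch, not the driver's code: host IDs are plain strings, `round_robin_plan` is a hypothetical stand-in for a round-robin child policy, and `ring_order` stands in for the order stored in tablet.replicas, with the first entry assumed to be the Paxos leader.

```python
from itertools import cycle, islice

# Hypothetical stand-ins: host IDs are plain strings, not driver Host objects.
ring_order = ["A", "B", "C"]           # assumed tablet.replicas order; "A" is the Paxos leader
all_hosts = ["A", "B", "C", "D", "E"]  # every host the child policy knows about


def round_robin_plan(hosts, start):
    """Return each host once, beginning at a rotating start position."""
    return list(islice(cycle(hosts), start, start + len(hosts)))


def tablet_plan_current(start):
    """Mimic the current tablet path: filter the child plan by set membership."""
    replicas_mapped = set(ring_order)  # ordering information is discarded here
    return [h for h in round_robin_plan(all_hosts, start) if h in replicas_mapped]


# One plan per possible rotation of the child policy.
plans = [tablet_plan_current(start) for start in range(len(all_hosts))]
leader_first = sum(1 for plan in plans if plan[0] == ring_order[0])

for start, plan in enumerate(plans):
    print(start, plan)
print("leader first in", leader_first, "of", len(plans), "plans")
```

In this toy setup the leader heads the plan only when the rotation happens to reach it before the other replicas, so some fraction of LWT queries are sent to a non-leader first.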
Impact
- Affects all child policies when tablets are enabled (unlike LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader #780, which only affects RackAwareRoundRobinPolicy)
- Extra network hop per LWT query (non-leader must forward Paxos proposal)
- Increased Paxos latency and contention
- The tablet.replicas field contains the correct replica order, but it's only used as a set for membership testing, discarding the ordering information
Proposed Fix
For LWT queries on the tablet path, use the order from tablet.replicas directly instead of the child policy's round-robin order:

    if tablet is not None:
        if query is not None and query.is_lwt():
            # For LWT, preserve tablet replica order (= ring order) for Paxos leader affinity
            replicas = [host for host_id, _ in tablet.replicas
                        for host in [self._cluster_metadata.get_host_by_host_id(host_id)]
                        if host is not None]
        else:
            replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
            child_plan = child.make_query_plan(keyspace, query)
            replicas = [host for host in child_plan if host.host_id in replicas_mapped]

The LWT path should also bypass yield_in_order() as described in #780.
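The ordering behaviour of the proposed LWT branch can be checked in isolation. This is a sketch under stated assumptions: `FakeHost` and `FakeMetadata` are hypothetical stand-ins for the driver's host and cluster metadata objects, and `lwt_replicas` is an illustrative helper that spells out the list comprehension above as a loop. The second element of each tablet.replicas tuple is treated as opaque.

```python
class FakeHost:
    """Hypothetical stand-in for a driver host; only host_id is modelled."""

    def __init__(self, host_id):
        self.host_id = host_id


class FakeMetadata:
    """Hypothetical stand-in exposing only get_host_by_host_id()."""

    def __init__(self, hosts):
        self._by_id = {h.host_id: h for h in hosts}

    def get_host_by_host_id(self, host_id):
        # Returns None for unknown IDs, which the caller must tolerate.
        return self._by_id.get(host_id)


def lwt_replicas(tablet_replicas, metadata):
    """Resolve replicas in tablet order, skipping hosts the metadata does not know."""
    replicas = []
    for host_id, _ in tablet_replicas:
        host = metadata.get_host_by_host_id(host_id)
        if host is not None:
            replicas.append(host)
    return replicas


# Metadata insertion order deliberately differs from tablet order; "B" is unknown.
metadata = FakeMetadata([FakeHost("C"), FakeHost("A")])
tablet_replicas = [("A", 0), ("B", 3), ("C", 1)]

ordered_ids = [h.host_id for h in lwt_replicas(tablet_replicas, metadata)]
print(ordered_ids)
```

The result follows the tablet order regardless of how the metadata stores its hosts, and a replica missing from the metadata is dropped rather than yielded as None. Whether hosts that are down should also be filtered at this point is not addressed by this sketch.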
Related
- #780, "LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader": RackAwareRoundRobinPolicy demotes Paxos leader for LWT (non-tablet path)
- PR #651, "(improvement)Optimize DCAware/RackAware/TokenAware/HostFilter policies with host distance caching and overall perf. improvements": query plan optimization (distance caching, LRU for token-to-replicas)