LWT routing: Tablet path loses natural token-ring order (Paxos leader not prioritized) #781

@mykaul

Problem

When tablets are enabled, LWT queries lose their natural token-ring replica ordering, so the Paxos leader is not prioritized. Whenever the first replica tried is not the leader, the query pays an extra network hop and higher latency on the tablet path.

Root Cause

In TokenAwarePolicy.make_query_plan() (cassandra/policies.py:509-513), the tablet code path constructs the replica list from the child policy's query plan order rather than the natural token-ring order:

if tablet is not None:
    replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
    child_plan = child.make_query_plan(keyspace, query)
    replicas = [host for host in child_plan if host.host_id in replicas_mapped]

The child policy's make_query_plan() yields hosts in round-robin order (starting from a rotating position). Filtering this by tablet membership preserves the round-robin order, not the token-ring order. The Paxos leader (first natural replica) can end up at any position in the list.

Even though LWT queries skip the shuffle(replicas) call at lines 517-518, the ordering is already wrong because it came from the child policy, not from the token map.
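The effect can be reproduced without the driver. The following is a minimal sketch, not the driver's code: hosts are plain strings, `round_robin_plan` and `tablet_path_replicas` are hypothetical helpers, and the first entry of `tablet_replicas` is assumed to be the Paxos leader.

```python
from itertools import cycle, islice

# Hypothetical stand-ins: plain strings instead of Host objects.
all_hosts = ["A", "B", "C", "D", "E"]
# Tablet replicas as (host_id, shard) pairs; "C" is listed first,
# so it is assumed to be the Paxos leader in this sketch.
tablet_replicas = [("C", 0), ("E", 1), ("A", 0)]


def round_robin_plan(hosts, position):
    """Return every host once, starting at a rotating position."""
    start = position % len(hosts)
    return list(islice(cycle(hosts), start, start + len(hosts)))


def tablet_path_replicas(position):
    """Mirror the current filter: a membership test against a set."""
    replicas_mapped = {host_id for host_id, _shard in tablet_replicas}
    plan = round_robin_plan(all_hosts, position)
    return [host for host in plan if host in replicas_mapped]


# The first replica tried depends on the rotation, not on tablet order.
first_choices = [tablet_path_replicas(p)[0] for p in range(len(all_hosts))]
print(first_choices)
```

In this toy setup the assumed leader "C" is tried first for only two of the five rotation positions; the other positions start with "A" or "E".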

Impact

  • Affects all child policies when tablets are enabled (unlike #780, "LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader", which only affects RackAwareRoundRobinPolicy)
  • Extra network hop per LWT query (non-leader must forward Paxos proposal)
  • Increased Paxos latency and contention
  • The tablet.replicas field contains the correct replica order, but it's only used as a set for membership testing, discarding the ordering information

Proposed Fix

For LWT queries on the tablet path, use the order from tablet.replicas directly instead of the child policy's round-robin order:

if tablet is not None:
    if query is not None and query.is_lwt():
        # For LWT, preserve tablet replica order (= ring order) for Paxos leader affinity
        replicas = [host for host_id, _ in tablet.replicas
                    for host in [self._cluster_metadata.get_host_by_host_id(host_id)]
                    if host is not None]
    else:
        replicas_mapped = set(map(lambda r: r[0], tablet.replicas))
        child_plan = child.make_query_plan(keyspace, query)
        replicas = [host for host in child_plan if host.host_id in replicas_mapped]

The LWT path should also bypass yield_in_order() as described in #780.
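To illustrate the difference between the two branches, here is a small sketch that runs outside the driver. `Host`, `hosts_by_id`, `lwt_replicas`, and `non_lwt_replicas` are hypothetical stand-ins; in particular, the dictionary lookup stands in for the metadata lookup by host ID used in the proposal above.

```python
from collections import namedtuple

# Hypothetical stand-in for the driver's Host object.
Host = namedtuple("Host", ["host_id", "address"])


def lwt_replicas(tablet_replicas, hosts_by_id):
    """Resolve replicas in tablet order, skipping unknown host IDs."""
    resolved = (hosts_by_id.get(host_id) for host_id, _shard in tablet_replicas)
    return [host for host in resolved if host is not None]


def non_lwt_replicas(tablet_replicas, child_plan):
    """Keep the child plan's order, filtered by tablet membership."""
    replicas_mapped = {host_id for host_id, _shard in tablet_replicas}
    return [host for host in child_plan if host.host_id in replicas_mapped]


hosts = [Host(name, f"10.0.0.{i}") for i, name in enumerate("ABCD", start=1)]
hosts_by_id = {host.host_id: host for host in hosts}

# "X" is deliberately unknown to the metadata stand-in.
tablet_replicas = [("C", 0), ("X", 1), ("A", 0)]

lwt_order = [h.host_id for h in lwt_replicas(tablet_replicas, hosts_by_id)]
non_lwt_order = [h.host_id for h in non_lwt_replicas(tablet_replicas, hosts)]
print(lwt_order, non_lwt_order)
```

The LWT branch keeps "C" first, while the filtered child plan starts with "A". Note that this sketch, like the snippet above, does not consult the child policy on the LWT branch, so any host filtering the child policy performs is not reflected here.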
