Summary
The Spanner DBAPI layer (spanner_dbapi) always retries aborted transactions internally by replaying all recorded statements and validating checksums. There is no way to disable this behavior. Applications that implement their own transaction retry logic (re-invoking a callable with a fresh session on abort) experience nested retry loops that cause severe contention amplification under concurrent writes.
Background
When commit() receives an Aborted exception from Spanner, the DBAPI enters an internal retry loop in TransactionRetryHelper.retry_transaction(). This loop replays all statements recorded during the transaction and validates checksums of read results to ensure consistency. It retries up to 50 times with exponential backoff.
This mechanism was designed for Django and other PEP 249 ORMs that build transactions incrementally through individual cursor.execute() calls (original motivation: googleapis/python-spanner-django#34). In this model, the DBAPI layer is the only component that can retry — the ORM has no concept of "re-run this transaction from scratch."
However, many applications use a different pattern: wrapping the entire transaction in a callable and re-invoking it on abort (similar to Session.run_in_transaction). For these applications, the internal retry is unnecessary and harmful.
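Such an application-level retry loop typically looks like the following sketch. The helper name is hypothetical, and `Aborted` is a local stand-in for `google.api_core.exceptions.Aborted` to keep the snippet self-contained:

```python
import random
import time

class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

def run_with_retries(work, max_attempts=10, base_delay=0.01):
    """Re-invoke `work` from scratch on every abort, with jittered
    exponential backoff. `work` must open its own fresh transaction,
    perform all reads and writes, and commit, so re-running it is safe."""
    for attempt in range(max_attempts):
        try:
            return work()
        except Aborted:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Because each attempt re-reads current data, this pattern does not depend on replayed results matching earlier ones, which is exactly where the internal replay struggles.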
The nested retry problem
When an application wraps transactions in its own retry loop and the DBAPI also retries internally, the two layers interfere:
- Contention amplification (thundering herd): The internal replay re-acquires locks on the same rows that caused the original abort. Under concurrent writes, each replay attempt can abort another thread's replay, leading to exponential retry growth across threads.
- Wasted wall-clock time: The internal retry loop accumulates 13–19 seconds of lock wait time (observed in production with 10 concurrent writers) before finally raising RetryAborted. The outer application retry then starts fresh, having wasted all that time.
- Checksum mismatches on contended rows: For read-modify-write patterns, replayed reads almost always return different data (because another transaction committed in between), causing _compare_checksums() to fail. The internal retry is structurally unable to succeed in this scenario — it always falls through to RetryAborted after exhausting retries.
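The structural failure of the last point is easy to see in isolation. A toy illustration (the checksum function here is illustrative, not the library's actual implementation):

```python
import hashlib
import pickle

def checksum(rows):
    # Illustrative order-sensitive digest of a result set.
    return hashlib.sha256(pickle.dumps(rows)).hexdigest()

# First attempt: the transaction reads balance = 100, and the DBAPI
# records a checksum of that result before the commit aborts.
recorded = checksum([(100,)])

# A concurrent transaction commits balance = 120 in the meantime, so
# the internal replay re-executes the read and sees different data.
replayed = checksum([(120,)])

# The mismatch is detected, and the replay cannot transparently succeed.
assert recorded != replayed
```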
Relevant code paths
- connection.py L505-515 — Connection.commit() catches Aborted, calls retry_transaction(), then recursively calls commit()
- transaction_helper.py L165-210 — TransactionRetryHelper.retry_transaction()
- checksum.py L64-80 — _compare_checksums() raises RetryAborted on checksum mismatch
- exceptions.py L165-172 — RetryAborted
Timeline
Other Spanner clients already expose this choice:
- RETRY_ABORTS_INTERNALLY — connection property to toggle internal retries
- ReadWriteStmtBasedTransaction (with internal retry) vs ReadWriteTransaction (without) as separate APIs
Proposed Change
Add a retry_aborts_internally parameter to Connection and connect(), following the same pattern used for read_only and request_priority:
- Default True — preserves existing behavior; no breaking change
- When False — commit() wraps Aborted in RetryAborted and raises immediately, bypassing the statement-replay loop
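The resulting control flow in commit() could be sketched roughly as follows. Class and method names are simplified stand-ins, not the real DBAPI internals:

```python
class Aborted(Exception):
    """Stand-in for google.api_core.exceptions.Aborted."""

class RetryAborted(Exception):
    """Stand-in for the DBAPI's RetryAborted."""

class Connection:
    """Simplified sketch of the proposed commit() control flow."""

    def __init__(self, retry_aborts_internally=True):
        self._retry_aborts_internally = retry_aborts_internally
        self._attempts = 0

    @property
    def retry_aborts_internally(self):
        return self._retry_aborts_internally

    @retry_aborts_internally.setter
    def retry_aborts_internally(self, value):
        self._retry_aborts_internally = bool(value)

    def commit(self):
        try:
            self._commit_once()
        except Aborted as exc:
            if not self._retry_aborts_internally:
                # New behavior: surface the abort immediately so the
                # application's own retry loop can re-run from scratch.
                raise RetryAborted("transaction aborted") from exc
            self._retry_transaction()  # existing statement-replay path
            self.commit()

    def _commit_once(self):
        # Placeholder for the real commit RPC: aborts on the first try.
        self._attempts += 1
        if self._attempts == 1:
            raise Aborted("simulated abort")

    def _retry_transaction(self):
        # Placeholder for TransactionRetryHelper.retry_transaction().
        pass
```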
Files changed
- connection.py — Add retry_aborts_internally parameter to __init__ and connect(), add a property getter/setter, and modify commit() to check the flag
- test_connection.py — 8 new unit tests
Usage

```python
from google.cloud.spanner_dbapi import connect
from sqlalchemy import create_engine

# Default (unchanged) — internal retry enabled
conn = connect(instance_id, database_id, project=project)

# Disable internal retry for application-managed retries
conn = connect(instance_id, database_id, project=project,
               retry_aborts_internally=False)

# SQLAlchemy via connect_args
engine = create_engine("spanner:///...",
                       connect_args={"retry_aborts_internally": False})
```
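With the flag disabled, the application owns recovery end to end. A sketch of that pattern follows; the connection factory is passed in and RetryAborted is a local stand-in so the snippet stays self-contained (in real code it would be imported from google.cloud.spanner_dbapi.exceptions):

```python
import random
import time

class RetryAborted(Exception):
    """Stand-in for google.cloud.spanner_dbapi.exceptions.RetryAborted."""

def run_transaction(connect_fn, work, max_attempts=10, base_delay=0.01):
    """Re-run `work` on a fresh connection each time commit() surfaces
    an abort (as proposed when retry_aborts_internally=False)."""
    for attempt in range(max_attempts):
        conn = connect_fn()  # e.g. connect(..., retry_aborts_internally=False)
        try:
            result = work(conn)
            conn.commit()
            return result
        except RetryAborted:
            conn.rollback()
            if attempt == max_attempts - 1:
                raise
            # Jittered exponential backoff before re-running from scratch.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        finally:
            conn.close()
```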
Production impact
In our workload (10 concurrent writers updating JSON array columns on the same row):
| Configuration | Success rate | Abort-to-recovery time |
| --- | --- | --- |
| Default (nested retries) | ~55% | 13–19 seconds |
| retry_aborts_internally=False + app retry | 98–100% | 0.01–0.08 seconds |
Related
- RETRY_ABORTS_INTERNALLY connection property in other Spanner clients
- ReadWriteStmtBasedTransaction vs ReadWriteTransaction in the Go client