-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Sharding
-
Fully Compatible
-
ALL
-
Sharding 2019-03-11
-
20
While running a sharded transaction, the router tracks in-memory each shard that has been involved in the transaction. When a new shard is targeted by a statement, the router adds it to the participant list as a pending participant and attaches startTransaction=true and the active transaction's txnId to the next request sent to that shard.
The shard is considered "pending" because of the shard version version protocol, which means the router won't know the shard was able to satisfy the request sent to it until it gets an OK response. To handle the case where a pending participant returns a stale version error, the router will abort the active txnId on each pending participant (not just the one that returned the error), wait for each to respond, then retry the current statement, possibly targeting a different set of shards. This allows the router to handle these errors within a transaction, instead of returning a transient error and making the client retry with a new txnId.
To enable this behavior, shards can accept a request with startTransaction=true more than once for a txnId, but only if the shard's local transaction is in the aborted state. This relies on the abort sent to every pending participant before retrying reaching each one after the first requests with startTransaction=true, which is not guaranteed in an asynchronous network. If the abort arrives before the first request on a shard and that shard is targeted by the retry, the retry will be rejected by that shard because it will have an in-progress transaction at that txnId.
- related to
-
SERVER-39704 Allow mongos to retry on stale version and snapshot errors within a transaction
- Backlog