The AsyncRequestSender (ARS) can retry a remote command request on retriable errors, up to kMaxNumFailedHostRetryAttempts times. Per BF-17754, it is not safe for the ARS to retry remote commands that run inside transactions, because doing so can lead to a crash in the following case:
- An unprepared transaction is aborted by the RstlKillOpThread on stepdown, and the command fails with InterruptedDueToReplStateChange.
- The ARS retries the remote command request for that transaction statement (with startTransaction: true) against the new primary, and the transaction is committed there with two-phase commit.
- When the oplog entry for prepare is replicated to the old primary, the transaction state in the old primary’s TransactionParticipant is still AbortedWithoutPrepare, so the transaction state assertion fails and the op applier hits a fatal assertion.
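The sequence above can be modeled with a toy state machine. This is a minimal sketch in Python, not the server's C++ code: `TxnState`, the `ALLOWED` table, `InvalidTransition`, and `apply_prepare_oplog_entry` are hypothetical names, and the table only encodes the assumption that a transaction in AbortedWithoutPrepare cannot move to a prepared state.

```python
from enum import Enum, auto


class TxnState(Enum):
    IN_PROGRESS = auto()
    ABORTED_WITHOUT_PREPARE = auto()
    PREPARED = auto()


# Assumed, illustrative transition table (not the server's real one):
# only an in-progress transaction may become prepared.
ALLOWED = {
    TxnState.IN_PROGRESS: {TxnState.ABORTED_WITHOUT_PREPARE, TxnState.PREPARED},
    TxnState.ABORTED_WITHOUT_PREPARE: set(),
    TxnState.PREPARED: set(),
}


class InvalidTransition(Exception):
    """Stands in for the failed state assertion on the old primary."""


def transition(current: TxnState, new: TxnState) -> TxnState:
    """Return the new state, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise InvalidTransition(f"{current.name} -> {new.name}")
    return new


def apply_prepare_oplog_entry(state: TxnState) -> TxnState:
    """Hypothetical stand-in for the old primary applying the replicated prepare."""
    return transition(state, TxnState.PREPARED)


# Stepdown aborts the unprepared transaction on the old primary.
old_primary_state = transition(TxnState.IN_PROGRESS, TxnState.ABORTED_WITHOUT_PREPARE)
```

In this model, calling `apply_prepare_oplog_entry(old_primary_state)` raises `InvalidTransition`, which plays the role of the fatal assertion in the op applier.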
So the ARS should check whether the operation has startTransaction: true and, if it does, not retry it. There is already a rough CR patch for testing this case.
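The proposed check could look roughly like the following. This is a hedged Python sketch of the idea only, not the ARS implementation: `send_with_retries`, `starts_transaction`, `RetriableError`, and the `send` callback are hypothetical, and the value of `MAX_NUM_FAILED_HOST_RETRY_ATTEMPTS` is a placeholder rather than the server's kMaxNumFailedHostRetryAttempts.

```python
from typing import Any, Callable, Dict

# Placeholder value for illustration; not the server's actual constant.
MAX_NUM_FAILED_HOST_RETRY_ATTEMPTS = 3


class RetriableError(Exception):
    """Hypothetical stand-in for errors such as InterruptedDueToReplStateChange."""


def starts_transaction(cmd: Dict[str, Any]) -> bool:
    """True when the command document carries startTransaction: true."""
    return cmd.get("startTransaction") is True


def send_with_retries(cmd: Dict[str, Any], send: Callable[[Dict[str, Any]], Any]) -> Any:
    """Send cmd, retrying on retriable errors unless it starts a transaction."""
    failures = 0
    while True:
        try:
            return send(cmd)
        except RetriableError:
            failures += 1
            # Never retry a statement that starts a transaction, and stop
            # once the retry budget is exhausted.
            if starts_transaction(cmd) or failures > MAX_NUM_FAILED_HOST_RETRY_ATTEMPTS:
                raise
```

With this guard, a command carrying startTransaction: true is sent once and its error is surfaced to the caller, while other commands keep the existing retry behavior.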
- is related to: SERVER-47645 Must invalidate all sessions on step down (Closed)
- related to: SERVER-88289 Remove manual check that skips retrying requests with startTransaction in ARS (Backlog)