- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication, Sharding
- None
- Fully Compatible
- ALL
- v4.4, v4.2, v4.0
- Repl 2021-04-19, Repl 2021-05-03, Repl 2021-05-17
- 70
SessionUpdateTracker::_updateSessionInfo() is used by secondary oplog application to coalesce multiple updates to the same config.transactions record into a single update for the most recent retryable write statement. The changes from 02020fa as part of SERVER-47844 made it possible for a secondary to choose its stable timestamp as a majority-committed timestamp from within an oplog batch rather than always at a batch boundary. The combination of these two changes can lead to the following sequence:
- During single batch of oplog application:
- User data write for stmtId=0 at t=10.
- User data write for stmtId=1 at t=11.
- User data write for stmtId=2 at t=12.
- Session txn record write at t=12 with stmtId=2 as lastWriteOpTime.
- In particular, there is no session txn record write at t=10 with stmtId=0 as lastWriteOpTime, nor at t=11 with stmtId=1 as lastWriteOpTime, because those updates were coalesced by the SessionUpdateTracker.
- Rollback to stable timestamp t=10.
- The session txn record won't exist with stmtId=0 as lastWriteOpTime (because the write was entirely skipped by oplog application) despite the user data write for stmtId=0 being reflected on-disk. This allows stmtId=0 to be re-executed by this node if it became primary.
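The sequence above can be modeled with a small Python sketch. This is an illustrative, heavily simplified simulation, not the server's actual code; the names (`OplogEntry`, `coalesce_session_updates`, `apply_batch_then_rollback`) are invented for the example. It shows how coalescing the session record to the latest statement, combined with a rollback to a stable timestamp inside the batch, leaves the stmtId=0 user write on disk with no matching session record:

```python
# Simplified model of the bug: names here are illustrative, not the
# actual server classes (SessionUpdateTracker etc.).
from dataclasses import dataclass


@dataclass
class OplogEntry:
    ts: int       # oplog timestamp
    stmt_id: int  # retryable write statement id


def coalesce_session_updates(batch):
    """Mimic SessionUpdateTracker behavior: only the most recent statement
    in the batch produces a config.transactions update."""
    return batch[-1]


def apply_batch_then_rollback(batch, stable_ts):
    # User data writes land at their own timestamps.
    user_writes = {e.ts: e.stmt_id for e in batch}
    # The session txn record is written once, at the coalesced (latest) entry.
    last = coalesce_session_updates(batch)
    session_record = {"lastWriteOpTime": last.ts, "stmtId": last.stmt_id}
    # Rollback-to-stable discards everything newer than stable_ts.
    surviving_writes = {ts: s for ts, s in user_writes.items() if ts <= stable_ts}
    surviving_session = (
        session_record if session_record["lastWriteOpTime"] <= stable_ts else None
    )
    return surviving_writes, surviving_session


batch = [OplogEntry(10, 0), OplogEntry(11, 1), OplogEntry(12, 2)]
writes, session = apply_batch_then_rollback(batch, stable_ts=10)
print(writes)   # the stmtId=0 user write at t=10 survives
print(session)  # ...but no session record does, so stmtId=0 can re-execute
```

Running this shows the surviving user write at t=10 paired with no session record at all, which is exactly the state that permits re-execution of stmtId=0 on a new primary.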
Impact on 4.0, 4.2, and 4.4 branches
The stable optime candidates list prevents this issue for retryable inserts, updates, and deletes applied during secondary oplog application.
However, retryable inserts on primaries also coalesce multiple updates to the same config.transactions record into a single update for the most recent retryable write statement. This happens through OpObserverImpl::onInserts() calling TransactionParticipant::onWriteOpCompletedOnPrimary() once for a batch of insert statements (a.k.a. a vectored insert).
- Only retryable inserts are impacted.
- A retry attempt fails with a DuplicateKey error so long as the document wasn't deleted by another client in the meantime; otherwise, the document is re-inserted.
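Why the lost session record surfaces as a DuplicateKey error can be sketched as follows. This is an assumed, simplified model (the function and exception names are invented, not the server's API): with the config.transactions record gone, the retry attempt is not recognized as already executed, so the insert statement re-executes and collides with the document that survived on disk:

```python
# Illustrative sketch of the retry path; names are invented for the example.
class DuplicateKeyError(Exception):
    pass


def retry_insert(docs, executed_stmt_ids, stmt_id, doc_id):
    """One attempt of a retryable insert against a set of existing _ids."""
    if stmt_id in executed_stmt_ids:
        return "already executed"     # normal path: session record found
    if doc_id in docs:
        # Session record lost -> the statement re-executes and collides.
        raise DuplicateKeyError(doc_id)
    docs.add(doc_id)
    executed_stmt_ids.add(stmt_id)
    return "inserted"


docs = {1}        # the document with _id=1 survived the rollback on disk
executed = set()  # ...but its config.transactions record did not
try:
    retry_insert(docs, executed, stmt_id=0, doc_id=1)
except DuplicateKeyError:
    print("retry failed with DuplicateKey")
```

If another client had deleted the document in the meantime (`docs` empty), the same retry would silently re-insert it, which matches the second bullet above.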
Impact on 4.9 and master branches
The stable optime candidates list was removed and so this issue exists for retryable inserts, updates, and deletes applied during secondary oplog application. Retryable inserts on primaries continue to coalesce multiple updates to the same config.transactions record into a single update of the most recent retryable write statement.
- All of retryable inserts, updates, and deletes are impacted.
- A retry attempt for an update can execute more than once (e.g. double increment a counter).
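The double-execution hazard for updates can be sketched the same way. Again this is an assumed, minimal model with invented names, not server code: losing the session record makes the retry indistinguishable from a first attempt, so an increment applies twice:

```python
# Illustrative sketch of a retried $inc-style update; names are invented.
def retry_update_inc(doc, executed_stmt_ids, stmt_id):
    """One attempt of a retryable update that increments a counter."""
    if stmt_id in executed_stmt_ids:
        return doc                  # normal path: retry is a no-op
    doc["counter"] += 1             # statement (re-)executes
    executed_stmt_ids.add(stmt_id)
    return doc


doc = {"counter": 0}
executed = set()
retry_update_inc(doc, executed, stmt_id=0)  # original execution: counter -> 1
executed.clear()                            # session record lost in rollback
retry_update_inc(doc, executed, stmt_id=0)  # retry re-executes: counter -> 2
print(doc["counter"])  # 2
```

With the stable optime candidates list removed, nothing prevents the rollback from landing mid-batch, so this double increment is reachable for retryable updates, not just inserts.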
This issue was discovered while reasoning through why the atClusterTime read on config.transactions to fix SERVER-54626 was insufficient (hence SERVER-55214). Shout out to daniel.gottlieb for the assist!
is related to:
- SERVER-54626 Retryable writes may execute more than once in resharding if statements straddle the fetchTimestamp (Closed)
- SERVER-55214 Resharding txn cloner can miss config.transactions entry when fetching (Closed)
- SERVER-56631 Retryable write pre-fetch phase could miss entry from config.transactions when reading from donor secondaries (Closed)
- SERVER-56796 Support atClusterTime snapshot reads on config.transactions (Backlog)
- SERVER-47844 Update _setStableTimestampForStorage to set the stable timestamp without using the stable optime candidates set when EMRC=true (Closed)
- SERVER-47845 Remove obsolete code related to storing and updating stable optime candidates (Closed)

related to:
- SERVER-55578 Disallow atClusterTime reads on the config.transactions collection (Closed)