Summary
The WriteConflictHelpers used by the transaction_write_conflicts.js test do not assure that all transaction it has created are aborted/comitted. This means that its possible for the transaction-related locks to never be released, which in the case of the associated BF called the validate_collections.js hook to hang forever.
This issue only occurs in multi-shard sharded collections as in those cases its possible that a transaction can start in one shard but not another. Causing there to be a difference in which transactions are "present" and able to be aborted/committed.
Context
The validate_collections.js test is a hook that runs after the completion of a running test. In this case transaction_write_conflicts.js.
The latter test utilizes the T1StartsFirstAndWins and T2StartsFirstAndWins to define tests that given two separate operations, assure that the correct one is prevented from taking effect and returns a WriteConflict. These helpers are defined in write_conflicts.js
The Sequence
- The test runs the T1StartsFirstAndWins test case under the "multidelete-multiupdate" section . The writeConflictTest helper will create the session used to create the failing transaction
- Using the created session/txn information, the provided operation gets executed. Which fails with a WriteConflict error on Shard 1 and causes the transaction to be aborted on shard 1 L-2368. This is the first abortTransaction command Shard 0 receives.
- The test helper commits the transaction, which causes the MongoS to send the commit transaction command to the pertinent shards, which fails with NoSuchTransaction as Shard 0 hasn't started the transaction yet. The command expected to fail with NoSuchTransaction because of the abort caused by the WriteConflict. But it also fails because Shard 0 hasn't started the transaction yet.
- The failure of the commitTransaction results in MongoS implicitly aborting the transaction to all pertinent shards. This is the second abortTransaction command Shard 0 receives.
- Shard 0 starts the transaction L-2402. But since it was started after all of the aborts/commits, it will last until the transactionLifetimeLimitSeconds limit. Which in the test environments is 24 hours.
Proposed Solution
At the end of T1StartsFirstAndWins and T2StartsSecondAndWins. Add the following logic:
- Start a new transaction with the session associated with the failing transaction
- Send a no-op command to all of the shards (or alternatively all shards that didn't have the WriteConflict error)
- Commit the transaction started in step 1.
Relevant Logs
[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict: Encountered error from localhost:20503 during a transaction :: caused by :: WriteConflict [ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.483+0000 D4 TXN [conn51] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5 [ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.487+0000 I TXN [conn51] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1629930826, 89) } }, readTimestamp:Timestamp(0, 0), terminationCause:aborted timeActiveMicros:4799 timeInactiveMicros:21 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 6 } }, Global: { acquireCount: { r: 3, w: 2 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 6 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 4ms [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.481+0000 D3 TXN [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 New transaction started [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict: Encountered error from localhost:20503 during a transaction :: caused by :: WriteConflict [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.496+0000 I TXN [conn97] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { afterClusterTime: Timestamp(1629930826, 89) } }, numParticipants:2, terminationCause:aborted, abortCause:WriteConflict, timeActiveMicros:15122, timeInactiveMicros:0, 15ms [ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.512+0000 D3 TXN [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: NoSuchTransaction: 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Failed to commit transaction because a previous statement on the transaction participant shard-rs0 was unsuccessful. [ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D4 TXN [conn94] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5 [ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D3 TXN [conn94] Inserting coordinator 93007dc5-7a2c-456a-8241-92fef7de25c5:0 into in-memory catalog