Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.2.18, 4.4.10, 5.0.4, 5.1.0-rc0
Affects Version/s: None
Component/s: Sharding
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.0, v4.4, v4.2
Linked BF Score: 48
Confidence Status: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

Summary

The WriteConflictHelpers used by the transaction_write_conflicts.js test do not ensure that every transaction they create is aborted or committed. As a result, the transaction-related locks may never be released, which in the associated BF caused the validate_collections.js hook to hang forever.

This issue only occurs with sharded collections that span multiple shards, because in those cases a transaction can start on one shard but not on another. The shards then differ in which transactions are "present" and able to be aborted or committed.

Context

The validate_collections.js test is a hook that runs after a test completes; in this case, after transaction_write_conflicts.js.

The latter test uses the T1StartsFirstAndWins and T2StartsSecondAndWins helpers to define test cases that, given two separate operations, assert that the correct one is prevented from taking effect and returns a WriteConflict. These helpers are defined in write_conflicts.js.

The Sequence

1. The test runs the T1StartsFirstAndWins test case under the "multidelete-multiupdate" section. The writeConflictTest helper creates the session used to create the failing transaction.
2. Using the created session/txn information, the provided operation is executed. It fails with a WriteConflict error on Shard 1, which causes the transaction to be aborted on Shard 1 (L-2368). This is the first abortTransaction command Shard 0 receives.
3. The test helper commits the transaction, which causes the MongoS to send the commitTransaction command to the pertinent shards. The command is expected to fail with NoSuchTransaction because of the abort caused by the WriteConflict, but it also fails because Shard 0 hasn't started the transaction yet.
4. The failure of the commitTransaction results in MongoS implicitly aborting the transaction on all pertinent shards. This is the second abortTransaction command Shard 0 receives.
5. Shard 0 starts the transaction (L-2402). Because it started after all of the aborts/commits, it lasts until the transactionLifetimeLimitSeconds limit, which in the test environments is 24 hours.
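
The ordering problem in the sequence above can be replayed with a toy model. This is only a sketch: ToyShard and replaySequence are made-up names, and the model tracks nothing but which txnNumbers a shard currently has open. It is not MongoDB code and does not model locks or the MongoS.

```javascript
// Toy model only: each "shard" just remembers which txnNumbers it has open.
class ToyShard {
    constructor(name) {
        this.name = name;
        this.openTxns = new Set();
    }

    // The first statement of a transaction reaches the shard.
    start(txnNumber) {
        this.openTxns.add(txnNumber);
    }

    // abortTransaction/commitTransaction: only succeeds if the shard knows the txn.
    finish(txnNumber) {
        return this.openTxns.delete(txnNumber) ? "ok" : "NoSuchTransaction";
    }
}

function replaySequence() {
    const shard0 = new ToyShard("shard0");
    const shard1 = new ToyShard("shard1");
    const txnNumber = 0;
    const shard0Replies = [];

    // The operation reaches Shard 1, hits the WriteConflict and is aborted there.
    shard1.start(txnNumber);
    shard1.finish(txnNumber);

    // First abortTransaction arrives at Shard 0 before its statement does.
    shard0Replies.push(shard0.finish(txnNumber));

    // The commit fails, so a second abortTransaction is sent to Shard 0.
    shard0Replies.push(shard0.finish(txnNumber));

    // Only now does Shard 0 start the transaction; nothing is left to end it.
    shard0.start(txnNumber);

    return {
        shard0Replies,
        shard0Open: [...shard0.openTxns],
        shard1Open: [...shard1.openTxns],
    };
}

const outcome = replaySequence();
console.log(outcome);
```

In this model both aborts sent to Shard 0 find nothing to abort, and the transaction that starts afterwards is still open at the end, which is the state that the validate hook later blocks on.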

Proposed Solution
At the end of T1StartsFirstAndWins and T2StartsSecondAndWins, add the following logic:

1. Start a new transaction with the session associated with the failing transaction.
2. Send a no-op command to all of the shards (or, alternatively, to all shards that didn't have the WriteConflict error).
3. Commit the transaction started in step 1.
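
A minimal sketch of the three steps, not the committed fix. The helper name flushStragglingTransaction and its parameters are hypothetical, and using a find as the "no-op" command is an assumption; the ticket does not say which command to use. The session methods (startTransaction, commitTransaction, getDatabase) are the ones the shell's session object provides. The recording stand-in exists only so the call order can be checked without a cluster.

```javascript
// Hypothetical helper: name and parameters are illustrative, not from write_conflicts.js.
// perShardFilters is assumed to hold one filter per shard, each targeting that shard.
function flushStragglingTransaction(session, dbName, collName, perShardFilters) {
    const coll = session.getDatabase(dbName).getCollection(collName);

    // Step 1: new transaction on the session that owned the failing transaction.
    session.startTransaction();

    // Step 2: a harmless read per shard (assumed to be an acceptable no-op).
    for (const filter of perShardFilters) {
        coll.find(filter).itcount();
    }

    // Step 3: commit the transaction started in step 1.
    session.commitTransaction();
}

// Recording stand-in for a shell session; it only logs the calls it receives.
function makeRecordingSession() {
    const calls = [];
    const coll = {
        find(filter) {
            return {
                itcount() {
                    calls.push("find " + JSON.stringify(filter));
                    return 0;
                },
            };
        },
    };
    return {
        calls,
        startTransaction() { calls.push("startTransaction"); },
        commitTransaction() { calls.push("commitTransaction"); },
        getDatabase(name) {
            return {getCollection(collectionName) { return coll; }};
        },
    };
}

// Placeholder database, collection and filters; a real test would use its own.
const fakeSession = makeRecordingSession();
flushStragglingTransaction(fakeSession, "test", "coll", [{_id: -1}, {_id: 1}]);
console.log(fakeSession.calls);
```

The reasoning behind this shape, as far as the ticket implies it: once every shard has seen a statement from a newer transaction on the same session, no shard should still be holding the older, unfinished one.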

Relevant Logs

[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict: Encountered error from localhost:20503 during a transaction :: caused by :: WriteConflict
[ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.483+0000 D4 TXN      [conn51] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5
[ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.487+0000 I  TXN      [conn51] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1629930826, 89) } }, readTimestamp:Timestamp(0, 0), terminationCause:aborted timeActiveMicros:4799 timeInactiveMicros:21 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 6 } }, Global: { acquireCount: { r: 3, w: 2 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 6 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 4ms
[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.481+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 New transaction started
[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict: Encountered error from localhost:20503 during a transaction :: caused by :: WriteConflict
[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.496+0000 I  TXN      [conn97] transaction parameters:{ lsid: { id: UUID("93007dc5-7a2c-456a-8241-92fef7de25c5"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 0, autocommit: false, readConcern: { afterClusterTime: Timestamp(1629930826, 89) } }, numParticipants:2, terminationCause:aborted, abortCause:WriteConflict, timeActiveMicros:15122, timeInactiveMicros:0, 15ms
[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.512+0000 D3 TXN      [conn97] 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Implicitly aborting transaction on 2 shard(s) due to error: NoSuchTransaction: 93007dc5-7a2c-456a-8241-92fef7de25c5:0 Failed to commit transaction because a previous statement on the transaction participant shard-rs0 was unsuccessful.
[ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D4 TXN      [conn94] New transaction started with txnNumber: 0 on session with lsid 93007dc5-7a2c-456a-8241-92fef7de25c5
[ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D3 TXN      [conn94] Inserting coordinator 93007dc5-7a2c-456a-8241-92fef7de25c5:0 into in-memory catalog
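
The lines above are not pasted in chronological order, so it can help to sort them by timestamp. The sketch below does that for abbreviated copies of five of them (session ids and error tails trimmed for width). parseLine and byTimestamp are illustrative names. Timestamps are compared as strings, which works here only because they all share the same fixed-width format and UTC offset.

```javascript
// Abbreviated copies of log lines from this ticket (lsid and error tails trimmed).
const excerpt = [
    "[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.489+0000 D3 TXN [conn97] Implicitly aborting transaction on 2 shard(s) due to error: WriteConflict",
    "[ShardedClusterFixture:job2:shard1:primary] 2021-08-25T22:33:46.483+0000 D4 TXN [conn51] New transaction started with txnNumber: 0",
    "[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.481+0000 D3 TXN [conn97] New transaction started",
    "[ShardedClusterFixture:job2:mongos] 2021-08-25T22:33:46.512+0000 D3 TXN [conn97] Implicitly aborting transaction on 2 shard(s) due to error: NoSuchTransaction",
    "[ShardedClusterFixture:job2:shard0:primary] 2021-08-25T22:33:46.517+0000 D4 TXN [conn94] New transaction started with txnNumber: 0",
];

// Split "[source] timestamp rest" into its three parts.
function parseLine(line) {
    const match = /^\s*\[([^\]]+)\]\s+(\S+)\s+(.*)$/.exec(line);
    if (match === null) {
        throw new Error("unrecognized log line: " + line);
    }
    return {source: match[1], ts: match[2], rest: match[3]};
}

// Plain string comparison; valid because every timestamp has the same layout.
function byTimestamp(a, b) {
    return a.ts < b.ts ? -1 : a.ts > b.ts ? 1 : 0;
}

const ordered = excerpt.map(parseLine).sort(byTimestamp);
for (const entry of ordered) {
    // Drop the "ShardedClusterFixture:job2:" prefix to keep the output narrow.
    const node = entry.source.split(":").slice(2).join(":");
    console.log(entry.ts, node, entry.rest);
}
```

Sorted this way, the shard0 "New transaction started" line comes last, after both "Implicitly aborting" lines from the mongos, which matches the sequence described earlier in the ticket.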

Assignee: Luis Osta (Inactive)

Reporter: Luis Osta (Inactive)

Participants: Githook User, Luis Osta, Vivian Ge

Votes: 0

Watchers: 2

Created: Sep 13 2021 07:00:47 PM UTC

Updated: Oct 29 2023 09:48:40 PM UTC

Resolved: Oct 01 2021 02:57:17 PM UTC

Confidence Status Last Update: 01/Oct/21 2:57 PM

GA Target Date: None

Public Preview Target Date: None

Private Preview Target Date: None

Experiment Target Date: None
