-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Sharding, Testing Infrastructure
-
ALL
-
Sharding 2022-05-02
-
3
The changes from 50caf49 as part of SERVER-62377 added the following logic to the withTxnAndAutoRetry() helper function. Setting hasTransientError causes a new transaction to be started. However, FailedToSatisfyReadPreference doesn't guarantee the commitTransaction command wasn't sent because mongos may have retried the commitTransaction command internally due a retryable error and then encountered the FailedToSatisfyReadPreference error.
// FailedToSatisfyReadPreference errors are not retryable. // However, they should be because if there is no primary, there should be one soon. // TODO SERVER-60706: Make FailedToSatisfyReadPreference a transient error if (e.code == ErrorCodes.FailedToSatisfyReadPreference) { print("Retrying transaction due to a FailedToSatisfyReadPreference error."); hasTransientError = true; continue; }
The withTxnAndAutoRetry() helper function must retry only the commitTransaction command to learn the definitive result of the transaction. Upon seeing an error labeled with TransientTransactionError may the withTxnAndAutoRetry() helper function retry the entire transaction. Retrying the commitTransaction command on FailedToSatisfyReadPreference is perhaps best done inside commitTransactionWithKilledSessionRetries().
This has been observed to cause a spurious duplicate key error due to the mongo shell when running the create_and_drop_collection.js FSM workload in the concurrency_sharded_multi_stmt_txn_with_stepdowns Evergreen task.
[j0:s0] | 2022-03-25T16:07:09.162+00:00 D3 EXECUTOR 22607 [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 25962 -- target:[localhost:20003] db:test18_fsmdb0 cmd:{ insert: \"fsmcoll0\", bypassDocumentValidation: false, ordered: true, documents: [ { _id: 14.0 } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000'), Timestamp(0, 0) ], databaseVersion: { uuid: UUID(\"f2af1ae1-5f1a-41eb-b2ab-40f6eb21b128\"), timestamp: Timestamp(1648224429, 9), lastMod: 1 }, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 33, readConcern: { level: \"snapshot\", atClusterTime: Timestamp(1648224429, 9) }, startTransaction: true, coordinator: true, autocommit: false }"} [j0:s0] | 2022-03-25T16:07:09.181+00:00 D3 EXECUTOR 22607 [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 25964 -- target:[localhost:20003] db:admin cmd:{ commitTransaction: 1, writeConcern: { w: \"majority\", wtimeout: 300321, provenance: \"clientSupplied\" }, coordinator: true, autocommit: false, txnNumber: 33, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }"} ... [j0:s0] | 2022-03-25T16:07:09.194+00:00 D2 ASIO 22597 [conn1685] "Request finished with response","attr":{"requestId":25964,"isOK":true,"response":"{ readOnly: false, writeConcernError: { code: 11602, codeName: \"InterruptedDueToReplStateChange\", errmsg: \"operation was interrupted\", errInfo: { writeConcern: { w: \"majority\", wtimeout: 300321, provenance: \"clientSupplied\" } } }, ok: 1.0, topologyVersion: { processId: ObjectId('623de6b9393bc43269e4007c'), counter: 95 }, lastCommittedOpTime: Timestamp(1648224429, 11), $clusterTime: { clusterTime: Timestamp(1648224429, 13), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configTime: Timestamp(1648224429, 10), $topologyTime: Timestamp(1648223955, 17), operationTime: Timestamp(1648224429, 12) }"} ... [j0:s0] | 2022-03-25T16:07:24.195+00:00 D1 ASSERT 23074 [conn1685] "User assertion","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0","file":"src/mongo/s/transaction_router.cpp","line":207} [j0:s0] | 2022-03-25T16:07:24.195+00:00 D3 TXN 22897 [conn1685] "Implicitly aborting transaction on all participant shards","attr":{"sessionId":{"id":{"$uuid":"f5ff2e85-d579-4529-919c-675912543218"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":33,"txnRetryCounter":0,"numParticipantShards":1,"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set shard-rs0"} ... [j0:s0] | 2022-03-25T16:07:31.747+00:00 D3 EXECUTOR 22607 [conn1685] "Scheduling remote command request","attr":{"request":"RemoteCommand 26007 -- target:[localhost:20004] db:test18_fsmdb0 cmd:{ insert: \"fsmcoll0\", bypassDocumentValidation: false, ordered: true, documents: [ { _id: 14.0 } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000'), Timestamp(0, 0) ], databaseVersion: { uuid: UUID(\"f2af1ae1-5f1a-41eb-b2ab-40f6eb21b128\"), timestamp: Timestamp(1648224429, 9), lastMod: 1 }, lsid: { id: UUID(\"f5ff2e85-d579-4529-919c-675912543218\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 34, readConcern: { level: \"snapshot\", atClusterTime: Timestamp(1648224451, 2) }, startTransaction: true, coordinator: true, autocommit: false }"} [fsm_workload_test:create_and_drop_collection] [tid:0] Retrying transaction due to a FailedToSatisfyReadPreference error. ... [fsm_workload_test:create_and_drop_collection] uncaught exception: Error: 1 thread with tids [0] threw [fsm_workload_test:create_and_drop_collection] "errmsg" : "E11000 duplicate key error collection: test18_fsmdb0.fsmcoll0 index: _id_ dup key: { _id: 14.0 } found value: RecordId(1)", [fsm_workload_test:create_and_drop_collection] "op" : { [fsm_workload_test:create_and_drop_collection] "_id" : 14 [fsm_workload_test:create_and_drop_collection] }
- is caused by
-
SERVER-62377 withTxnAndAutoRetry() must retry commitTransaction command on FailedToSatisfyReadPreference error response
- Closed