Rerunning commitTransaction, with the recoveryToken added in SERVER-37344, on a new mongos blocks forever. It also seems to leave the cluster in a state where it cannot accept any writes (even to other databases), yet the shard still reports itself as the primary. In addition, neither the shard server nor the config server shuts down normally; both have to be killed with SIGKILL.
To reproduce, start a sharded cluster with at least two mongoses (my cluster has one config server and a one-node shard), then run the repro script: reproHangingCommit.js
$ mongo reproHangingCommit.js
MongoDB shell version v4.0.1
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 4.1.7
WARNING: shell and server versions do not match
Starting transaction on mongos #1: { "insert" : "test", "documents" : [ { "_id" : ObjectId("5c4a55e0542fbbcc137ad1cd") } ], "lsid" : { "id" : UUID("6f579bae-6919-4e07-ac80-fe056861b2b9") }, "txnNumber" : NumberLong(1), "autocommit" : false, "startTransaction" : true }
Commit transaction on mongos #1: { "commitTransaction" : 1, "lsid" : { "id" : UUID("6f579bae-6919-4e07-ac80-fe056861b2b9") }, "txnNumber" : NumberLong(1), "autocommit" : false, "recoveryToken" : { "shardId" : "demo-set-0" } }
Commit transaction on mongos #2: { "commitTransaction" : 1, "lsid" : { "id" : UUID("6f579bae-6919-4e07-ac80-fe056861b2b9") }, "txnNumber" : NumberLong(1), "autocommit" : false, "recoveryToken" : { "shardId" : "demo-set-0" } }
// Hangs forever waiting for the commit on mongos #2
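The command documents in the log above can be sketched as plain JavaScript objects. This is a minimal, hypothetical sketch, not the contents of reproHangingCommit.js: the helper names `buildStartTxnInsert` and `buildCommitWithRecoveryToken` are illustrative, and `txnNumber` is a plain number here where the mongo shell would use `NumberLong(1)`. What it shows is that the retry on mongos #2 sends exactly the same commitTransaction document, recovery token included, as the original commit on mongos #1.

```javascript
// Hypothetical helpers (not from the original repro script) that build the
// command documents printed in the log. Field names match the log output.

// First statement of a multi-statement transaction: an insert that starts it.
// lsid: logical session id, e.g. { id: <UUID> } (a UUID string here).
// txnNumber: transaction number (NumberLong in the mongo shell).
function buildStartTxnInsert(lsid, txnNumber, collName, docs) {
  return {
    insert: collName,
    documents: docs,
    lsid: lsid,
    txnNumber: txnNumber,
    autocommit: false,
    startTransaction: true,
  };
}

// commitTransaction carrying the recovery token returned to the client, so
// the same commit can be retried through a different mongos.
function buildCommitWithRecoveryToken(lsid, txnNumber, recoveryToken) {
  return {
    commitTransaction: 1,
    lsid: lsid,
    txnNumber: txnNumber,
    autocommit: false,
    recoveryToken: recoveryToken,
  };
}
```

In the mongo shell, the assumed flow would be to send the insert through the first mongos, then run the same commit document against the `admin` database on both mongoses, for example with `new Mongo(host).getDB("admin").runCommand(cmd)` for each host. The second `runCommand` is the call that never returns in this report.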
db.currentOp() reports an ongoing coordinateCommitTransaction command that never ends. I've attached an example currentOp output at the bottom of the repro script.
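To spot the stuck coordinator without reading the full currentOp output, the result can be filtered for long-running coordinateCommitTransaction operations. This is a hypothetical sketch: `findLongRunningCommitCoordinators` is an illustrative name, and it assumes the command name appears as a top-level key of each operation's `command` field inside the `inprog` array, with elapsed time in `secs_running`, as is usual for currentOp output.

```javascript
// Hypothetical helper (not from the original report): given the document
// returned by db.currentOp(), return the operations running the internal
// coordinateCommitTransaction command for at least `minSecs` seconds.
function findLongRunningCommitCoordinators(currentOpResult, minSecs) {
  const ops = (currentOpResult && currentOpResult.inprog) || [];
  return ops.filter(function (op) {
    const cmd = op.command || {};
    const isCoordinator = Object.prototype.hasOwnProperty.call(
      cmd,
      "coordinateCommitTransaction"
    );
    return isCoordinator && (op.secs_running || 0) >= minSecs;
  });
}
```

In the mongo shell this would be called as `findLongRunningCommitCoordinators(db.currentOp(), 60)` on the node where the hanging operation was observed; an op that keeps reappearing with a growing `secs_running` matches the behavior described above.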
- is related to SERVER-39349 Recovering the state of a completed single-shard transaction should not block (Closed)
- related to SERVER-37344 Implement recovery token for retrying a commit command on a different mongos (Closed)