- Type: Bug
- Resolution: Done
- Priority: Critical - P2
- Affects Version/s: None
- Component/s: Replication, Write Ops
- None
- Fully Compatible
- ALL
- v3.6
- Repl 2017-10-02, Repl 2017-10-23, Repl 2017-11-13, Repl 2017-12-04, Repl 2017-12-18, Query 2018-03-12, Query 2018-03-26, Query 2018-04-09, Query 2018-04-23
- 68
Consider the following sequence of events during a batch insert of 1000 documents with ordered:true and a w:majority writeConcern:
- Insert 500 documents and unlock
- Pause the inserting thread
- Another node steps up and the original primary rolls back the 500 writes already done
- The original primary steps back up
- The inserting thread then does the remaining writes which get new optimes
- That thread then waits for majority confirmation of the last writes, and successfully returns to the user
In this case we've lost 500 writes that were reported as w:majority confirmed, and we've written later ops without the earlier ops even though ordered:true was set. This is caused by a combination of three things: not killing all ops (at least all writing ops) on all replSet stepdown paths; not closing all connections; and always asking "can I currently write to this namespace" rather than "have I always been able to write to this namespace since starting this op".
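The sequence above can be reproduced in a toy model. This is only a sketch: `ToyPrimary` and `run_batch_with_failover` are hypothetical names, the rollback is assumed to discard every write done so far, and the new term number is arbitrary. None of it is server code. It shows why a point-in-time "can I currently write" check lets the second half of the batch through.

```python
class ToyPrimary:
    """Hypothetical stand-in for a replica set member (not MongoDB code)."""

    def __init__(self):
        self.is_primary = True
        self.term = 1
        self.oplog = []  # entries are (term, document) pairs

    def insert(self, doc):
        # Point-in-time check: "can I currently write?"
        if not self.is_primary:
            raise RuntimeError("not primary")
        self.oplog.append((self.term, doc))

    def step_down_and_roll_back(self):
        # Another node stepped up; the unreplicated writes are discarded.
        self.is_primary = False
        self.oplog.clear()

    def step_up(self, new_term):
        self.is_primary = True
        self.term = new_term


def run_batch_with_failover(batch):
    node = ToyPrimary()
    half = len(batch) // 2
    for doc in batch[:half]:
        node.insert(doc)
    # The inserting thread is paused here with the lock released.
    node.step_down_and_roll_back()
    node.step_up(new_term=3)
    # The thread resumes; the point-in-time check passes again.
    for doc in batch[half:]:
        node.insert(doc)
    return node.oplog


oplog = run_batch_with_failover(list(range(1000)))
print(len(oplog))  # 500
print(oplog[0])    # (3, 500)
```

Only documents 500 to 999 survive, all stamped with the later term, yet the batch as a whole never saw an error.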
This issue also affects any operation that writes multiple oplog entries with a release of the global lock in between, as well as "no-op" ops that get the last optime after releasing the global lock. A non-exhaustive list:
- All batch write operations (insert, update, delete)
- Multi-update and Multi-delete
- Agg with $out
- MapReduce
Potential solutions:
- Fail all write ops and waitForWriteConcern if the electionId (or rbid) changed since the op began
- Interrupt all write ops (or all ops) on all stepdown paths. This also requires either:
a) Ensuring all write ops check for interrupt every time they acquire the global lock, after acquiring it (currently they check before acquiring it), or
b) Making all lock acquisitions call checkForInterrupt (this is already planned to support interruptible locking)
- Record the term at the beginning of every operation; in the logOp (and awaitReplication) code, check that the term of the write matches what was recorded, and abort the write if it does not.
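As an illustration of the term-recording option, here is a minimal sketch in a toy model. All names (`ToyNode`, `Operation`, `StaleTermError`, `log_op`) are hypothetical and only loosely mirror the logOp idea from the ticket; this is not how the server implements it. The operation remembers the term it started in, and every write is rejected once the node's term has moved on.

```python
class StaleTermError(Exception):
    """Raised when a write belongs to an operation from an earlier term."""


class ToyNode:
    """Hypothetical stand-in for a replica set member (not MongoDB code)."""

    def __init__(self):
        self.is_primary = True
        self.term = 1
        self.oplog = []  # entries are (term, document) pairs

    def log_op(self, doc, op_term):
        # "Have I always been able to write since this op started?"
        if not self.is_primary or op_term != self.term:
            raise StaleTermError(
                f"op began in term {op_term}, node is now in term {self.term}"
            )
        self.oplog.append((self.term, doc))

    def fail_over_and_back(self, new_term):
        # Step down, roll back earlier writes, then step up in a later term.
        self.oplog.clear()
        self.term = new_term
        self.is_primary = True


class Operation:
    def __init__(self, node):
        self.node = node
        self.start_term = node.term  # recorded once, when the op begins

    def write(self, doc):
        self.node.log_op(doc, self.start_term)


def run_guarded_batch(batch):
    node = ToyNode()
    op = Operation(node)
    half = len(batch) // 2
    for doc in batch[:half]:
        op.write(doc)
    node.fail_over_and_back(new_term=3)
    try:
        for doc in batch[half:]:
            op.write(doc)
    except StaleTermError:
        return "aborted", node.oplog
    return "ok", node.oplog


status, oplog = run_guarded_batch(list(range(1000)))
print(status, len(oplog))  # aborted 0
```

With the check in place, the batch fails at the first write after the failover instead of silently writing the second half without the first.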
- causes
  - SERVER-34682 Old primary should vote yes and store the last vote after stepdown on learning of a higher term (Closed)
- is related to
  - SERVER-38354 Allow shutdown error when reading last applied optime on startup (Closed)
  - SERVER-31277 Cancel all user operations on heartbeat stepdown path (Closed)
  - SERVER-27545 Include RBID in replSet metadata of command replies (Closed)
- related to
  - SERVER-34672 Unable to add shard on 3.7.5 sharded cluster with mmapv1 shard (Closed)
  - SERVER-37574 Force reconfig should kill user operations (Closed)
  - SERVER-68874 Consider making waitAfterPinningCursorBeforeGetMoreBatch only hang instead of also fiddling with locks (while-loop taking and releasing locks) (Closed)
  - SERVER-37381 Allow prepared transactions to survive state transitions (Closed)