- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 3.2.0
- Component/s: Replication
- Fully Compatible
- ALL
- Repl E (01/08/16)
- 0
ISSUE SUMMARY
In a replica set, if a secondary node is shut down while replicating writes, the node may mark certain replicated operations as successfully applied even though they have not been.
This problem only applies to a “clean shutdown”, which occurs when the node is shut down via one of the following means:
- The shutdown command
- The Ctrl-C handler on Windows
- The following POSIX signals: TERM, HUP, INT, USR1, XCPU
Notably, this error does not apply to nodes that shut down abnormally. If a mongod process is ended due to a hard termination, such as via a KILL signal, it will not be subject to this bug.
USER IMPACT
If a secondary node is shut down while replicating writes, the node may end up in an inconsistent state with respect to the primary and other secondaries.
WORKAROUNDS
There are two workarounds for safely shutting down a secondary node running 3.2.0. They are described below.
Use a non-clean shutdown method
By inducing a non-clean shutdown, the bug can be avoided. This approach is safe on all deployments using WiredTiger, and all MMAP deployments with journaling enabled (the default).
On a system that supports POSIX signals, send a KILL (9) or QUIT (3) signal to the mongod process to shut it down. On Windows, use “tskill”. The storage engine and replication recovery code will bring the node back into a consistent state upon server restart.
This is a temporary workaround for 3.2.0 users. Do not use after upgrading to 3.2.1 or newer.
Remove the node from the replica set
Removing the node from its replica set configuration before shutting it down ensures that the node is not processing replicated writes at shutdown time.
Remove the node from the replica set configuration via the replSetReconfig command or rs.reconfig shell helper. Then, wait for the node to enter the REMOVED state before shutting it down.
AFFECTED VERSIONS
Only MongoDB 3.2.0 is affected by this issue.
FIX VERSION
The fix is included in the 3.2.1 production release.
Original description
In sync_tail.cpp, multiApply() assumes that applying the batch always succeeds, then sets minValid to acknowledge that:
// This write will not journal/checkpoint.
setMinValid(&txn, {start, end});
lastWriteOpTime = multiApply(&txn, ops);
setNewTimestamp(lastWriteOpTime.getTimestamp());
setMinValid(&txn, end, DurableRequirement::None);
minValidBoundaries.start = {};
minValidBoundaries.end = end;
finalizer.record(lastWriteOpTime);
multiApply() delegates the work to applyOps(), which simply schedules the work to worker threads:
// Doles out all the work to the writer pool threads and waits for them to complete
void applyOps(const std::vector<std::vector<BSONObj>>& writerVectors,
              OldThreadPool* writerPool,
              SyncTail::MultiSyncApplyFunc func,
              SyncTail* sync) {
    TimerHolder timer(&applyBatchStats);
    for (std::vector<std::vector<BSONObj>>::const_iterator it = writerVectors.begin();
         it != writerVectors.end();
         ++it) {
        if (!it->empty()) {
            writerPool->schedule(func, stdx::cref(*it), sync);
        }
    }
}
However, schedule() may return an error to indicate that shutdown is already in progress. sync_tail.cpp ignores this error and continues to mark the operation as finished.
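The failure mode can be demonstrated with a self-contained sketch. The types below (`Status`, `FakeThreadPool`) are hypothetical stand-ins for mongod's internals, not the real implementation; the point is the contrast between discarding schedule()'s return value, as 3.2.0 does, and propagating it so the caller treats the batch as incomplete.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for mongo::Status.
struct Status {
    bool ok;
    std::string reason;
};

// Hypothetical stand-in for OldThreadPool: schedule() fails once shutdown
// has begun, which is exactly the case the buggy code ignores.
class FakeThreadPool {
public:
    void beginShutdown() { _shuttingDown = true; }

    Status schedule(std::function<void()> task) {
        if (_shuttingDown) {
            return {false, "shutdown in progress"};
        }
        task();  // run inline for simplicity; the real pool uses worker threads
        return {true, ""};
    }

private:
    bool _shuttingDown = false;
};

// Buggy pattern: the Status from schedule() is discarded, so batches
// scheduled during shutdown are silently dropped yet still counted as applied.
void applyBatchIgnoringErrors(FakeThreadPool& pool, const std::vector<int>& ops, int& applied) {
    for (int op : ops) {
        pool.schedule([&applied, op] { applied += op; });  // Status ignored
    }
}

// Corrected pattern: surface the failure so the caller does not advance
// minValid past operations that were never applied.
bool applyBatchCheckingErrors(FakeThreadPool& pool, const std::vector<int>& ops, int& applied) {
    for (int op : ops) {
        Status s = pool.schedule([&applied, op] { applied += op; });
        if (!s.ok) {
            return false;  // batch must be treated as incomplete
        }
    }
    return true;
}
```

With the buggy variant, a pool that is already shutting down applies nothing, yet the caller has no way to know; the checking variant returns false in the same situation.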
If the shutdown happens after the operations have been scheduled, the secondary runs into another fassert, which is also unexpected. Restarting cannot repair the inconsistent state either. This has also been observed in repeated runs of backup_restore.js.
As a result, any kind of operations may be marked executed by mistake when shutting down the secondary, including commands and database operations, leading to an inconsistent state with the primary and potential missing/stale documents on secondaries.
To fix this issue, after the on_block_exit of the join call we need to check whether shutdown has occurred and, if so, return an empty OpTime to indicate that the batch is not complete.
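The proposed fix can be sketched as follows. The `OpTime`, `inShutdown()`, and `multiApplySketch` names below are simplified stand-ins for the real mongo::repl types and the actual change in sync_tail.cpp, shown only to illustrate the control flow: after joining the writer pool, a shutdown in progress yields a null OpTime so the caller never advances minValid past a possibly-unapplied batch.

```cpp
// Hypothetical stand-in for mongo::repl::OpTime.
struct OpTime {
    long long ts = 0;
    bool isNull() const { return ts == 0; }
};

// Hypothetical stand-in for the global shutdown check.
bool g_inShutdown = false;
bool inShutdown() { return g_inShutdown; }

// Sketch of the corrected multiApply(): if shutdown began while the batch
// was being applied, return an empty OpTime so the caller treats the batch
// as incomplete instead of recording it as successfully applied.
OpTime multiApplySketch(OpTime batchEnd) {
    // ... schedule work to the writer pool and join the workers here ...
    if (inShutdown()) {
        return OpTime{};  // empty optime: batch did not complete
    }
    return batchEnd;  // safe to advance minValid to the end of the batch
}
```

On the caller side, a null return means setMinValid must not be advanced, which is precisely what prevents the inconsistent state described above.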