- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 3.2.0
- Component/s: Replication
- Fully Compatible
- ALL
- Repl E (01/08/16)
- 0
ISSUE SUMMARY
In a replica set, if a secondary node is shut down while replicating writes, the node may mark certain replicated operations as successfully applied even though they have not been.
This problem only applies to a “clean shutdown”, which occurs when the node is shut down via one of the following means:
- The shutdown command
- The Ctrl-C handler on Windows
- The following POSIX signals: TERM, HUP, INT, USR1, XCPU
Notably, this error does not apply to nodes that shut down abnormally. If a mongod process is ended due to a hard termination, such as via a KILL signal, it will not be subject to this bug.
USER IMPACT
If a secondary node is shut down while replicating writes, the node may end up in an inconsistent state with respect to the primary and other secondaries.
WORKAROUNDS
There are two workarounds for safely shutting down a secondary node running 3.2.0. They are described below.
Use a non-clean shutdown method
By inducing a non-clean shutdown, the bug can be avoided. This approach is safe on all deployments using WiredTiger, and all MMAP deployments with journaling enabled (the default).
On a system that supports POSIX signals, send a KILL (9) or QUIT (3) signal to the mongod process to shut it down. On Windows, use “tskill”. The storage engine and replication recovery code will bring the node back into a consistent state upon server restart.
This is a temporary workaround for 3.2.0 users. Do not use after upgrading to 3.2.1 or newer.
Remove the node from the replica set
Removing the node from its replica set configuration before shutting it down ensures that the node is not processing replicated writes at shutdown time.
Remove the node from the replica set configuration via the replSetReconfig command or rs.reconfig shell helper. Then, wait for the node to enter the REMOVED state before shutting it down.
AFFECTED VERSIONS
Only MongoDB 3.2.0 is affected by this issue.
FIX VERSION
The fix is included in the 3.2.1 production release.
Original description
In sync_tail.cpp, multiApply() assumes that applying the batch always succeeds, then sets minValid to acknowledge that:
// This write will not journal/checkpoint.
setMinValid(&txn, {start, end});
lastWriteOpTime = multiApply(&txn, ops);
setNewTimestamp(lastWriteOpTime.getTimestamp());
setMinValid(&txn, end, DurableRequirement::None);
minValidBoundaries.start = {};
minValidBoundaries.end = end;
finalizer.record(lastWriteOpTime);
multiApply() delegates the work to applyOps(), which simply schedules the work to worker threads:
// Doles out all the work to the writer pool threads and waits for them to complete
void applyOps(const std::vector<std::vector<BSONObj>>& writerVectors,
              OldThreadPool* writerPool,
              SyncTail::MultiSyncApplyFunc func,
              SyncTail* sync) {
    TimerHolder timer(&applyBatchStats);
    for (std::vector<std::vector<BSONObj>>::const_iterator it = writerVectors.begin();
         it != writerVectors.end();
         ++it) {
        if (!it->empty()) {
            writerPool->schedule(func, stdx::cref(*it), sync);
        }
    }
}
However, schedule() may return an error to indicate that shutdown is already in progress. sync_tail.cpp ignores this error and continues to mark the operation as finished.
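The failure mode can be demonstrated with a self-contained sketch. The types below (`Status`, `FakeThreadPool`) are hypothetical stand-ins for mongod's internals, not the real implementation; the point is the contrast between discarding schedule()'s return value, as 3.2.0 does, and propagating it so the caller treats the batch as incomplete.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for mongo::Status.
struct Status {
    bool ok;
    std::string reason;
};

// Hypothetical stand-in for OldThreadPool: schedule() fails once shutdown
// has begun, which is exactly the case the buggy code ignores.
class FakeThreadPool {
public:
    void beginShutdown() { _shuttingDown = true; }

    Status schedule(std::function<void()> task) {
        if (_shuttingDown) {
            return {false, "shutdown in progress"};
        }
        task();  // run inline for simplicity; the real pool uses worker threads
        return {true, ""};
    }

private:
    bool _shuttingDown = false;
};

// Buggy pattern: the Status from schedule() is discarded, so batches
// scheduled during shutdown are silently dropped yet still counted as applied.
void applyBatchIgnoringErrors(FakeThreadPool& pool, const std::vector<int>& ops, int& applied) {
    for (int op : ops) {
        pool.schedule([&applied, op] { applied += op; });  // Status ignored
    }
}

// Corrected pattern: surface the failure so the caller does not advance
// minValid past operations that were never applied.
bool applyBatchCheckingErrors(FakeThreadPool& pool, const std::vector<int>& ops, int& applied) {
    for (int op : ops) {
        Status s = pool.schedule([&applied, op] { applied += op; });
        if (!s.ok) {
            return false;  // batch must be treated as incomplete
        }
    }
    return true;
}
```

With the buggy variant, a pool that is already shutting down applies nothing, yet the caller has no way to know; the checking variant returns false in the same situation.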
If the shutdown happens after the operations have been scheduled, the secondary runs into another fassert, which is also unexpected. Restarting cannot repair the inconsistent state either. This has also been observed in repeated runs of backup_restore.js.
As a result, any kind of operations may be marked executed by mistake when shutting down the secondary, including commands and database operations, leading to an inconsistent state with the primary and potential missing/stale documents on secondaries.
To fix this issue, after the on_block_exit of the join call we need to check whether shutdown has occurred and, if so, return an empty OpTime to indicate that the batch is not complete.
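The proposed fix can be sketched as follows. The `OpTime`, `inShutdown()`, and `multiApplySketch` names below are simplified stand-ins for the real mongo::repl types and the actual change in sync_tail.cpp, shown only to illustrate the control flow: after joining the writer pool, a shutdown in progress yields a null OpTime so the caller never advances minValid past a possibly-unapplied batch.

```cpp
// Hypothetical stand-in for mongo::repl::OpTime.
struct OpTime {
    long long ts = 0;
    bool isNull() const { return ts == 0; }
};

// Hypothetical stand-in for the global shutdown check.
bool g_inShutdown = false;
bool inShutdown() { return g_inShutdown; }

// Sketch of the corrected multiApply(): if shutdown began while the batch
// was being applied, return an empty OpTime so the caller treats the batch
// as incomplete instead of recording it as successfully applied.
OpTime multiApplySketch(OpTime batchEnd) {
    // ... schedule work to the writer pool and join the workers here ...
    if (inShutdown()) {
        return OpTime{};  // empty optime: batch did not complete
    }
    return batchEnd;  // safe to advance minValid to the end of the batch
}
```

On the caller side, a null return means setMinValid must not be advanced, which is precisely what prevents the inconsistent state described above.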