ISSUE DESCRIPTION AND IMPACT
WiredTiger's rollback to stable (RTS) process runs at startup to remove from page images any writes that occurred after the node's stable timestamp (see the sketch following the list below). Because of this bug in MongoDB 4.4.5, the RTS process can corrupt page metadata, causing documents on affected pages to become invisible to MongoDB. Any startup can trigger the bug, including the initial upgrade to MongoDB 4.4.5. Possible outcomes are:
- Most likely, a fatal error and inability to restart due to duplicate key exceptions.
- Temporary query incorrectness, if a crash does not occur.
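For background, the following minimal standalone program sketches what RTS does at the WiredTiger API level. It is only an illustration: the table name, timestamps, and home directory are arbitrary, and MongoDB drives RTS internally during startup recovery rather than through this standalone path.

/* rts_demo.c: standalone sketch of rollback to stable (RTS).
 * Assumes a recent WiredTiger build; compile along the lines of:
 *   gcc rts_demo.c -lwiredtiger -o rts_demo */
#include <stdio.h>
#include <stdlib.h>
#include <wiredtiger.h>

int
main(void)
{
    WT_CONNECTION *conn;
    WT_CURSOR *cursor;
    WT_SESSION *session;
    int ret;

    /* Open a connection; "WT_HOME" must be an existing directory. */
    if ((ret = wiredtiger_open("WT_HOME", NULL, "create", &conn)) != 0) {
        fprintf(stderr, "wiredtiger_open: %s\n", wiredtiger_strerror(ret));
        return EXIT_FAILURE;
    }
    conn->open_session(conn, NULL, NULL, &session);
    session->create(session, "table:demo", "key_format=S,value_format=S");

    /* Commit a write at timestamp 0x14 = 20 (timestamps are hex strings). */
    session->open_cursor(session, "table:demo", NULL, NULL, &cursor);
    session->begin_transaction(session, NULL);
    cursor->set_key(cursor, "doc1");
    cursor->set_value(cursor, "v1");
    cursor->insert(cursor);
    session->commit_transaction(session, "commit_timestamp=14");
    cursor->close(cursor);

    /* Declare the stable timestamp to be 0xa = 10: the write above is
     * newer than stable. */
    conn->set_timestamp(conn, "stable_timestamp=a");

    /* RTS removes writes that occurred after the stable timestamp. */
    conn->rollback_to_stable(conn, NULL);

    /* The key committed at timestamp 20 is gone. */
    session->open_cursor(session, "table:demo", NULL, NULL, &cursor);
    cursor->set_key(cursor, "doc1");
    ret = cursor->search(cursor);
    printf("after RTS, search(doc1) -> %s\n",
        ret == WT_NOTFOUND ? "WT_NOTFOUND (rolled back)" : "found");

    conn->close(conn, NULL);
    return EXIT_SUCCESS;
}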
Fatal error and inability to restart (most likely)
A fatal error and crash occur during replication oplog recovery at startup, or immediately after the node enters the secondary state.
Operations stored in the replication oplog that are re-applied to affected pages tend to be incompatible with the current state of the data. For example, an update to an invisible document is retried as an upsert, which collides with the invisible document's key in the unique _id index and leads to a duplicate key exception.
Temporary query incorrectness
If a crash does not occur, documents on affected pages remain invisible temporarily. Normally the set of potentially impacted page images is limited to pages that were evicted from memory just before the final checkpoint prior to the shutdown, but a lagging majority commit point across the cluster can widen this set.
Importantly: Depending on how the application responds to missing documents, any query correctness issue can lead to logical data corruption.
It is probable that no user intervention is required: affected pages are eventually evicted from memory and reloaded, which corrects the issue.
DIAGNOSIS AND AFFECTED VERSIONS
This bug affects MongoDB version 4.4.5 only.
Any nodes running MongoDB version 4.4.5 can be affected on any restart. Impacted nodes are most likely to crash with a "Caught exception during replication recovery" message on startup, as in the following:
{"t":{...},"s":"F","c":"REPL","id":21570,"ctx":"initandlisten","msg":"Caught exception during replication recovery","attr": {"error":{"code":11000,"codeName":"DuplicateKey","errmsg":"E11000 duplicate key error collection: ... index: _id_ ...","keyPattern":{"_id":1},"keyValue":{...}}}} {"t":{...},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"initandlisten","msg":"Writing fatal message","attr":{"message":"terminate() called. An exception is active; attempting to gather more information"}} {"t":{...},"s":"F", "c":"CONTROL", "id":4757800,"ctx":"initandlisten","msg":"Writing fatal message","attr":{"message":"DBException::toString(): DuplicateKey ..."}}
It is also possible for an impacted node to crash with a "Writer worker caught exception" message after entering the secondary state, such as in the following:
{"t":{"$date":"2021-04-15T17:07:56.402+00:00"},"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-6","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 } ...","oplogEntry":{...}}}
If a node does start successfully, user applications may encounter errors caused by missing documents, and the node may log non-fatal errors such as "Erroneous index key found with reference to non-existent record id" when the missing documents are accessed.
REMEDIATION AND WORKAROUNDS
The fix is included in the 4.4.6 production release. If a node has crashed and cannot be restarted without error, the most straightforward remediation is to restart the node on 4.4.6 and upgrade the rest of the cluster.
To remediate the issue while remaining on 4.4.5, re-sync the impacted node.
Importantly, restarting a node on 4.4.5 does not remediate query correctness issues, because any restart can trigger the bug again.
Original description
WiredTiger transaction ids persisted to disk should be reset to 0 after a database restart.
This relies on WiredTiger checking a page's write generation against the connection-level base write generation: if the page's write generation is smaller, we should clear the transaction ids on that page.
We only update the connection-level base write generation after rollback to stable has completed, so that transaction ids have not yet been cleared while rollback to stable is running.
The issue is that if we create a new disk image during rollback to stable without writing it to disk (e.g., update restore eviction or an in-memory page split), we neither clear the transaction ids on that page nor update its page write generation (the page write generation is initialized to 0).
Since the page is still in memory, after we have updated the connection-level base write generation, reading this page again will not clear its transaction ids because its page write generation is 0, which causes a data consistency problem.
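A minimal, self-contained model of this mechanism is sketched below in plain C. All names are illustrative, the real WiredTiger code paths are considerably more involved, and the zero-write-generation guard in the cleanup check is an assumption consistent with the description above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for a disk image: its write generation plus the
 * transaction ids stamped on its updates. */
struct disk_image {
    uint64_t write_gen;  /* initialized to 0 for images newly built in memory */
    uint64_t txn_ids[4]; /* stale after a restart; must be cleared on read */
};

/* Cleanup rule as described above (assumed form): an image is treated as
 * pre-restart, and its transaction ids cleared, only when its write
 * generation is non-zero and below the connection base write generation. */
static void
maybe_clear_txn_ids(struct disk_image *dsk, uint64_t base_write_gen)
{
    if (dsk->write_gen != 0 && dsk->write_gen < base_write_gen)
        memset(dsk->txn_ids, 0, sizeof(dsk->txn_ids));
}

int
main(void)
{
    /* An image written before the restart: write generation 7, carrying
     * transaction ids from the previous run. */
    struct disk_image on_disk = {.write_gen = 7, .txn_ids = {100, 101, 0, 0}};

    /* During RTS, update restore eviction or an in-memory split builds a new
     * image from it. The bug: the new image keeps the stale transaction ids,
     * and its write generation is left at its initial value of 0. */
    struct disk_image rebuilt = on_disk;
    rebuilt.write_gen = 0;

    /* After RTS completes, the connection base write generation advances.
     * Re-reading the rebuilt in-memory image now skips the cleanup, since a
     * write generation of 0 looks like "created in this run"... */
    maybe_clear_txn_ids(&rebuilt, 8 /* base write generation */);

    /* ...so stale pre-restart transaction ids survive and get compared
     * against the new run's transaction id space. */
    printf("rebuilt image txn_ids[0] = %llu (expected 0 after cleanup)\n",
        (unsigned long long)rebuilt.txn_ids[0]);
    return 0;
}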
Issue links
- causes:
  - WT-7481 Fix the wrong assert of disk image write gen comparison with btree base write gen (Closed)
- is caused by:
  - WT-6673 RTS fix inconsistent checkpoint by removing updates outside of the checkpoint snapshot (Closed)
- is depended on by:
  - SERVER-54301 Add correctness tests of recovery using history store (Closed)
- is related to:
  - SERVER-56154 Invariant failure in !needsRenaming || allowRenameOutOfTheWay (Closed)
  - SERVER-56463 MongoDB cannot start after stop and reboot host (Closed)
- related to:
  - WT-13349 Improve testing for history store crash recovery (Open)
  - WT-11168 Remove the page image reuse logic (Closed)