ISSUE DESCRIPTION AND AFFECTED VERSIONS
This issue in MongoDB 4.4.8 causes a checkpoint thread to read and persist an incomplete version of data to disk. Data in memory remains correct, so the problem surfaces only if the server crashes or shuts down uncleanly: the inconsistent checkpoint is then used for recovery and introduces corruption.
The bug is triggered on cache pages that receive an update while a checkpoint is running and are evicted during that same checkpoint.
DIAGNOSIS AND IMPACT
MongoDB 4.4.8 is affected. The issue is fixed in version 4.4.9.
The bug can cause a Duplicate Key error on startup and prevent the node from starting.
The validate command reveals impact by reporting on the inconsistencies created between documents and indexes, in the form of:
- extra index entries (including duplicate entries in unique indexes)
- missing index entries
After an unclean shutdown, inconsistent writes can lead to the inability to restart an impacted node due to a Duplicate Key error during startup. However, nodes can also start successfully and still be impacted.
If a node starts successfully, it may still have been impacted by:
- Data inconsistency within documents - specific field values may not correctly reflect writes that were acknowledged to the application prior to the unclean shutdown time.
- Incomplete query results - lost or inaccurate index entries may cause incomplete query results for queries that use impacted indexes.
- Missing documents - documents may be lost on impacted nodes.
REMEDIATION AND WORKAROUNDS
First, upgrade to a fixed version (MongoDB 4.4.9). Impact can be remediated on earlier versions but could re-occur.
Then, run the validate command on each collection on each node of your replica set.
If validate reports any failures, resync the impacted node from an unaffected node. If an unaffected node cannot be readily identified, these scripts can assist with remediating this bug.
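The validation step above can be sketched in Python. This is a minimal sketch, not an official remediation script: it assumes a pymongo-style database handle (`list_collection_names` and `command`), and it assumes the validate result exposes the `valid`, `errors`, `extraIndexEntries` and `missingIndexEntries` fields. The helper name `validate_all_collections` and the keys of the returned summaries are hypothetical.

```python
from typing import Any, Dict, List


def validate_all_collections(db: Any, full: bool = True) -> List[Dict[str, Any]]:
    """Run validate on every collection in `db`; return a summary per failing collection.

    `db` is assumed to behave like a pymongo Database object.
    """
    failures: List[Dict[str, Any]] = []
    # Restrict to real collections; validate does not apply to views.
    for name in db.list_collection_names(filter={"type": "collection"}):
        result = db.command("validate", name, full=full)
        # Field names below are assumptions about the validate output;
        # .get() keeps the sketch tolerant of missing fields.
        extra = result.get("extraIndexEntries", [])
        missing = result.get("missingIndexEntries", [])
        if not result.get("valid", False) or extra or missing:
            failures.append(
                {
                    "collection": name,
                    "errors": list(result.get("errors", [])),
                    "extra_index_entries": len(extra),
                    "missing_index_entries": len(missing),
                }
            )
    return failures
```

To check one node at a time, connect directly to that node (for example with pymongo's `MongoClient` and `directConnection=True`) and call the helper once per database. An empty list means validate reported no failures for that database on that node. A non-empty list marks the node as a candidate for resync.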
Original description
I’ve been working backwards from checkpoint skipping a page it shouldn’t when running the test case in WT-7958. Here is what I am seeing:
- page P exists on disk with address A and is clean
- checkpoint starts running
- page P is modified, setting first_dirty_txn ahead of the checkpoint
- eviction chooses P to evict (in some tree ahead of the checkpoint)
- eviction reconciles P
- the main part of reconciliation succeeds but __rec_hs_wrapup fails with EBUSY (there are various checks in __wt_hs_insert_updates when checkpoint_running == true, I’m not sure exactly which one is failing)
- at this point, ref->addr == NULL && mod->rec_result == 0 and the block for A has been freed, the page is dirty but first_dirty_txn is ahead of the checkpoint
- checkpoint skips writing P, and when it writes P’s parent, it considers P, sees the missing address and takes the WT_CHILD_IGNORE path — i.e., nothing is written and the original content of P (from step 1) is missing from the checkpoint
Note that nothing is lost in memory, so the next checkpoint (including a clean shutdown) will write P and fill in the hole.
It looks like reordering __rec_write_wrapup to call __rec_hs_wrapup before it clears out the address will fix this, I’m just checking if there are any problems with doing that.
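The sequence above, and the effect of the proposed reordering, can be illustrated with a toy model. This is a sketch under stated assumptions, not WiredTiger code: `Page`, `hs_wrapup`, `reconcile_wrapup`, `checkpoint_child_state` and `run_scenario` are hypothetical stand-ins that only mirror the fields and steps named in the description (`ref->addr`, `mod->rec_result`, `first_dirty_txn`, `__rec_hs_wrapup`, `WT_CHILD_IGNORE`).

```python
from dataclasses import dataclass
from typing import Optional

EBUSY = 16  # stand-in for the error the history store wrapup returns


@dataclass
class Page:
    """Toy page reference; not a WiredTiger structure."""
    addr: Optional[str]       # on-disk address, like ref->addr
    rec_result: int = 0       # like mod->rec_result; 0 means no reconciliation result
    first_dirty_txn: int = 0
    dirty: bool = False


def hs_wrapup(checkpoint_running: bool) -> int:
    """Stand-in for __rec_hs_wrapup: assumed to fail while a checkpoint runs."""
    return EBUSY if checkpoint_running else 0


def reconcile_wrapup(page: Page, checkpoint_running: bool, hs_first: bool) -> int:
    """Model the wrapup step with either the original or the reordered sequence."""
    if hs_first:
        # Proposed order: a failure here leaves the old address untouched.
        ret = hs_wrapup(checkpoint_running)
        if ret != 0:
            return ret
        page.addr = None  # free the old block only after the wrapup succeeded
    else:
        # Original order: the address is cleared before the wrapup can fail.
        page.addr = None
        ret = hs_wrapup(checkpoint_running)
        if ret != 0:
            return ret
    page.rec_result = 1
    return 0


def checkpoint_child_state(page: Page, checkpoint_txn: int) -> str:
    """What the checkpoint does for this child when writing its parent."""
    if page.dirty and page.first_dirty_txn <= checkpoint_txn:
        return "write"  # the checkpoint reconciles the page itself
    if page.addr is not None:
        return "reference:" + page.addr
    if page.rec_result != 0:
        return "reference:replacement"
    return "WT_CHILD_IGNORE"  # nothing is written for this child


def run_scenario(hs_first: bool) -> str:
    checkpoint_txn = 10
    page = Page(addr="A")                 # clean page on disk at address A
    page.dirty = True                     # modified after the checkpoint started
    page.first_dirty_txn = checkpoint_txn + 1
    reconcile_wrapup(page, checkpoint_running=True, hs_first=hs_first)
    return checkpoint_child_state(page, checkpoint_txn)
```

In this model, `run_scenario(hs_first=False)` ends on the `WT_CHILD_IGNORE` path, which matches the hole described above. `run_scenario(hs_first=True)` leaves the parent referencing address A, so the original content of P stays in the checkpoint.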
- is duplicated by: SERVER-60371 Fatal assertion - msgid 34437 - DuplicateKey (Closed)
- is related to: SERVER-60371 Fatal assertion - msgid 34437 - DuplicateKey (Closed); WT-7958 Include recovery in test/checkpoint (Closed)
- related to: WT-7958 Include recovery in test/checkpoint (Closed)