ISSUE DESCRIPTION
This issue may, in rare instances, cause documents or index entries to be stored out of order in a collection or index in MongoDB versions 4.2.0-4.2.23, 4.4.0-4.4.18, 5.0.0-5.0.14, 6.0.0-6.0.4 and rapid release versions 6.1.0-6.2.0. Once an entry is out of order, subsequent operations can make incorrect decisions that lead to further data corruption.
The issue is in the insert skip list data structure that stores newly inserted values before they are written to disk, and results in a value being inserted in the wrong position within a sorted set. The wrongly sorted values will usually be persisted to disk because the content of the data structure is directly translated to the on-disk format without re-checking for correct ordering in release builds.
The insert skip list data structure expects concurrent access and is implemented without locking. The issue was related to the search implementation in the skip list, which is used to find the location to insert a new entry and look for existing content. Searching in a skip list involves moving through a sequence of comparisons. It is necessary that each movement compares against something as new or newer than the prior comparison. WiredTiger did not explicitly note that requirement in the search code. We found via internal testing that modern ARM CPUs were speculatively executing the comparisons in a way that caused a violation of the ordering requirement.
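The ordering requirement can be illustrated with a minimal sketch. This is not WiredTiger's code: the names node_t, skiplist_t, slot, skiplist_search and skiplist_insert are hypothetical, C11 acquire loads stand in for the read barriers mentioned in the related tickets, and the insert assumes a single writer to keep the example short. Each forward pointer is read with an acquire load, so a later read in the search cannot be satisfied before an earlier one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 4 /* hypothetical; chosen only to keep the sketch small */

/* Hypothetical node and list types, not WiredTiger's own structures. */
typedef struct node {
    const char *key;
    _Atomic(struct node *) next[MAX_DEPTH];
} node_t;

typedef struct {
    _Atomic(node_t *) head[MAX_DEPTH];
} skiplist_t;

static void
skiplist_init(skiplist_t *sl)
{
    for (int lvl = 0; lvl < MAX_DEPTH; ++lvl)
        atomic_init(&sl->head[lvl], (node_t *)NULL);
}

/* The forward pointer leaving prev at a level; prev == NULL means the list head. */
static _Atomic(node_t *) *
slot(skiplist_t *sl, node_t *prev, int lvl)
{
    return prev == NULL ? &sl->head[lvl] : &prev->next[lvl];
}

/*
 * Return the first node whose key is >= key, or NULL if there is none.
 * If preds is non-NULL, record the predecessor at every level.
 * The acquire load orders each pointer read before all later reads.
 */
static node_t *
skiplist_search(skiplist_t *sl, const char *key, node_t **preds)
{
    node_t *prev = NULL, *cur = NULL;

    for (int lvl = MAX_DEPTH - 1; lvl >= 0; --lvl) {
        for (;;) {
            cur = atomic_load_explicit(slot(sl, prev, lvl), memory_order_acquire);
            if (cur == NULL || strcmp(cur->key, key) >= 0)
                break;  /* move down a level */
            prev = cur; /* move right */
        }
        if (preds != NULL)
            preds[lvl] = prev;
    }
    return cur;
}

/* Single-writer insert (an assumption of this sketch); height is chosen by the caller. */
static void
skiplist_insert(skiplist_t *sl, node_t *n, const char *key, int height)
{
    node_t *preds[MAX_DEPTH];

    assert(height >= 1 && height <= MAX_DEPTH);
    n->key = key;
    (void)skiplist_search(sl, key, preds);

    /* Fill in the new node completely before any other node points at it. */
    for (int lvl = 0; lvl < MAX_DEPTH; ++lvl) {
        node_t *succ = lvl < height ?
          atomic_load_explicit(slot(sl, preds[lvl], lvl), memory_order_relaxed) : NULL;
        atomic_init(&n->next[lvl], succ);
    }
    /* Publish bottom-up; the release store pairs with the acquire load in the search. */
    for (int lvl = 0; lvl < height; ++lvl)
        atomic_store_explicit(slot(sl, preds[lvl], lvl), n, memory_order_release);
}
```

With plain loads in place of the acquire loads, a weakly ordered CPU would be free to satisfy a later pointer read before an earlier one, which is the kind of reordering described above.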
The skip list implementation has an optimization that avoids re-doing comparisons against a common prefix based on the result of prior comparisons. If the out-of-order comparison causes a comparison to an older entry than expected and that entry happened to have a matching suffix but a different prefix, the search could return the wrong result. If the search is in preparation for an insert, that result would be a location further into the skip list than expected, leading to out-of-order insertion.
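A small sketch of why prefix skipping depends on the ordering requirement. The function compare_skip and its signature are hypothetical, written for illustration rather than taken from WiredTiger. It compares two keys starting at a byte offset that earlier comparisons are assumed to have already matched.

```c
#include <stddef.h>

/*
 * Compare a and b lexicographically, skipping the first *matchp bytes on the
 * assumption that earlier comparisons proved them equal. On return, *matchp
 * holds the offset of the first differing byte, or the length of the shorter
 * key if none differs. Returns <0, 0 or >0 like strcmp.
 */
static int
compare_skip(const char *a, size_t alen, const char *b, size_t blen, size_t *matchp)
{
    size_t len = alen < blen ? alen : blen;

    for (size_t i = *matchp; i < len; ++i)
        if (a[i] != b[i]) {
            *matchp = i;
            return (unsigned char)a[i] < (unsigned char)b[i] ? -1 : 1;
        }
    *matchp = len;
    return alen == blen ? 0 : (alen < blen ? -1 : 1);
}
```

Suppose a search for "abc5" has already compared against "abc1" and "abc9", so three bytes are known to match every entry between those bounds. Comparing against "abc7" with a skip of 3 correctly reports that "abc5" sorts first. If a reordered read instead yields an entry outside those bounds, such as "abd2", the same skip compares only '5' against '2' and reports that "abc5" sorts after "abd2", which is wrong; a full comparison from offset 0 gives the right answer.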
Any user who has ever run MongoDB on a system architecture with weak memory ordering (ARM64, POWER) is susceptible to the issue.
DIAGNOSIS AND IMPACT
In builds of WiredTiger with HAVE_DIAGNOSTIC defined, this would be detected when the skip list is being written to disk, since additional order checking is done in diagnostic mode. This is how the issue was originally identified via internal test programs.
If using WiredTiger without diagnostic checking, the issue could be identified by running the WT_SESSION::verify command.
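The kind of order check that diagnostic mode performs can be sketched as below. This is an illustration only: first_out_of_order is a hypothetical helper, and it assumes keys are NUL-terminated strings compared bytewise, which is a simplification of how WiredTiger compares keys.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the index of the first key that is not strictly greater than its
 * predecessor, or n if the whole array is in strictly ascending order.
 */
static size_t
first_out_of_order(const char *const *keys, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (strcmp(keys[i - 1], keys[i]) >= 0)
            return i;
    return n;
}
```

For the keys "0720", "0734", "0725" the function returns 2, flagging the third key as sorted incorrectly relative to the second.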
For affected MongoDB users, the bug can manifest on uncorrupted data when memory reordering causes an operation on the WiredTiger skip list to read an inconsistent state, corresponding to partially observing a concurrent insert. This can cause out-of-order keys to be persisted to the record store or to an index.
Once out-of-order keys are first persisted, the following diagnostic signatures are most likely:
- If a document insert introduced out-of-order keys to the record store:
  - validate() reports "DataCorruptionDetected: WT_Cursor::next – returned out-of-order keys" in the "errors" field, or "structural damage"
- If a document insert or update introduced out-of-order keys to an index:
  - Incorrectly sorted query results for queries using the affected index(es)
  - validate() reports "index '<index_name>' is not in strictly ascending or descending order" in the "errors" field, or "structural damage"
However, once these out-of-order keys are persisted, there is a higher chance of observing any of the following issues:
- Index inconsistencies, or incomplete index builds involving a previously affected collection.
- Improperly-sorted or incomplete query results on affected nodes due to index key inconsistencies in affected index(es).
- Write conflict exceptions and indefinite operation hangs when out-of-order documents or index entries violate operation constraints.
- Replica set inconsistencies between nodes if the source of an initial sync operation is impacted and the clone phase is impacted by incomplete query results.
- Missing, or out-of-order events in change streams.
- False positives for unique index key collisions (E11000) against affected unique indexes.
- Logical inconsistencies introduced by an application in response to improperly-sorted or incomplete query results.
If a collection or index is impacted, the validate command (with the full option) includes any of the following failures:
- Incorrectly sorted keys, suggested by errors like:
  - "Structural damage"
  - "Out-of-order keys"
  - "Index … is not in strictly ascending or descending order"
- Index key inconsistencies, suggested by errors like:
  - "Extra index entries"
  - "Missing index entries"
  - "Unique index … has duplicate key"
The issue may also emit any of these messages in the mongod log:
- WTCursor::next – next was not greater than last
- WT_Cursor::next – returned out-of-order keys
- Erroneous index key found
REMEDIATION AND WORKAROUNDS
This issue is only possible on hardware that provides weak memory ordering, such as ARM and POWER. Note that x86 is not susceptible to this issue.
For MongoDB users, the remediation is to first upgrade to MongoDB version 4.2.24, 4.4.19, 5.0.15, 6.0.5, or rapid release version 6.2.1. Once this upgrade is completed, perform the following steps:
- Review the validate command documentation for important considerations.
- Perform rolling maintenance on each replica set node. On each offline node, run:
  - validate({full: true}) on each collection.
    - The validate.js script is available to run validate iteratively on multiple databases/collections.
- After validating all nodes, if any node has failed validation, follow this replica set consistency checking process to identify or rule out replica set inconsistencies between nodes of the cluster.
Original description
A WiredTiger split-stress job failed with the following error running on ARM64:
[2023/01/04 13:43:58.030] [1672839838:30196][13147:0xffff963cd100], file:test1, eviction-server: [WT_VERB_DEFAULT][ERROR]: __verify_row_key_order_check, 370: the 31 and 33 keys on page at [write-check] are incorrectly sorted: 000000000000000000000000000000000000000000000000000000043568734\00, 000000000000000000000000000000000000000000000000000000043568725\00
[2023/01/04 13:43:58.030] [1672839838:30429][13147:0xffff963cd100], eviction-server: [WT_VERB_DEFAULT][ERROR]: __wt_evict_thread_run, 324: cache eviction thread error: WT_ERROR: non-specific WiredTiger error
[2023/01/04 13:43:58.030] [1672839838:30447][13147:0xffff963cd100], eviction-server: [WT_VERB_DEFAULT][ERROR]: __wt_evict_thread_run, 324: the process must exit and restart: WT_PANIC: WiredTiger library panic
[2023/01/04 13:43:58.030] [1672839838:30455][13147:0xffff963cd100], eviction-server: [WT_VERB_DEFAULT][ERROR]: __wt_abort, 28: aborting WiredTiger library
[2023/01/04 13:43:58.135] Aborted (core dumped)
That failure is similar to the one encountered in WT-9058, which we eventually set aside because we could not reproduce it.
The patch build was running with some additional diagnostic code to trigger a suspected failure related to coordinating exclusive access to pages. The diagnostic code involved adding information to the session handle to track historical events; barriers were used to try to ensure the code was executed as expected. A patch build running with those changes encountered the above error.
- causes
  - WT-10624 Fix regression on x86 for search and insert (Closed)
- is related to
  - WT-10568 Fix missing read barriers in column store insert list search (Closed)
  - WT-9764 Enhance diagnostics when out of order keys are detected (Closed)
  - WT-4909 Fix potential code caching/re-ordering issue in __search_insert_append (Closed)
- related to
  - WT-10584 Add missing read barriers in __cursor_skip_prev (Closed)
  - WT-10544 Review potential places that need a barrier for ARM (Closed)
  - WT-10561 Create a standalone reproducer for WT-10461 (Closed)
  - WT-11049 Always check prefix skipped search results in diagnostic builds (Closed)
  - WT-9058 Key out of order detected on arm64 (Closed)
  - WT-11948 Document the intricacies of the skip list ordering bug and its solution (Closed)
- links to