ISSUE SUMMARY
When using the WiredTiger storage engine, a race condition may prevent locally committed documents from being immediately visible to subsequent read operations. This bug can affect both server and application operations. Unless it is exposed by one of the replication problems described below, it is not possible to determine whether a system has been affected by this bug without significant downtime.
USER IMPACT
Normally, after a write is committed by the storage engine, it is immediately visible to subsequent operations. A race condition in WiredTiger may prevent a write from becoming immediately visible, which can result in various problems, primarily impacting replication:
- User writes may not be immediately visible to subsequent read operations
- Replica set members may diverge and contain different data
- Replication thread(s) shut down the server with the error message "Fatal Assertion 16360" due to duplicate _id values (a unique index violation)
Deployments where a WiredTiger node is or was used as a source of data may be affected. This includes:
- replica sets where the primary node is or was running WiredTiger
- replica sets using chained replication where any node may sync from a WiredTiger node
MMAPv1-only deployments are not affected by this issue. Mixed storage engine deployments are not affected when WiredTiger nodes never become primary, or when WiredTiger secondaries are not used as a source for chained replication.
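To help assess whether a deployment falls into one of these categories, the storage engine of each replica set member can be checked via serverStatus. The following mongo shell sketch is illustrative only; the host names come from rs.status() and the loop itself is not part of this report:

    // Illustrative sketch: print the storage engine of every replica set member.
    // Assumes the shell is connected to a member of the replica set and that
    // serverStatus can be run on each member; adjust for authentication as needed.
    rs.status().members.forEach(function(m) {
        var conn = new Mongo(m.name);  // m.name is "host:port"
        var ss = conn.getDB("admin").runCommand({serverStatus: 1});
        print(m.name + " (" + m.stateStr + "): " + ss.storageEngine.name);
    });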
WORKAROUNDS
There are no workarounds for this issue. All MongoDB 3.0 users running the WiredTiger storage engine should upgrade to MongoDB 3.0.8. A 3.0.8-rc0 release candidate containing the fix for this issue is available for download.
Users experiencing the "Fatal Assertion 16360" error may restart the affected node to recover, but the condition may recur, so upgrading to 3.0.8 is strongly recommended.
AFFECTED VERSIONS
MongoDB 3.0.0 through 3.0.7 using the WiredTiger storage engine. MongoDB 3.2.0 is not affected by this issue.
FIX VERSION
The fix is included in the 3.0.8 production release.
Original description
A new test is being introduced into the FSM (concurrency) tests to check the dbHash of the database (and its collections) on all replica set nodes during the following phases of the workload (SERVER-21115):
- Workload completed, before invoking teardown
- Workload completed, after invoking teardown
Before the dbHash is computed, cluster.awaitReplication() is invoked to ensure that all nodes in the replica set have caught up.
During the development of this test, infrequent failures were observed for the remove_and_bulk_insert workload when using the wiredTiger storage engine.
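For illustration, the following is a minimal, hypothetical sketch of the kind of consistency check described above, written in the mongo shell test (jstest) style; it is not the actual SERVER-21115 test code, and the database name and node count are placeholders:

    // Sketch: compare dbHash across all replica set members after a workload.
    var rst = new ReplSetTest({nodes: 3});
    rst.startSet();
    rst.initiate();

    // ... run the workload (e.g. remove_and_bulk_insert) against rst.getPrimary() ...

    rst.awaitReplication();  // ensure all members have caught up before hashing

    var hashes = rst.nodes.map(function(node) {
        return node.getDB("test").runCommand({dbHash: 1}).md5;
    });
    hashes.forEach(function(h) {
        assert.eq(hashes[0], h, "dbHash mismatch between replica set members");
    });

    rst.stopSet();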
Issue links
- duplicates: WT-2237 Make committed changes visible immediately (Closed)
- is depended on by: SERVER-21115 Add dbHash checking to concurrency suite (Closed)
- is duplicated by: SERVER-21778 slave node crash: writer worker caught exception: E11000 duplicate key error (Closed)
- is related to: SERVER-21237 ReplSetTest.prototype.awaitReplication reads directly from the oplog collection causing false positives (Closed)
- is related to: SERVER-21645 WiredTigerRecordStore::temp_cappedTruncateAfter should set _oplog_highestSeen (Closed)
- related to: SERVER-21847 log range of operations read from sync source during replication (Closed)