- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication, Storage
- Fully Compatible
- ALL
- Storage NYC 2018-05-07, Storage NYC 2018-05-21
- 63
It is possible for the listDatabases command to erroneously omit a database if the database contains a single collection and that collection is concurrently renamed. The problem stems from the fact that the listDatabases command takes a GlobalLock in MODE_IS. The renameCollection command acquires a GlobalLock in MODE_IX and a MODE_X database lock on the database on which it is performing the rename. Global locks of type IX and IS do not conflict, so the listDatabases command and renameCollection command are allowed to run concurrently.
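To make the lock interaction concrete, here is a minimal sketch of the standard multi-granularity lock compatibility matrix. The names LockMode and compatible are hypothetical and are not the server's lock manager API; the sketch only illustrates why a MODE_IS holder and a MODE_IX holder can proceed at the same time.

```cpp
#include <cassert>

// Hypothetical stand-in for the lock modes named in this ticket.
enum class LockMode { IS = 0, IX = 1, S = 2, X = 3 };

// Standard multi-granularity compatibility matrix:
// row = mode already held, column = mode being requested.
static const bool kCompatible[4][4] = {
    //          IS     IX     S      X
    /* IS */ {true,  true,  true,  false},
    /* IX */ {true,  true,  false, false},
    /* S  */ {true,  false, true,  false},
    /* X  */ {false, false, false, false},
};

// Returns true if the two modes can be held on the same resource at once.
inline bool compatible(LockMode held, LockMode requested) {
    return kCompatible[static_cast<int>(held)][static_cast<int>(requested)];
}
```

In this model, compatible(LockMode::IS, LockMode::IX) is true, so two global locks in those modes never block each other. The MODE_X database lock held by renameCollection only excludes operations that also request a lock on that same database.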
When the renameCollection command executes a rename within the same database, it calls DatabaseImpl::renameCollection, which in turn calls KVDatabaseCatalogEntryBase::renameCollection. That function removes the entry for the source collection from KVDatabaseCatalogEntryBase::_collections and only afterwards inserts the entry for the destination collection. If the database contained only one collection, KVDatabaseCatalogEntryBase::_collections is empty in the interval between the removal and the insertion. A listDatabases command running during that interval can observe the database in this state and consider it empty, because KVStorageEngine::listDatabases checks whether KVDatabaseCatalogEntryBase::_collections is empty. As a result, the database can be omitted from the output even though it exists.
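The window described above can be modeled deterministically with a toy catalog. This is only a sketch: ToyDatabaseCatalog, renameRemoveThenInsert, and readerSeesEmptyDuringRename are hypothetical names, not the server's classes, and the callback stands in for a concurrent listDatabases reader that happens to run between the removal and the insertion.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>

// Hypothetical toy model of a per-database collection map.
struct ToyDatabaseCatalog {
    std::set<std::string> collections;

    // The emptiness check a listing operation would perform in this model.
    bool isEmpty() const { return collections.empty(); }

    // Same order as described in the ticket: remove the source entry first,
    // then insert the destination entry. `betweenSteps` simulates a reader
    // that runs in the gap between the two steps.
    void renameRemoveThenInsert(const std::string& from,
                                const std::string& to,
                                const std::function<void()>& betweenSteps) {
        collections.erase(from);
        betweenSteps();
        collections.insert(to);
    }
};

// Renames "a" to "b" and reports whether the simulated reader
// observed the database as empty during the rename.
bool readerSeesEmptyDuringRename(std::set<std::string> initial) {
    ToyDatabaseCatalog db;
    db.collections = std::move(initial);
    bool sawEmpty = false;
    db.renameRemoveThenInsert("a", "b", [&] { sawEmpty = db.isEmpty(); });
    return sawEmpty;
}
```

With a single collection, readerSeesEmptyDuringRename({"a"}) returns true. With a second collection present, as in readerSeesEmptyDuringRename({"a", "other"}), it returns false, which matches the observation that only single-collection databases are affected.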
This also matters internally. Initial sync, for example, relies on the listDatabases command returning correct results for its collection cloning process.
Two repros are attached. The first demonstrates how the listDatabases command can produce incorrect results. The second demonstrates how this issue can leave a collection missing on a node following initial sync. Running each test repeatedly for a few runs should produce the respective error case.
- depends on
  - SERVER-34968 Running listDatabases command and renameCollection command concurrently on mobile storage engine can cause WriteConflict errors (Closed)
- related to
  - SERVER-34615 find by UUID can return NamespaceNotFound for a collection that is concurrently renamed (Closed)
  - SERVER-37552 Illegal concurrent access to KVDatabaseCatalogEntryBase::_collections (Closed)