- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication, Storage
- None
- Fully Compatible
- ALL
- v4.0
- Storage NYC 2018-06-04
- 0
Collection drops happen in two writes. The first write renames the collection to a "drop pending" namespace and must be timestamped with the optime of the collection drop. The second write removes the collection from the catalog. It must happen only after the rename has become majority committed, and it must not be timestamped[1].
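The two writes can be illustrated with a small toy model. This is only a sketch, not the server's implementation: `ToyCatalog`, its methods, the integer optimes, and the drop-pending naming scheme are all hypothetical and exist only to show which write carries a timestamp and when the second write is allowed to run.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog; not the real catalog API."""
    # namespace -> name of the backing table
    collections: dict = field(default_factory=dict)
    # (description, timestamp or None) for every catalog write, in order
    writes: list = field(default_factory=list)
    majority_commit_point: int = 0

    def rename_to_drop_pending(self, ns: str, drop_optime: int) -> str:
        """First write: rename, timestamped with the optime of the drop."""
        pending_ns = f"{ns}.drop-pending.{drop_optime}"  # illustrative naming only
        self.collections[pending_ns] = self.collections.pop(ns)
        self.writes.append((f"rename {ns} -> {pending_ns}", drop_optime))
        return pending_ns

    def reap(self, pending_ns: str, drop_optime: int) -> bool:
        """Second write: remove from the catalog, with no timestamp.

        Refuses to run until the rename is majority committed.
        """
        if self.majority_commit_point < drop_optime:
            return False
        del self.collections[pending_ns]
        self.writes.append((f"remove {pending_ns}", None))
        return True


catalog = ToyCatalog(collections={"test.coll": "table-1"})
pending = catalog.rename_to_drop_pending("test.coll", drop_optime=10)

# The rename is not yet majority committed, so the removal is refused.
reaped_early = catalog.reap(pending, drop_optime=10)

# Once the majority commit point reaches the drop optime, the removal runs.
catalog.majority_commit_point = 10
reaped_late = catalog.reap(pending, drop_optime=10)
```

In this model the write log ends up with a timestamped rename followed by an untimestamped removal, which is the shape the description above requires.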
Commands processed via oplog application run inside a timestamp block that sets a timestamp at commit time.
[1] KVStorageEngine::dropDatabase has a mechanism to disable this timestamp for non-replicated collections, but not for replicated collections. The mechanism was introduced back when the goal was still to timestamp collection drops appropriately. However, table drops are not transactional: a dropped table does not come back after a crash, even if the last checkpoint was taken before the table was dropped. If the write that removes a collection from the catalog is timestamped, that write may not have become "stable" by the time a clean shutdown is performed. Because the table is already dropped, the data files on disk would then be inconsistent: the catalog would still list a collection that has no backing table.
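The failure mode in the footnote can be reproduced in a simplified model. The sketch below assumes that a shutdown checkpoint keeps only catalog writes at or before the stable timestamp (untimestamped writes are always kept), and that table drops take effect on disk immediately. The function names and the tuple layout are hypothetical, and the rename step is left out for brevity.

```python
def catalog_after_restart(catalog_writes, stable_timestamp):
    """Return the catalog as a checkpoint at stable_timestamp would persist it.

    catalog_writes is a list of (op, namespace, timestamp or None) tuples.
    """
    catalog = set()
    for op, ns, ts in catalog_writes:
        if ts is not None and ts > stable_timestamp:
            continue  # not stable yet, so absent from the checkpoint
        if op == "create":
            catalog.add(ns)
        elif op == "remove":
            catalog.discard(ns)
    return catalog


def orphaned_entries(catalog_writes, tables_on_disk, stable_timestamp):
    """Catalog entries that no longer have a backing table on disk."""
    return catalog_after_restart(catalog_writes, stable_timestamp) - tables_on_disk


# The table drop is non-transactional: the table is already gone from disk.
tables_on_disk = set()
stable_timestamp = 15

# Catalog removal timestamped at 20, later than the stable timestamp.
timestamped = [("create", "test.coll", 5), ("remove", "test.coll", 20)]
# Catalog removal with no timestamp, so it is always persisted.
untimestamped = [("create", "test.coll", 5), ("remove", "test.coll", None)]

orphans_timestamped = orphaned_entries(timestamped, tables_on_disk, stable_timestamp)
orphans_untimestamped = orphaned_entries(untimestamped, tables_on_disk, stable_timestamp)
```

Under these assumptions, the timestamped removal leaves `test.coll` in the catalog with no table behind it, while the untimestamped removal leaves the catalog and the data files in agreement.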