Concurrent removeShard and movePrimary may result in an undesired deletion of unsharded collections.
Bug description
Imagine the following scenario:
- There are 2 shards: 'shard0', 'shard1'
- Database 'myDB' primary shard is 'shard0'
- Collection 'myDB.collA' is unsharded, so it's located in 'shard0'
At some point, someone concurrently issues these two commands (a minimal reproduction sketch follows this list):
- { removeShard:'shard1' }
- { movePrimary:'myDB', to: 'shard1'}
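As an illustrative sketch only (names taken from the scenario above; both commands are sent through mongos, e.g. from two separate shells):

// Shell 1: start draining 'shard1'.
db.adminCommand({ removeShard: 'shard1' })

// Shell 2, at roughly the same time: move the primary of 'myDB'
// to the shard that is being drained.
db.adminCommand({ movePrimary: 'myDB', to: 'shard1' })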
If the internal execution is interleaved as described below, the cluster ends up with an undesired deletion of all the unsharded collections of 'myDB'.
1. The removeShard command is sent to the config server.
2. On the removeShard thread, the config server checks that the number of databases whose primary shard is 'shard1' is zero. As that is true, the drain process continues.
3. At this point the movePrimary is executed, which moves all the unsharded collections of 'myDB' to 'shard1'.
4. The removeShard commit phase starts and 'shard1' is removed from the cluster topology, so the unsharded collections that were just moved there are lost to the cluster.
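As a hypothetical way to observe the outcome of this interleaving (standard commands; the exact output depends on the server version):

// 'shard1' is no longer part of the topology.
db.adminCommand({ listShards: 1 })

// The unsharded collections of 'myDB' (e.g. 'collA') are no longer visible,
// because their only copy lived on the removed shard.
db.getSiblingDB('myDB').getCollectionNames()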
Small note to better understand step 2: the removeShard command returns a non-completed status if the shard is still the primary shard of one or more databases, and notifies the user that those should be moved explicitly using movePrimary (see the example response below). A better explanation can be found here.
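For reference, that non-completed removeShard response looks roughly like this (field names as documented for the command; the values shown are illustrative):

{
  "msg" : "draining ongoing",
  "state" : "ongoing",
  "remaining" : { "chunks" : NumberLong(0), "dbs" : NumberLong(1) },
  "note" : "you need to drop or movePrimary these databases",
  "dbsToMove" : [ "myDB" ],
  "ok" : 1
}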
Is related to: SERVER-69890 - Concurrent movePrimary and removeShard can move database to a no-longer existent shard (Closed)