As part of SERVER-65016, new code to remove the range deletion documents was added to the existing drop collection code as an optimization. This code uses an alternative client region to remove multiple documents, because shardsvr_drop_collection_participant uses the retryable write machinery for replay protection.
The unintended effect is that a thread dropping a collection first checks out a session and then, while taking the collection lock as part of executing the DBClient command that removes the range deletion documents, tries to acquire the RSTL. If a stepdown sneaks in after the session is checked out, the stepdown thread acquires the RSTL and then tries to check out and kill all running sessions. The drop thread holds the session while waiting for the RSTL, and the stepdown thread holds the RSTL while waiting for the session, causing a deadlock.
In the attached stack trace log, this situation can be seen between Thread 2 and Thread 99. One way to solve this is to create the operation context the same way the rename collection metadata command does: link the new operation context created in the alternative client region to the parent's cancellation token. This way, when the parent operation context is interrupted during the stepdown, the thread waiting for the lock finishes and releases the session, allowing the stepdown thread to check it out.
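The proposed fix can be illustrated with a small simulation. This is only a sketch in Python, not server code: the two locks stand in for the RSTL and the checked-out session, the `threading.Event` stands in for the parent's cancellation token, and every name here (`acquire_or_cancel`, `simulate_drop_during_stepdown`, `Cancelled`) is hypothetical. It assumes that the stepdown interrupts the parent operation after taking the RSTL and that the interruption reaches the nested wait.

```python
import threading


class Cancelled(Exception):
    """Raised when a lock wait is abandoned because the parent was interrupted."""


def acquire_or_cancel(lock, cancel_event, poll_interval=0.01):
    """Acquire `lock`, giving up as soon as `cancel_event` is set."""
    while not cancel_event.is_set():
        if lock.acquire(timeout=poll_interval):
            return
    raise Cancelled()


def simulate_drop_during_stepdown():
    rstl = threading.Lock()               # stand-in for the RSTL
    session = threading.Lock()            # stand-in for the checked-out session
    parent_cancelled = threading.Event()  # stand-in for the parent's cancellation token
    session_checked_out = threading.Event()
    rstl_taken = threading.Event()
    events = []

    def drop_collection():
        with session:                      # check out the session first
            session_checked_out.set()
            rstl_taken.wait()              # force the problematic interleaving
            try:
                # The nested operation waits for the RSTL, but its wait is
                # linked to the parent's cancellation signal.
                acquire_or_cancel(rstl, parent_cancelled)
                rstl.release()
                events.append("drop: removed range deletion documents")
            except Cancelled:
                events.append("drop: interrupted, releasing session")

    def stepdown():
        session_checked_out.wait()
        with rstl:                         # stepdown takes the RSTL
            rstl_taken.set()
            parent_cancelled.set()         # interrupt running operations
            with session:                  # blocks until the drop thread lets go
                events.append("stepdown: checked out session")

    threads = [threading.Thread(target=drop_collection),
               threading.Thread(target=stepdown)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return events, any(t.is_alive() for t in threads)
```

If the wait in `drop_collection` were a plain `rstl.acquire()` that ignored `parent_cancelled`, both threads would block forever, which corresponds to the deadlock described above.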
- is caused by
  - SERVER-65016 Remove range deletions as part of `dropCollection` (Closed)
- related to
  - SERVER-60161 Deadlock between config server stepdown and _configsvrRenameCollectionMetadata command (Closed)