- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Testing Infrastructure
- Fully Compatible
- v4.0, v3.6
- STM 2018-12-31
- 0
- 1
Problem
In the concurrency_simultaneous_replication test suite, we run 10 operations in parallel on the same database, and there is a small chance (e.g. 5% for some workloads) that any given operation is a dropDatabase. On slower build variants, a single dropDatabase command can take multiple minutes to finish if there is heavy activity from other workloads running in parallel.
Our tests will retry an operation for up to 10 minutes if DatabaseDropPending errors are encountered. After seeing the error, a getLastError command is used to wait for the dropDatabase command to be committed.
The order in which getLastError returns to the different workload clients varies, which may cause certain workload clients to be repeatedly stuck behind other clients that are issuing more dropDatabase commands. When this happens, the client receives another DatabaseDropPending error. The client cannot distinguish whether the error is caused by the same dropDatabase command or a new one, so the new wait continues to eat into the same 10 minute timeout. There is a small probability that this cycle happens a handful of times in a row, which, combined with slow multi-minute dropDatabase commands, exceeds the 10 minute timeout.
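The effect of the shared deadline can be illustrated with a small model. This is only a sketch: the function name and the per-drop durations are hypothetical, and the only figure taken from this ticket is the 10 minute retry budget.

```python
RETRY_TIMEOUT_SECS = 10 * 60  # the 10 minute retry budget described above


def waits_before_timeout(drop_durations_secs, timeout_secs=RETRY_TIMEOUT_SECS):
    """Model one client that keeps hitting DatabaseDropPending.

    Each entry in drop_durations_secs is the (hypothetical) time spent
    waiting for one dropDatabase to commit. The deadline is shared across
    all waits: hitting a *new* dropDatabase does not reset it.

    Returns the number of waits that completed before the budget ran out,
    or None if every wait finished within the budget.
    """
    elapsed = 0
    for completed, duration in enumerate(drop_durations_secs):
        elapsed += duration
        if elapsed > timeout_secs:
            return completed  # the budget was exhausted during this wait
    return None


if __name__ == "__main__":
    # Three 3-minute drops in a row fit in the budget; a fourth does not.
    print(waits_before_timeout([180, 180, 180]))
    print(waits_before_timeout([180, 180, 180, 180]))
```

With these made-up durations, no single wait is close to the limit, yet four back-to-back waits still exhaust it, which is the failure mode described above.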
Solution
The solution is to avoid retrying a dropDatabase command when it returns a DatabaseDropPending error. This will cause the workload to transition to a new state, and to keep doing so until the new state is no longer a dropDatabase call. At that point it will wait on the ongoing dropDatabase call.
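A minimal sketch of the proposed retry policy, written in Python for illustration only. The names CommandError, run_state, run_command and wait_for_drop are hypothetical and do not correspond to the actual test harness; the existing 10 minute deadline on the retry loop is omitted for brevity.

```python
DATABASE_DROP_PENDING = "DatabaseDropPending"


class CommandError(Exception):
    """Hypothetical stand-in for a server command failure."""

    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name


def run_state(state_name, run_command, wait_for_drop):
    """Run one workload state under the proposed policy.

    run_command(state_name) executes the operation and raises CommandError
    on failure. wait_for_drop() blocks until the in-progress dropDatabase
    has committed (the getLastError wait described in the Problem section).

    Returns "ok" if the operation ran, or "skipped" if it was a
    dropDatabase that hit DatabaseDropPending and was not retried.
    """
    while True:
        try:
            run_command(state_name)
            return "ok"
        except CommandError as err:
            if err.code_name != DATABASE_DROP_PENDING:
                raise  # unrelated failures are not handled here
            if state_name == "dropDatabase":
                # Proposed change: do not retry. The caller transitions
                # to the next state instead.
                return "skipped"
            # Any other operation waits for the ongoing drop, then retries.
            wait_for_drop()
```

The design point is that only non-drop operations ever wait, so a client that is blocked on a pending drop cannot itself queue another drop behind it.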
When the database is finally dropped, it's guaranteed that none of the clients waiting on it is trying to run another dropDatabase, so they should all be able to proceed. There might be edge cases where one client is able to execute multiple commands and one of those commands is another dropDatabase, but the likelihood of this happening 5 times in a row should be much smaller, if not negligible.
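As a rough order-of-magnitude check on "5 times in a row", the sketch below assumes each occurrence is independent and happens with the example 5% dropDatabase probability quoted in the Problem section. Both assumptions are simplifications; the real edge case also depends on timing between clients, so this is not a measured rate.

```python
def prob_consecutive(p_single, repeats):
    """Probability that an event with probability p_single occurs
    `repeats` times in a row, assuming the occurrences are independent."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability in [0, 1]")
    if repeats < 0:
        raise ValueError("repeats must be non-negative")
    return p_single ** repeats


if __name__ == "__main__":
    # 5% is the example per-operation dropDatabase chance from this ticket.
    print(f"{prob_consecutive(0.05, 5):.3e}")
```

Under these assumptions the chance is on the order of a few in ten million per sequence of five, which is consistent with calling it negligible.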
From a correctness perspective, this change will implicitly turn some dropDatabase commands into no-ops. This should not cause a loss of test coverage, as databases can't be dropped in parallel in the first place. The tests that run parallel dropDatabase commands are also all randomized tests and don't expect these operations to all succeed when there are parallel clients operating on the same database.
We should also write a dedicated regression test that does a high number of collection DDL operations while dropping and creating databases, to simulate the timeout failures we've seen. The changes from this ticket should prevent the test from failing.
The new test and the change to not retry dropDatabase should be limited to the concurrency_simultaneous_replication suite, as we have not seen this failure elsewhere so far.