First, the refreshFilteringMetadataUntilSuccess loop (at least) is racy when it is used to test a stepdown while the thread hangs in the loop's failpoint, because the failpoint puts the loop into an interruptible sleep. The sleep is interruptible because an OperationContext is passed to it. Since that OperationContext was used to take strong locks as part of forceShardFilteringMetadataRefresh (all the AutoGetDb/AutoGetCollection in here), the stepdown interrupts the OperationContext, and the loop immediately enters the catch block, even before the failpoint is turned off. The race is that the stepdown may not have updated the memberState and term yet, so this assertion passes and the loop starts a second iteration rather than failing on the first iteration.
If the above race happens and the loop starts a second iteration, the migration_coordinator_failover.js test "accidentally" passes when the same node is elected primary. It passes only because of a bug in the ShardServerCatalogCacheLoader (SERVER-45646): the bug causes the forceShardFilteringMetadataRefresh in the second iteration of the loop to throw NetworkInterfaceExceededTimeLimit, so the catch block is entered again and checks the assertion again, this time after the member state has been updated.
We can fix this by eliminating the race described above: make the failpoint use an uninterruptible sleep when it is used to pause the thread in order to induce a stepdown.
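To illustrate the ordering problem, here is a minimal standalone sketch. It does not use the server's OperationContext or failpoint machinery; SimState, its fields, and catchBlockSeesStepdown are hypothetical names invented for this illustration, and the 100 ms gap is an arbitrary stand-in for the window between the stepdown interrupting the operation and publishing the new member state.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the state a stepdown touches (not server code).
struct SimState {
    std::mutex mutex;
    std::condition_variable cv;
    bool opInterrupted = false;       // stepdown has interrupted the operation
    bool memberStateUpdated = false;  // stepdown has published member state and term
    bool failpointOff = false;        // the test has turned the failpoint off
};

// Simulates a thread paused at the failpoint while a stepdown runs.
// Returns what the catch block would observe on waking: true if the
// member state has already been updated, false otherwise.
bool catchBlockSeesStepdown(bool interruptibleSleep) {
    SimState s;

    std::thread stepdown([&s] {
        {
            std::lock_guard<std::mutex> lk(s.mutex);
            s.opInterrupted = true;  // step 1: interrupt the operation
        }
        s.cv.notify_all();
        // Assumed window in which the member state is not yet updated.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        {
            std::lock_guard<std::mutex> lk(s.mutex);
            s.memberStateUpdated = true;  // step 2: publish the new state
            s.failpointOff = true;        // step 3: the test disables the failpoint
        }
        s.cv.notify_all();
    });

    bool observed = false;
    {
        std::unique_lock<std::mutex> lk(s.mutex);
        // An interruptible pause wakes on the interrupt; an uninterruptible
        // pause wakes only once the failpoint has been turned off.
        s.cv.wait(lk, [&] {
            return s.failpointOff || (interruptibleSleep && s.opInterrupted);
        });
        observed = s.memberStateUpdated;
    }
    stepdown.join();
    return observed;
}
```

Calling catchBlockSeesStepdown(false) always returns true in this model, because the pause ends only after the state has been published. Calling catchBlockSeesStepdown(true) normally returns false, because the thread wakes inside the window; that is the situation in which the assertion passes and the loop starts a second iteration.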
is depended on by: SERVER-44771 Allow operations in transactions to safely consult the CatalogCache on mongod (Closed)