The ReshardingOplogFetcher uses ShardRemote::runAggregation() to run resharding's oplog fetching pipeline. ShardRemote::runAggregation() uses the Fetcher class to schedule the remote network requests. Fetcher::join() doesn't wait using an OperationContext so it continues to block even after the node steps down. For as long as the network request continues to run on the remote node, the the still-active Instance will prevent PrimaryOnlyService::onStepUp() and the overall step-up procedure from completing.
We should instead have Fetcher::join() wait using an OperationContext so the Fetcher::~Fetcher() destructor can abandon waiting for the remote network request.
- is related to
-
SERVER-60859 ReshardingCoordinator waits on _canEnterCritical future without cancellation, potentially preventing config server primary step-up from ever completing
- Closed
-
SERVER-61633 Resharding's RecipientStateMachine doesn't join thread pool for ReshardingOplogFetcher, leading to server crash at shutdown
- Closed