Type: Task
Resolution: Gone away
Priority: Major - P3
None
Affects Version/s: None
Component/s: None
Sharding NYC
The problem is related to SERVER-70850 and SERVER-70487. The targeted scenario is a chunk migration that is interrupted by a simultaneous step-down of the donor and the config server (read: catalog shard). In this case the recipient is stuck in the critical section, waiting to be told to release it.
The donor is responsible for sending _configsvrEnsureChunkVersionIsGreaterThan to the config server and then _recvChunkReleaseCritSec to the recipient. Until the recipient receives _recvChunkReleaseCritSec, the two sides are deadlocked: the config server's Balancer, while starting, waits for all move chunk participants to exit the critical section, and the recipient will not exit the critical section until something tells it to. This may cause the Balancer to be stuck in the init() method for minutes.
The attached fix is not super clean. It amends PeriodicShardedIndexConsistencyChecker::onStepUp() in the following way:
if (serverGlobalParams.clusterRole == ClusterRole::CatalogShard &&
    _shardedIndexConsistencyChecker.isValid()) {
    ...
    _shardedIndexConsistencyChecker.stop();
    _shardedIndexConsistencyChecker.detach();
    ...
}
The reason is that the index consistency checker scans all collections and generates a StaleConfigInfo error on the collection whose shard is stuck in the critical section. This error then causes _recvChunkReleaseCritSec to be sent to the recipient. I think there should be a cleaner way to force the check to generate the StaleConfigInfo. The approach I used works but is not clean.
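To make the interaction easier to follow, here is a minimal toy model of the sequence described in this ticket. It is only a sketch, not MongoDB code: the types Recipient and Donor and the functions balancerInitCanProceed and scanCollections are hypothetical names invented for illustration. Only the two command names come from the description above. The model assumes that the donor's recovery runs only when something (here, a stale-config error raised during a collection scan) triggers it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the migration recipient: it stays in the
// critical section until it is explicitly told to release it.
struct Recipient {
    bool inCriticalSection = true;
    void releaseCritSec() { inCriticalSection = false; }
};

// Hypothetical stand-in for the donor's recovery path. The two commands are
// recorded in the order the ticket describes.
struct Donor {
    std::vector<std::string> sentCommands;

    void recoverMigration(Recipient& recipient) {
        sentCommands.push_back("_configsvrEnsureChunkVersionIsGreaterThan");
        sentCommands.push_back("_recvChunkReleaseCritSec");
        recipient.releaseCritSec();
    }
};

// The Balancer's init step can only finish once no participant is still
// inside the critical section (modeled here with a single recipient).
bool balancerInitCanProceed(const Recipient& recipient) {
    return !recipient.inCriticalSection;
}

// Stand-in for the index consistency check: visiting the collection whose
// shard is stuck is assumed to raise a stale-config error, which in turn
// triggers the donor's recovery. Returns true if recovery was triggered.
bool scanCollections(const std::vector<std::string>& collections,
                     const std::string& stuckCollection,
                     Donor& donor,
                     Recipient& recipient) {
    for (const auto& ns : collections) {
        if (ns == stuckCollection) {
            donor.recoverMigration(recipient);
            return true;
        }
    }
    return false;
}
```

In this model the Balancer cannot proceed until the scan reaches the stuck collection, which mirrors why forcing the check on step-up unblocks the recipient. A cleaner fix would trigger the recovery directly instead of relying on the scan as a side effect.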