Type: Task
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: Sharding
Fully Compatible
9
If a network error occurs while the moveChunk receiver is communicating with the config server, the operation hangs for 30 seconds and then fails, because the startCommit timeout equals the delay before a failed network request is retried.
Detailed explanation
In the moveChunk flow, on the receiver side, the migrateThread calls MigrationDestinationManager::_migrateDriver to perform the necessary steps. It then notifies the _isActiveCV condition variable, on which startCommit waits for at most 30 seconds.
After each step of MigrationDestinationManager::_migrateDriver, the state is logged on the CSRS through the MoveTimingHelper, which calls into the ShardingLogger to insert a config document. As highlighted in SERVER-51397, if a network partition happens during a CatalogClient request, the first retry happens only after 30 seconds, which is too late because the startCommit timeout is also exactly 30 seconds.
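The timing problem described above can be illustrated with a minimal, standalone C++ sketch. It is not the server code: MigrationSketch, migrateThread, startCommitSketch and all of their parameters are hypothetical names chosen for illustration, the logging step is reduced to a single request that is retried at most once, and the durations are meant to be scaled down to milliseconds. The only point is that a waiter whose timeout equals the retry delay can never observe the notification once the first attempt fails.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;
using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for the receiver-side migration state.
struct MigrationSketch {
    std::mutex mutex;
    std::condition_variable isActiveCV;  // plays the role of _isActiveCV
    bool stepsDone = false;
};

// Hypothetical model of the migrate thread: one config server write that
// either succeeds on the first attempt or is retried once after retryDelay.
void migrateThread(MigrationSketch& m, bool firstAttemptFails,
                   Millis retryDelay, Millis requestLatency) {
    if (firstAttemptFails) {
        // Assumption: the failed attempt is only retried after retryDelay.
        std::this_thread::sleep_for(retryDelay);
    }
    // Time taken by the attempt that finally succeeds.
    std::this_thread::sleep_for(requestLatency);
    {
        std::lock_guard<std::mutex> lk(m.mutex);
        m.stepsDone = true;
    }
    m.isActiveCV.notify_all();
}

// Hypothetical model of startCommit: returns true only if the migrate
// thread signals completion before commitTimeout elapses.
bool startCommitSketch(bool firstAttemptFails, Millis commitTimeout,
                       Millis retryDelay, Millis requestLatency) {
    assert(commitTimeout > Millis::zero());
    MigrationSketch m;
    std::thread worker(migrateThread, std::ref(m), firstAttemptFails,
                       retryDelay, requestLatency);
    bool committed;
    {
        std::unique_lock<std::mutex> lk(m.mutex);
        // The predicate guards against spurious wakeups.
        committed = m.isActiveCV.wait_for(lk, commitTimeout,
                                          [&] { return m.stepsDone; });
    }
    worker.join();
    return committed;
}
```

With both the commit timeout and the retry delay set to the same value (for example 200ms standing in for 30 seconds) and a request latency of 100ms, startCommitSketch(false, 200ms, 200ms, 100ms) is expected to succeed, while startCommitSketch(true, 200ms, 200ms, 100ms) is expected to time out, since the retry only starts when the waiter's deadline is already reached.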