Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.9.0
Affects Version/s: None
Component/s: Sharding
Labels:
- Sharding-EMEA
- sharding-wfbf-day

Backwards Compatibility:
Fully Compatible
Linked BF Score:
9
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

If a network error occurs while the moveChunk receiver is communicating with the config server, the operation hangs for 30 seconds and then fails, because the startCommit timeout equals the delay before a failed network request is retried.

Detailed explanation

In the moveChunk flow, on the receiver side, the migrateThread calls MigrationDestinationManager::_migrateDriver to perform the necessary steps. Once the driver returns, the thread notifies the _isActiveCV condition variable, on which startCommit waits for a maximum of 30 seconds.

After each step of MigrationDestinationManager::_migrateDriver, the state is logged on the CSRS through the MoveTimingHelper, which calls into the ShardingLogger to insert a config document. As highlighted in ~~SERVER-51397~~, if a network partition happens during a CatalogClient request, the first retry happens only after 30 seconds. That is too late, because the startCommit timeout is exactly 30 seconds.
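The timing problem described above can be illustrated with a minimal, self-contained sketch. It is not the server's code: `FakeReceiver`, `driverDone`, and `commitSucceeds` are hypothetical names, the network retry is simulated with a sleep, and the 30-second values are meant to be scaled down to milliseconds when experimenting. It only shows that a waiter on a condition variable gives up if the notifying thread stays blocked for at least as long as the wait timeout.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the receiver-side state (not the real
// MigrationDestinationManager).
struct FakeReceiver {
    std::mutex mutex;
    std::condition_variable isActiveCV;  // plays the role of _isActiveCV
    bool driverDone = false;

    // Simulates the migrate thread: a config write hits a network error and
    // is retried only after `retryDelay`; afterwards the waiter is notified.
    void migrateThread(std::chrono::milliseconds retryDelay) {
        std::this_thread::sleep_for(retryDelay);  // blocked until the first retry
        {
            std::lock_guard<std::mutex> lk(mutex);
            driverDone = true;
        }
        isActiveCV.notify_all();
    }

    // Simulates startCommit: waits at most `commitTimeout` for the driver.
    // Returns false if the timeout expires before the driver has finished.
    bool startCommit(std::chrono::milliseconds commitTimeout) {
        std::unique_lock<std::mutex> lk(mutex);
        return isActiveCV.wait_for(lk, commitTimeout,
                                   [this] { return driverDone; });
    }
};

// Runs one simulated migration and reports whether the commit wait succeeded.
bool commitSucceeds(std::chrono::milliseconds commitTimeout,
                    std::chrono::milliseconds retryDelay) {
    FakeReceiver receiver;
    std::thread worker([&] { receiver.migrateThread(retryDelay); });
    bool ok = receiver.startCommit(commitTimeout);
    worker.join();
    return ok;
}
```

Calling `commitSucceeds` with a retry delay clearly longer than the commit timeout returns `false`, while a retry delay well below the timeout returns `true`. When the two values are equal, as in this ticket, the migrate thread still has work to do after the retry, so the wait is expected to expire first.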

Assignee: Pierlauro Sciarelli
Reporter: Pierlauro Sciarelli
Participants: Githook User, Pierlauro Sciarelli
Votes: 0
Watchers: 1

Created: Feb 05 2021 01:51:46 PM UTC
Updated: Oct 29 2023 09:57:54 PM UTC
Resolved: Mar 18 2021 07:50:41 PM UTC
Confidence Status Last Update: 18/Mar/21 4:45 PM
