Summary: A resharding refresh can hang on a recipient if the recipient acquires the critical section after flushReshardingStateChanges has acquired the locks, checked the critical section, and released the locks, but before it calls onShardVersionMismatch.
Scenario of recipient0 causing a hang:
- recipient1 errors, transitions from steady-state to error
[js_test:resharding_fails_on_nonempty_stash] d20026| 2021-05-04T00:56:36.146+00:00 I RESHARD 5279506 [ReshardingRecipientService-2] "Transitioned resharding recipient state","attr":{"newState":"error","oldState":"steady-state","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
- the coordinator transitions to error
[js_test:resharding_fails_on_nonempty_stash] c20028| 2021-05-04T00:56:36.159+00:00 I RESHARD 5343001 [ReshardingCoordinatorService-1] "Transitioned resharding coordinator state","attr":{"newState":"error","oldState":"blocking-writes","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
- recipient0 transitions to strict-consistency after flushReshardingStateChanges has acquired the locks, checked the critical section, and released the locks, but before it calls onShardVersionMismatch
[js_test:resharding_fails_on_nonempty_stash] d20024| 2021-05-04T00:56:36.160+00:00 I RESHARD 5279506 [ReshardingRecipientService-0] "Transitioned resharding recipient state","attr":{"newState":"strict-consistency","oldState":"steady-state","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"23d3b803-bbd6-4766-beea-80d7d50e884a"}},"reshardingUUID":{"uuid":{"$uuid":"43ab79c0-6627-4c12-ac92-a4b315575c00"}}}
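The interleaving above is a check-then-act race. The following is a minimal, deterministic sketch of that ordering in Python. It is a toy model, not the server's code: `ToyRecipient`, `flush_state_changes`, the `between_check_and_refresh` hook, and the returned strings are all hypothetical names chosen for illustration, and the "blocked" result merely stands in for the hang.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecipient:
    """Toy stand-in for a recipient shard; not MongoDB's real state."""
    in_critical_section: bool = False
    events: list = field(default_factory=list)

    def check_critical_section(self) -> bool:
        # Models "acquire locks, check the critical section, release locks".
        self.events.append("flush: checked critical section")
        return self.in_critical_section

    def enter_strict_consistency(self) -> None:
        # Models the recipient acquiring the critical section.
        self.in_critical_section = True
        self.events.append("recipient: entered critical section")

    def refresh(self) -> str:
        # Models the refresh step. In this toy model it cannot make progress
        # while the critical section is held, which stands in for the hang.
        if self.in_critical_section:
            self.events.append("flush: refresh blocked")
            return "blocked"
        self.events.append("flush: refresh completed")
        return "completed"


def flush_state_changes(recipient, between_check_and_refresh=None) -> str:
    """Check-then-act sequence with an optional hook in the unlocked window."""
    if recipient.check_critical_section():
        # Assumption: the ticket does not describe this branch; the sketch
        # simply reports that no refresh was attempted.
        return "deferred"
    if between_check_and_refresh is not None:
        # Whatever runs here executes after the check but before the refresh.
        between_check_and_refresh()
    return recipient.refresh()


if __name__ == "__main__":
    quiet = ToyRecipient()
    print(flush_state_changes(quiet))

    racy = ToyRecipient()
    print(flush_state_changes(racy, racy.enter_strict_consistency))
    print(racy.events)
```

The second call reproduces the ordering from the logs: the check sees no critical section, the recipient then enters it, and the refresh runs against state that the earlier check no longer describes.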
- is duplicated by: SERVER-55852 Shards first acquire LockManager locks before reacting to abortReshardCollection (Closed)
- is related to: SERVER-56612 Use the resharding-specific refresh function when recovering a resharding operation in the drain mode (Closed)
- related to: SERVER-57953 _flushReshardingStateChange attempts to refresh shard version while another refresh already pending, leading to invariant failure (Closed)