Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 6.3.0-rc0, 6.0.5, 5.0.16
Affects Version/s: 5.0.13, 6.0.2, 6.1.0-rc4, 6.2.0-rc0
Component/s: None
Labels:
- sharding-wfbf-day

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v6.2, v6.0, v5.0
Steps To Reproduce:

A migration starts on P0, the current primary node of the donor shard.

The commit of the migration fails due to a network error.

_cleanup() is executed and, as part of it, endMetadataOp() persists the latest config time. This is useless, since we still do not know a config time that is inclusive of the migration commit.

Asynchronous recovery of the migration is spawned.

During recovery we read from the config server again and realize that the commit actually succeeded.

We call completeMigration() and persist the migration decision in the coordinator document without calling endMetadataOp().

A stepdown happens before the coordinator document is removed.

A new primary node P1 of the donor shard is elected and tries to recover the migration again, since the coordinator document is still present.

This time it finds that the migration decision has already been set to kCommitted and calls completeMigration() again without calling endMetadataOp().

Since the configOpTime known to P1 is not inclusive of the latest committed migration, there is no guarantee that any subsequent refresh of the filtering metadata will include the committed migration.
Sprint:
Sharding EMEA 2022-12-12, Sharding EMEA 2022-12-26, Sharding EMEA 2023-01-09
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

The donor of a chunk migration calls ShardingStateRecovery::endMetadataOp(), which persists the configOpTime inclusive of the migration commit. This ensures that, in case of a stepdown, the next primary node sees the effect of the commit performed by the previous primary when it reads from the config server.

The problem is that endMetadataOp() is not called after recovering a failed migration. If the donor experiences an error during the commit (for example, a network error) followed by a stepdown, there is no guarantee that the next primary node will install the correct filtering metadata inclusive of the last migration.

The proposed solution is to add a call to VectorClock::waitForDurableConfigTime() just before writing the commit decision to the migration coordinator document.
This will be executed both when no error occurs during the commit and during migration recovery.
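
The ordering problem and the proposed fix can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the server code: `ConfigServer`, `DonorShard`, `Primary`, `persistConfigTime()`, `recoverMigration()` and `newPrimarySeesCommit()` are hypothetical names invented for illustration, and `persistConfigTime()` merely stands in for the effect of waiting for a durable config time. The model is deterministic, whereas the real issue is a missing guarantee rather than a certain failure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model only: these types are hypothetical and are not the server's classes.
using Time = std::uint64_t;

struct ConfigServer {
    Time opTime = 0;
    // Committing a migration advances the config server's time.
    Time commitMigration() { return ++opTime; }
};

// State replicated within the donor shard, so it survives a stepdown.
struct DonorShard {
    Time durableConfigTime = 0;
    bool decisionCommitted = false;
};

// In-memory state of a single primary; it is lost on stepdown.
struct Primary {
    Time knownConfigTime = 0;
    // Reading from the config server advances the locally known config time.
    void readFrom(const ConfigServer& cs) {
        knownConfigTime = std::max(knownConfigTime, cs.opTime);
    }
};

// Hypothetical stand-in for making the known config time durable on the shard.
void persistConfigTime(DonorShard& shard, const Primary& p) {
    shard.durableConfigTime = std::max(shard.durableConfigTime, p.knownConfigTime);
}

// Recovery path: learn the outcome from the config server, then record the decision.
void recoverMigration(DonorShard& shard, Primary& p, const ConfigServer& cs, bool withFix) {
    p.readFrom(cs);
    if (withFix) {
        // Proposed ordering: make the config time durable before the decision.
        persistConfigTime(shard, p);
    }
    shard.decisionCommitted = true;
}

// Returns true if the new primary P1 starts with a config time that
// is inclusive of the committed migration.
bool newPrimarySeesCommit(bool withFix) {
    ConfigServer cs;
    DonorShard shard;
    Primary p0;
    p0.readFrom(cs);
    const Time commitTime = cs.commitMigration();  // commit succeeds, reply is lost
    persistConfigTime(shard, p0);                  // cleanup persists a stale time
    recoverMigration(shard, p0, cs, withFix);
    Primary p1;                                    // stepdown: P1 only has durable state
    p1.knownConfigTime = shard.durableConfigTime;
    return shard.decisionCommitted && p1.knownConfigTime >= commitTime;
}

// Checks both orderings of the toy model.
void checkModel() {
    assert(!newPrimarySeesCommit(false));  // without the fix, P1's time is stale
    assert(newPrimarySeesCommit(true));    // with the fix, P1's time covers the commit
}
```

In this model the only difference between the two runs is whether the config time is made durable before the decision is recorded, which mirrors the placement described above: the same step then covers both the error-free commit path and the recovery path.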

Assignee: Tommaso Tocci
Reporter: Tommaso Tocci
Participants: Githook User, Tommaso Tocci
Votes: 0
Watchers: 4

Created: Nov 14 2022 04:59:03 PM UTC
Updated: Oct 29 2023 09:30:37 PM UTC
Resolved: Jan 06 2023 01:04:29 AM UTC
Confidence Status Last Update: 07/Dec/22 1:22 PM
