...and shard primaries should read the configOpTime and wait for it to become durable on step-up, before they resume coordinating commits.

Otherwise, a new coordinator primary might assume a participant shard has received the decision, even though it hasn't. Example:

1) Coordinator Primary 1 (P1) sends prepare to Shard A.
2) Shard A enters prepare.
3) P1 steps down. Coordinator Primary 2 (P2) resumes the coordination and refreshes its ShardRegistry from a stale config server that does not have Shard A.
4) P2 gets ShardNotFound for Shard A and treats the ShardNotFound as an ack.
5) Shard A remains in prepare forever.
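
The sequence above can be reproduced in a toy model. This is only an illustrative sketch, not server code: the `Shard` and `Coordinator` classes, their methods, and `run_scenario` are hypothetical names, and the ShardRegistry is reduced to a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class ShardNotFound(Exception):
    """Raised when the registry in use does not list the target shard."""


@dataclass
class Shard:
    name: str
    state: str = "idle"


@dataclass
class Coordinator:
    # Hypothetical stand-in for a ShardRegistry: shard name -> Shard,
    # as seen by whichever config server this node last refreshed from.
    registry: Dict[str, Shard] = field(default_factory=dict)

    def _lookup(self, shard_name: str) -> Shard:
        if shard_name not in self.registry:
            raise ShardNotFound(shard_name)
        return self.registry[shard_name]

    def send_prepare(self, shard_name: str) -> None:
        self._lookup(shard_name).state = "prepared"

    def send_decision(self, shard_name: str, decision: str) -> bool:
        """Return True if the participant is considered to have acked."""
        try:
            self._lookup(shard_name).state = decision
        except ShardNotFound:
            # The problematic behaviour: a missing shard counts as an ack.
            return True
        return True


def run_scenario() -> Tuple[bool, str]:
    shard_a = Shard("A")

    # P1 has an up-to-date registry and sends prepare to Shard A.
    p1 = Coordinator(registry={"A": shard_a})
    p1.send_prepare("A")

    # P1 steps down. P2 refreshes from a stale config server
    # whose view does not contain Shard A.
    p2 = Coordinator(registry={})
    acked = p2.send_decision("A", "committed")

    return acked, shard_a.state


if __name__ == "__main__":
    print(run_scenario())
```

In this model P2 records an ack even though Shard A never received the decision, so the shard's state stays `"prepared"`.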

The suggested fix is to:

1) make coordinators update the configOpTime in the minOpTimeRecovery document when they write the participant list. (The coordinator then waits for writeConcern, which covers both writes.)
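
A rough sketch of the idea, under loud assumptions: opTimes are modelled as plain integers, and `ConfigView`, `DurableState`, `persist_participants`, and `refresh_on_step_up` are hypothetical names that do not correspond to actual server functions. The point is only that a new primary which recovers the persisted configOpTime can refuse a config view older than it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ConfigView:
    """What one config server node knows, tagged with its opTime (assumed int)."""
    op_time: int
    shards: FrozenSet[str]


@dataclass
class DurableState:
    """Stands in for data that survives a coordinator failover."""
    participants: Optional[List[str]] = None
    # Plays the role of the configOpTime kept in the minOpTimeRecovery document.
    config_op_time: Optional[int] = None


def persist_participants(state: DurableState,
                         participants: Iterable[str],
                         config_op_time: int) -> None:
    # Both fields are written together, so a single writeConcern wait
    # (not modelled here) would cover both writes.
    state.participants = list(participants)
    state.config_op_time = config_op_time


def refresh_on_step_up(state: DurableState,
                       views: Iterable[ConfigView]) -> Optional[ConfigView]:
    """Return the first view at least as recent as the recovered configOpTime.

    Returns None if every offered view is stale, in which case the new
    primary should keep waiting instead of resuming coordination.
    """
    for view in views:
        if state.config_op_time is None or view.op_time >= state.config_op_time:
            return view
    return None


if __name__ == "__main__":
    state = DurableState()
    persist_participants(state, ["A"], config_op_time=10)

    stale = ConfigView(op_time=5, shards=frozenset())
    fresh = ConfigView(op_time=10, shards=frozenset({"A"}))

    chosen = refresh_on_step_up(state, [stale, fresh])
    print(chosen)
```

With the recovered configOpTime of 10, the stale view at opTime 5 is skipped, so the new primary ends up with a view that still contains Shard A and never sees a spurious ShardNotFound.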

is related to

SERVER-50146 Removing a shard with 'uncommitted' documents in config.rangeDeletions on migration recipient can lead to incomplete state on donor

related to

SERVER-53005 Complete TODO listed in SERVER-38918

Assignee: [DO NOT USE] Backlog - Sharding Team
Reporter: Esha Maharishi (Inactive)
Participants: [DO NOT USE] Backlog - Sharding Team, Esha Maharishi, Kaloian Manassiev
Votes: 0
Watchers: 5