Core Server / SERVER-48198

Migration recovery may recover incorrect decision after shard key refine

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.4.0-rc7, 4.7.0
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v4.4
    • Sprint: Sharding 2020-05-18

      When a new primary steps up in a shardsvr replica set, it launches a task to recover any migrations driven by the node's shard that were in progress when the previous primary stepped down. As part of this, the recovery process determines the outcome of each migration by loading the latest metadata from the config server and checking whether the migration's minimum bound still belongs to the donor shard. If it does, the migration is assumed to have aborted, and the recovery process updates the persisted range deleter state on the donor and recipient shards so that any orphaned documents on either are deleted.
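
      Below is a minimal sketch of that decision logic. The types and names are illustrative stand-ins, not the server's actual API, and the shard key is simplified to a single integer field:

{code:c++}
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Simplified routing table for a single-field shard key: chunk min -> owning
// shard. Assumes the table covers the whole key space from the global minimum.
using RoutingTable = std::map<int, std::string>;

// Owner of the chunk containing `key`: the entry with the greatest min <= key.
std::string ownerOf(const RoutingTable& rt, int key) {
    return std::prev(rt.upper_bound(key))->second;
}

enum class Decision { kAborted, kCommitted };

// The recovery heuristic described above: if the donor still owns the
// migrated range's min bound in the refreshed metadata, assume the migration
// aborted; otherwise assume it committed.
Decision recoverDecision(const RoutingTable& latest, int migrationMin,
                         const std::string& donor) {
    return ownerOf(latest, migrationMin) == donor ? Decision::kAborted
                                                  : Decision::kCommitted;
}

int main() {
    // After a committed migration of [10, max) from shard0 to shard1:
    RoutingTable latest{{0, "shard0"}, {10, "shard1"}};
    assert(recoverDecision(latest, 10, "shard0") == Decision::kCommitted);
}
{code}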

      If an interrupted migration committed successfully and its namespace had its shard key refined before the recovery process runs, the ownership check compares the pre-refine minimum bound against a post-refine routing table. The pre-refine bound can spuriously appear to overlap a range still owned by the donor, leading the recovery process to incorrectly decide the migration aborted and preventing any orphans on the donor from being cleaned up. The recipient will then attempt to schedule a range deletion for the received range, which will fail with RangeOverlapConflict.
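
      A worked example of that mismatch, in a simplified model where a bound is a vector of shard key values and std::nullopt stands in for MinKey (the vector's lexicographic ordering mimics the relevant comparison, in which a shorter prefix sorts before its extensions):

{code:c++}
#include <cassert>
#include <optional>
#include <vector>

using Value = std::optional<int>;  // std::nullopt plays the role of MinKey
using Bound = std::vector<Value>;

int main() {
    // Migration bound recorded under the old shard key {x: 1}.
    Bound migrationMin{10};                   // {x: 10}

    // After refining the key to {x: 1, y: 1}, the committed chunk's min
    // bound in the routing table became {x: 10, y: MinKey}.
    Bound refinedChunkMin{10, std::nullopt};  // {x: 10, y: MinKey}

    // The pre-refine bound sorts *before* the refined chunk's min, so a
    // containment check places it in the preceding chunk -- which may still
    // belong to the donor, so recovery wrongly concludes the migration
    // aborted.
    assert(migrationMin < refinedChunkMin);
}
{code}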

      To fix this, the recovery process should extend the migration's min bound when performing the ownership check if the latest shard key has more fields than the persisted bound, as was done in SERVER-46386.
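
      A sketch of that fix in the same simplified model; extendMinBound is an invented helper name standing in for the bound extension applied for SERVER-46386 (on a minimum bound, fields added by the refine take MinKey):

{code:c++}
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using Value = std::optional<int>;  // std::nullopt plays the role of MinKey
using Bound = std::vector<Value>;

// Hypothetical helper: pad `bound` out to `numKeyFields` fields, filling the
// fields the refined shard key added with MinKey.
Bound extendMinBound(Bound bound, std::size_t numKeyFields) {
    while (bound.size() < numKeyFields)
        bound.push_back(std::nullopt);
    return bound;
}

int main() {
    Bound migrationMin{10};                   // recorded under {x: 1}
    Bound refinedChunkMin{10, std::nullopt};  // chunk min under {x: 1, y: 1}

    // After extension the bounds compare equal, so the ownership check sees
    // the migrated range exactly where the committed chunk now lives and
    // correctly recovers a commit decision.
    assert(extendMinBound(migrationMin, 2) == refinedChunkMin);
}
{code}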

            Assignee:
            Jack Mulrow
            Reporter:
            Jack Mulrow
            Votes:
            0
            Watchers:
            4
