Core Server / SERVER-67530

Loss of shard RS primary can lead to a loss of read availability for a collection after failed migration

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • None
    • Affects Version/s: 5.0.3, 6.1.0-rc0
    • Component/s: None
    • Sharding EMEA
    • ALL
    • Sharding 2022-06-27

      If a shard's replica set loses its primary node and is unable to elect a new one at the inopportune moment when a failed migration needs to be recovered, the collection will not be available on that shard.

      This is because the recipient clears the filtering metadata when an error occurs. Completing the migration also fails because the recipient's primary cannot be reached, so the migration coordinator document is not deleted. As a result, when new requests for the collection come in, shard recovery is triggered, because the filtering metadata was cleared earlier. The recovery process finds that the migration coordinator document still exists and tries to complete the migration again, but times out while looking for the primary through the replica set monitor (RSM).
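
      A minimal Python sketch of the failure sequence described above, assuming a toy model. The names `Shard`, `ReplicaSet`, `find_primary`, `complete_migration`, `fail_migration` and `route_request` are hypothetical and do not correspond to MongoDB server code, and the split of roles between shards is simplified. It shows why availability stays lost while no primary can be found: every request re-enters recovery, recovery retries the completion step, and that step cannot delete the coordinator document until a primary is reachable again.

```python
from dataclasses import dataclass, field


class PrimaryNotFound(Exception):
    """Hypothetical stand-in for the RSM timing out while looking for a primary."""


@dataclass
class ReplicaSet:
    name: str
    has_primary: bool = True


@dataclass
class Shard:
    name: str
    filtering_metadata_known: bool = True
    # IDs of migration coordinator documents still persisted (toy model).
    coordinator_docs: list = field(default_factory=list)


def find_primary(rs: ReplicaSet) -> str:
    # Toy replica set monitor lookup: fails while no primary is elected.
    if not rs.has_primary:
        raise PrimaryNotFound(f"could not find primary for {rs.name}")
    return f"{rs.name}/primary"


def complete_migration(shard: Shard, recipient: ReplicaSet, migration_id: str) -> None:
    # The coordinator document is deleted only after the recipient's primary is reached.
    find_primary(recipient)
    shard.coordinator_docs.remove(migration_id)


def fail_migration(shard: Shard, recipient: ReplicaSet, migration_id: str) -> None:
    shard.coordinator_docs.append(migration_id)  # persisted when the migration starts
    shard.filtering_metadata_known = False       # filtering metadata cleared on error
    try:
        complete_migration(shard, recipient, migration_id)
    except PrimaryNotFound:
        pass  # completion fails, so the coordinator document survives


def route_request(shard: Shard, recipient: ReplicaSet) -> str:
    if not shard.filtering_metadata_known:
        # Toy shard recovery: pending migrations must complete before metadata is restored.
        for migration_id in list(shard.coordinator_docs):
            complete_migration(shard, recipient, migration_id)
        shard.filtering_metadata_known = True
    return "served"


if __name__ == "__main__":
    shard = Shard("shard0")
    recipient = ReplicaSet("rs1", has_primary=False)
    fail_migration(shard, recipient, "mig-1")
    try:
        route_request(shard, recipient)
    except PrimaryNotFound as exc:
        print("request failed:", exc)
    recipient.has_primary = True  # a primary is elected again
    print(route_request(shard, recipient))
```

      In this model, once `recipient.has_primary` becomes true again, the next request runs recovery to completion, deletes the coordinator document and is served.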

        Attachments:
          1. test_v6.1.js (1 kB, Randolph Tan)
          2. test.js (2 kB, Randolph Tan)

            Assignee: [DO NOT USE] Backlog - Sharding EMEA (backlog-server-sharding-emea)
            Reporter: Randolph Tan (randolph@mongodb.com)
            Votes: 0
            Watchers: 12
