Core Server / SERVER-46386

Refining a shard key may lead to an orphan range never being cleaned up

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.4.0-rc0, 4.7.0
    • Affects Version/s: None
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • v4.4
    • Steps To Reproduce:

      Apply server42192.patch, which allows the moveChunk command to be retried automatically in the presence of failovers, then run the random_moveChunk_refine_collection_shard_key.js FSM workload. The --repeat option is necessary because this concurrency test reproduces the issue often, but not on every run.

      python buildscripts/resmoke.py --suite=concurrency_sharded_multi_stmt_txn_terminate_primary jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key.js --repeat=10
      
    • Sharding 2020-03-09, Sharding 2020-03-23

      The config.rangeDeletions collection stores documents of the following form to track range deletion tasks that still need to be performed:

      {
          "_id" : UUID("78447b8a-84d6-4555-a97d-b4bc2d708e29"),
          "nss" : "test4_fsmdb0.fsmcoll0_29",
          "collectionUuid" : UUID("1eafdb97-2114-4b72-a6ce-34e96aa3df1a"),
          "donorShardId" : "shard-rs1",
          "range" : {
              "min" : {
                  "a" : 400
              },
              "max" : {
                  "a" : { "$maxKey" : 1 }
              }
          },
          "whenToClean" : "delayed"
      }
      

      The "range" field of this document is left unmodified after a user has successfully run the refineCollectionShardKey command. As a result, if the following sequence of events occurs, a "RangeOverlapConflict: Requested deletion range overlaps a live shard chunk" error prevents the range deleter from ever deleting the range of orphan documents.

      1. The test4_fsmdb0.fsmcoll0_29 collection is sharded on {a: 1} and has chunks
        • {a: MinKey} -> {a: 100}
        • {a: 100} -> {a: 200}
        • {a: 200} -> {a: 300}
        • {a: 300} -> {a: 400}
        • {a: 400} -> {a: MaxKey}
      2. A migration begins for chunk {a: 400} -> {a: MaxKey} from shard-rs1 to shard-rs0.
      3. The migration completes successfully, but the primary of shard-rs1 steps down before the range deletion task completes.
      4. A user runs the refineCollectionShardKey command, changing the shard key of the test4_fsmdb0.fsmcoll0_29 collection to {a: 1, b: 1}. The chunk bounds are therefore augmented to be
        • {a: MinKey, b: MinKey} -> {a: 100, b: MinKey}
        • {a: 100, b: MinKey} -> {a: 200, b: MinKey}
        • {a: 200, b: MinKey} -> {a: 300, b: MinKey}
        • {a: 300, b: MinKey} -> {a: 400, b: MinKey}
        • {a: 400, b: MinKey} -> {a: MaxKey, b: MaxKey}
      5. The newly elected primary of shard-rs1 attempts to schedule the range deletion task still present in its config.rangeDeletions collection. Because {a: 400} < {a: 400, b: MinKey}, the stored range {a: 400} -> {a: MaxKey} is considered to partially overlap the chunk {a: 300, b: MinKey} -> {a: 400, b: MinKey}. If shard-rs1 happens to own that chunk, it will refuse to perform the range deletion for {a: 400, b: MinKey} -> {a: MaxKey, b: MaxKey}.
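
The overlap in step 5 can be reproduced with a small standalone sketch. It is not the server's implementation: the MinKey and MaxKey symbols are stand-ins for the BSON types, and compareValues, compareBounds, refineBound, and rangesOverlap are hypothetical helper names. It assumes that a bound which is a strict prefix of another sorts first, as the comparison {a: 400} < {a: 400, b: MinKey} above implies, and that ranges are half-open [min, max).

```javascript
// Stand-ins for BSON MinKey/MaxKey (assumption: not the real BSON types).
const MinKey = Symbol("MinKey");
const MaxKey = Symbol("MaxKey");

// Order scalar key values so that MinKey < any number < MaxKey.
function compareValues(x, y) {
  if (x === y) return 0;
  if (x === MinKey || y === MaxKey) return -1;
  if (x === MaxKey || y === MinKey) return 1;
  return x < y ? -1 : x > y ? 1 : 0;
}

// Compare two bounds field by field in shard key order.
// Assumption: a bound that is a strict prefix of the other sorts first.
function compareBounds(lhs, rhs, fields) {
  for (const f of fields) {
    const inLhs = f in lhs;
    const inRhs = f in rhs;
    if (!inLhs && !inRhs) continue;
    if (!inLhs) return -1;
    if (!inRhs) return 1;
    const c = compareValues(lhs[f], rhs[f]);
    if (c !== 0) return c;
  }
  return 0;
}

// Extend a bound as step 4 shows: new fields get MinKey, except that the
// global max bound gets MaxKey for the new fields.
function refineBound(bound, oldFields, newFields) {
  const isGlobalMax = oldFields.every((f) => bound[f] === MaxKey);
  const refined = { ...bound };
  for (const f of newFields) {
    if (!(f in refined)) refined[f] = isGlobalMax ? MaxKey : MinKey;
  }
  return refined;
}

// Half-open ranges [min, max) overlap when each starts before the other ends.
function rangesOverlap(r1, r2, fields) {
  return compareBounds(r1.min, r2.max, fields) < 0 &&
         compareBounds(r2.min, r1.max, fields) < 0;
}

const oldFields = ["a"];
const newFields = ["a", "b"];

// The "range" left unmodified in config.rangeDeletions after the refine.
const staleTask = { min: { a: 400 }, max: { a: MaxKey } };

// The chunk {a: 300} -> {a: 400} after the refine, possibly owned by shard-rs1.
const ownedChunk = {
  min: refineBound({ a: 300 }, oldFields, newFields),
  max: refineBound({ a: 400 }, oldFields, newFields),
};

// The same task with its bounds extended like the chunk bounds in step 4.
const refinedTask = {
  min: refineBound(staleTask.min, oldFields, newFields),
  max: refineBound(staleTask.max, oldFields, newFields),
};

console.log(rangesOverlap(staleTask, ownedChunk, newFields));   // true
console.log(rangesOverlap(refinedTask, ownedChunk, newFields)); // false
```

Under these assumptions, the stale range conflicts with the neighboring chunk only because its lower bound lacks the b field. Once the bounds are extended the same way as the chunk bounds, the two ranges merely touch at {a: 400, b: MinKey} and no longer overlap.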

            Assignee:
            matthew.saltz@mongodb.com Matthew Saltz (Inactive)
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn