Core Server / SERVER-78498

Make the balancer failpoint smarter

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 7.1.0-rc0, 7.0.1
    • Affects Version/s: None
    • Component/s: None
    • Team: Sharding EMEA
    • Backwards Compatibility: Fully Compatible
    • Backport Requested: v7.0, v6.0, v5.0, v4.4
    • Sprint: Sharding EMEA 2023-08-21
    • Story Points: 2

      When using the balancerShouldReturnRandomMigrations and overrideBalanceRoundInterval failpoints to induce chunk migrations, the balancer becomes overzealous and starts too many migrations at once. This results in many of the migrations colliding and getting cancelled, thereby reducing our test coverage of chunk migrations.

      Consider this example: we have a 2-shard cluster with 2 sharded collections and the above failpoints set. I noticed that 99% of chunk migrations were failing, due to the following pattern:
      0. Balancer enqueues chunk migration requests for both shards.
      1. Balancer sends request to Shard 0 to move chunk of nssA to Shard 1.
      2. Balancer sends request to Shard 1 to move chunk of nssB to Shard 0.
      3. Shard 0 receives the request. Prints "Starting chunk migration donation"
      4. Shard 1 receives the request. Prints "Starting chunk migration donation"
      5. Shard 0 rejects Shard 1's attempt to donate a chunk ("Rejecting receive chunk due to conflicting donate chunk in progress"), since it is trying to donate nssA's chunk to Shard 1.
      6. Shard 1 rejects Shard 0's attempt to donate a chunk (same message printed) since it is already trying to donate nssB's chunk to Shard 0.
      7. No chunk migrations occur, because the shards denied each other's migrations.
      8. Repeat from the top.
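
      The live-lock above can be sketched with a toy model (hypothetical code, not MongoDB source; the `Shard` class and method names are illustrative only). Each shard refuses to receive a chunk while it has an outgoing donation in progress, so two simultaneous opposite-direction migrations reject each other and neither makes progress:

```python
# Toy model of the mutual-rejection pattern described in steps 0-8.
# A shard that is donating a chunk rejects any incoming donation, mirroring
# the "Rejecting receive chunk due to conflicting donate chunk in progress" log.

class Shard:
    def __init__(self, name):
        self.name = name
        self.donating = None  # namespace of the chunk currently being donated

    def start_donation(self, nss):
        self.donating = nss

    def try_receive(self, nss):
        # Reject the incoming chunk if we are already donating one ourselves.
        return self.donating is None

def balancer_round():
    shard0, shard1 = Shard("shard0"), Shard("shard1")
    # The balancer schedules opposite-direction migrations in the same round.
    shard0.start_donation("nssA")          # shard0 -> shard1
    shard1.start_donation("nssB")          # shard1 -> shard0
    ok_a = shard1.try_receive("nssA")      # shard1 rejects: it is donating nssB
    ok_b = shard0.try_receive("nssB")      # shard0 rejects: it is donating nssA
    return ok_a, ok_b

print(balancer_round())  # -> (False, False): both migrations fail, repeat
```

      The sketch shows why the round is wasted whenever both shards are picked as donors of chunks destined for each other; a smarter failpoint would avoid scheduling such conflicting pairs in the same round.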

      While the above situation of every chunk migration failing is less likely in our passthroughs (due to more collections, shards, etc.), I noticed that on some runs of multi_stmt_txn_jscore_passthrough_with_migration only half the chunk migrations were actually going through. See my comment for a patch run and more info.

            Assignee:
            silvia.surroca@mongodb.com Silvia Surroca
            Reporter:
            vishnu.kaushik@mongodb.com Vishnu Kaushik
            Votes:
            0
            Watchers:
            5
