  1. Core Server
  2. SERVER-94502

Nesting shard role into router role breaks collection metadata recovery

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.1.0-rc0, 8.0.4
    • Affects Version/s: 7.3.0-rc0, 8.0.0-rc0
    • Component/s: None
    • None
    • Catalog and Routing
    • Fully Compatible
    • ALL
    • v8.0

      Run the attached repro.js test in the "sharding" suite. I tested it on r8.1.0-alpha-3304-g85bc2e2ee02.

      For older versions, use repro_old.js instead.

    • CAR Team 2024-09-16, CAR Team 2024-09-30
    • 0

      A shard role nested inside a router role does not handle the StaleConfig exception correctly, which breaks the shard versioning protocol.

      In particular, the StaleConfig exception is caught and handled by the router role, which invalidates and refreshes the catalog cache and then retries the operation without updating the Database/Collection Sharding State (DSS/CSS).

      Manifestation

      In most cases, this simply adds latency to the query/command, because the router role retries 10 times before bubbling up the error. Only then can the ServiceEntryPoint on the shard finally update the DSS/CSS.
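The latency-only case can be illustrated with a toy model. This is a minimal sketch, not server code: `ToyShard`, `router_role`, `service_entry_point`, and `AUTHORITATIVE_VERSION` are hypothetical names, and the model assumes that "retries 10 times" means 10 attempts in total and that the shard's sharding state starts out needing recovery.

```python
class StaleConfig(Exception):
    """Toy stand-in for the server's StaleConfig error."""


class ToyShard:
    """Hypothetical shard whose sharding state (CSS) may be missing."""

    def __init__(self):
        self.css_version = None  # None models a CSS that still needs recovery
        self.attempts = 0        # how many times the shard role ran

    def shard_role(self, received_version):
        self.attempts += 1
        if self.css_version != received_version:
            raise StaleConfig()
        return "ok"


AUTHORITATIVE_VERSION = 2  # hypothetical version held by the config server


def router_role(shard, cache, max_retries=10):
    """Nested router loop: refreshes only the catalog cache between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return shard.shard_role(cache["version"])
        except StaleConfig:
            cache["version"] = AUTHORITATIVE_VERSION  # the CSS is not touched here
            if attempt == max_retries:
                raise  # bubble up once the attempts are exhausted


def service_entry_point(shard, cache):
    """Outer layer: the only place in this model that recovers the CSS."""
    try:
        return router_role(shard, cache)
    except StaleConfig:
        shard.css_version = AUTHORITATIVE_VERSION  # recover the sharding state
        return router_role(shard, cache)
```

In this model the operation succeeds only after the nested loop has wasted all of its attempts, which is the extra latency described above.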

      Inside a transaction the situation is worse: the transaction may never succeed, even if the driver keeps retrying it.
      Because it is executing a transaction, the shard needs to acquire locks on the collection before entering the router role. As a result, after 10 retries the router role bubbles up ShardCannotRefreshDueToLocksHeld instead of StaleConfig. When this error reaches the service entry point, only the catalog cache is refreshed, not the DSS/CSS.
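The transaction case can be sketched the same way. Again this is only a toy model under stated assumptions: `ToyTxnShard` and `driver_retries` are hypothetical names, the retry count of 10 is taken from the description above, and the exception classes are plain Python stand-ins for the server errors of the same name.

```python
class StaleConfig(Exception):
    """Toy stand-in for the server's StaleConfig error."""


class ShardCannotRefreshDueToLocksHeld(Exception):
    """Toy stand-in for the error surfaced when locks are already held."""


class ToyTxnShard:
    """Hypothetical shard executing a transaction statement with locks held."""

    def __init__(self):
        self.css_version = None   # sharding state still needs recovery
        self.cache_refreshes = 0  # catalog cache refreshes performed

    def shard_role(self, received_version):
        if self.css_version != received_version:
            raise StaleConfig()
        return "ok"

    def router_role_with_locks_held(self, version, max_retries=10):
        for _ in range(max_retries):
            try:
                return self.shard_role(version)
            except StaleConfig:
                self.cache_refreshes += 1  # catalog cache only, CSS untouched
        # Locks were taken before entering the router role, so a different
        # error is what reaches the caller.
        raise ShardCannotRefreshDueToLocksHeld()

    def service_entry_point(self, version):
        try:
            return self.router_role_with_locks_held(version)
        except ShardCannotRefreshDueToLocksHeld:
            self.cache_refreshes += 1  # refreshes the cache, not css_version
            raise


def driver_retries(shard, version, driver_attempts):
    """Driver-side retry loop; returns how many attempts failed."""
    failures = 0
    for _ in range(driver_attempts):
        try:
            shard.service_entry_point(version)
            break
        except ShardCannotRefreshDueToLocksHeld:
            failures += 1
    return failures
```

Because nothing in this model ever sets `css_version`, every driver retry fails in the same way, which mirrors the "never succeeds" outcome described in the ticket.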

      This is one example of where we nest a shard role inside a router role, so transactions over views are definitely affected by this problem.

        1. repro_old.js
          1 kB
        2. repro.js
          1 kB
        3. repro-lookup.js
          1 kB

            Assignee:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Reporter:
            tommaso.tocci@mongodb.com Tommaso Tocci
            Votes:
            0
            Watchers:
            13

              Created:
              Updated:
              Resolved: