  1. Core Server
  2. SERVER-93193

cleanupOrphaned might fail with UnknownError error on config transition suites

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.1.0-rc0
    • Affects Version/s: 9.0 Required, 8.0.0-rc17
    • Component/s: Sharding
    • None
    • Cluster Scalability
    • Fully Compatible
    • ALL
    • v8.0
    • Steps to reproduce:

      1. Run the attached reproducer using the sharding suite with base version c3475ffa8
    • Cluster Scalability 2024-07-22, Cluster Scalability 2024-08-19
    • 0

      The cleanupOrphaned command currently checks the metadata state before actually waiting for range deletions in the shard.

      Separately, in config transition suites the config server might be the destination of migrations. If the config server stops being a data-bearing shard, any ongoing migration might fail at any of its steps, for example while trying to commit. When this happens, we momentarily clear the filtering metadata, so the following interleaving could make a cleanupOrphaned command fail on a shard that is the source of a migration in a config transition suite:

      1. A migration starts with the config server as its destination shard
      2. Before the migration enters the critical section, the transitionToDedicatedConfigServer command is called
      3. A cleanupOrphaned command reaches the source shard of the migration
      4. Before cleanupOrphaned checks the metadata, the migration tries to commit but fails because the config server is draining, so it clears the shard's filtering metadata
      5. The cleanupOrphaned command fails because it does not find metadata on the node
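
The interleaving above can be modelled with a small, self-contained Python sketch. It is a toy model, not server code: `FakeShard`, `cleanup_orphaned`, the `before_metadata_check` hook, and the error text are all hypothetical names chosen for illustration. The hook stands in for the concurrent migration that fails to commit and clears the filtering metadata just before the check.

```python
class FakeShard:
    """Hypothetical stand-in for a shard's in-memory filtering metadata."""

    def __init__(self):
        self.filtering_metadata = {"test.coll": "known"}

    def clear_filtering_metadata(self):
        # Models the momentary clear that follows a failed migration commit.
        self.filtering_metadata = None


def cleanup_orphaned(shard, ns, before_metadata_check=lambda: None):
    """Toy cleanupOrphaned: check the metadata first, then wait for range deletions."""
    before_metadata_check()  # hook modelling a concurrent migration (steps 1-4)
    if shard.filtering_metadata is None or ns not in shard.filtering_metadata:
        # Step 5: the command finds no metadata and gives up (assumed error text).
        raise RuntimeError("UnknownError: filtering metadata not found")
    return "ok"  # the wait for range deletions would happen here


shard = FakeShard()

# Without a concurrent migration the command succeeds.
quiet_result = cleanup_orphaned(shard, "test.coll")

# With the metadata cleared right before the check, the command fails.
try:
    cleanup_orphaned(
        shard, "test.coll", before_metadata_check=shard.clear_filtering_metadata
    )
    raced_result = "ok"
except RuntimeError as exc:
    raced_result = str(exc)
```

In this model the failure depends only on timing: the same call succeeds or fails depending on whether the clear lands before the metadata check.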

      This could cause tests to fail on config transition suites (for example, the cleanupOrphanedWhileMigrating.js FSM workload). We could adjust the tests to handle this type of error, or make cleanupOrphaned more robust by retrying the refresh/waitForClean in a loop. A reproducer is attached.
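
As a rough illustration of the second option, the following Python sketch retries a refresh followed by a wait-for-clean step until the metadata survives long enough for the wait to succeed. It is only a sketch under stated assumptions: `MetadataUnknown`, `cleanup_with_retry`, `FlakyShard`, and the `max_attempts` bound are hypothetical and do not reflect the actual fix that landed in the server.

```python
class MetadataUnknown(Exception):
    """Hypothetical error for 'filtering metadata not found'."""


def cleanup_with_retry(refresh, wait_for_clean, max_attempts=5):
    """Retry refresh + wait_for_clean; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        refresh()  # re-establish the filtering metadata
        try:
            wait_for_clean()  # may fail if the metadata was cleared again
            return attempt
        except MetadataUnknown:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


class FlakyShard:
    """Toy shard whose metadata is cleared a fixed number of times."""

    def __init__(self, clears_before_success):
        self.pending_clears = clears_before_success
        self.metadata_known = False

    def refresh(self):
        self.metadata_known = True

    def wait_for_clean(self):
        if self.pending_clears > 0:
            # Models a failed migration commit clearing the metadata.
            self.pending_clears -= 1
            self.metadata_known = False
        if not self.metadata_known:
            raise MetadataUnknown("filtering metadata not found")


shard = FlakyShard(clears_before_success=2)
attempts = cleanup_with_retry(shard.refresh, shard.wait_for_clean)
```

Bounding the loop keeps a persistently failing shard from hanging the command; whether a real implementation should bound the retries, and by how much, is a design choice this sketch does not settle.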

        1. BF-34382.repro
          4 kB
          Marcos José Grillo Ramirez

            Assignee:
            janna.golden@mongodb.com Janna Golden
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            3
