The cleanupOrphaned command currently checks the metadata state before actually waiting for range deletions in the shard.
Separately, in config transition suites the config server might be the destination of migrations. If the config server stops being a data-bearing shard, any ongoing migration might fail at any of its steps, for example while trying to commit. When this happens, we momentarily clear the filtering metadata. As a result, the following interleaving could make a cleanupOrphaned command fail in a shard that is the source of a migration in a config transition suite:
- A migration starts with the config server as the destination shard
- Before the migration enters the critical section, the transitionToDedicatedConfigServer command is called
- A cleanupOrphaned command reaches the destination shard of the migration
- Before cleanupOrphaned checks the metadata, the migration tries to commit but fails because the config server is draining, so it clears the shard metadata
- The cleanupOrphaned command fails because it does not find metadata in the node
This could cause tests to fail in config transition suites (for example, the cleanupOrphanedWhileMigrating.js FSM workload). We could adjust the tests to handle this type of error, or make cleanupOrphaned more robust by retrying the refresh/waitForClean in a loop. A reproducer is attached.
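To make the retry option concrete, here is a minimal sketch of the idea in Python. It is a toy model only, not server code: `FakeShard`, `MetadataUnknownError`, `refresh_metadata`, `wait_for_clean`, and `max_attempts` are all hypothetical names invented for illustration, and the real fix would live in the command's C++ implementation.

```python
class MetadataUnknownError(Exception):
    """Hypothetical stand-in for the error seen when filtering metadata is missing."""


class FakeShard:
    """Toy model of a shard whose metadata can be cleared by a failed migration commit."""

    def __init__(self, pending_clears):
        # How many times a concurrent failed commit will clear the metadata
        # between the refresh and the metadata check.
        self.pending_clears = pending_clears
        self.metadata_known = False

    def refresh_metadata(self):
        self.metadata_known = True

    def wait_for_clean(self):
        # Simulate the interleaving: a failed commit clears the metadata
        # right before the command checks it.
        if self.pending_clears > 0:
            self.pending_clears -= 1
            self.metadata_known = False
        if not self.metadata_known:
            raise MetadataUnknownError("filtering metadata was cleared")
        return "clean"


def cleanup_orphaned_with_retry(shard, max_attempts=5):
    """Retry refresh + wait-for-clean until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        shard.refresh_metadata()
        try:
            return shard.wait_for_clean(), attempt
        except MetadataUnknownError:
            if attempt == max_attempts:
                raise  # Give up and surface the error to the caller.
```

In this model, a shard whose metadata is cleared twice succeeds on the third attempt, while a shard whose metadata keeps being cleared still fails once `max_attempts` is exhausted. Bounding the loop is a design choice of the sketch; the real command might instead retry until an operation deadline.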
- related to: SERVER-93222 Deprecate cleanupOrphaned (Open)