- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Catalog and Routing
The sharding_continuous_config_stepdown.yml test suite has anecdotally been a pain point for the sharding team because it generates many uninteresting, testing-only failures. Of the ~100 tickets spawned by sharding_csrs_continuous_config_stepdown Evergreen task failures over the last 2 years, there have been:
- 33 instances of a testing-only change being made, almost always to exclude the test from the sharding_continuous_config_stepdown.yml test suite.
- 15 instances of a bug where the server behavior was changed and the sharding_csrs_continuous_config_stepdown Evergreen task failure was the only thing which caught it.
- 12 additional instances of a bug where the server behavior was changed but other Evergreen tasks (e.g. concurrency stepdown suites) also caught it.
The 33 SERVER tickets labeled sharding-csrs-stepdown-upkeep represent a drag on the Sharding NYC and EMEA teams whenever they write new jstests/sharding/ tests. This upkeep is too high to justify keeping the sharding_continuous_config_stepdown.yml test suite (without significantly rearchitecting it). On the other hand, the 15 SERVER tickets labeled sharding-csrs-stepdown-only are a clear measure of the value the suite provides. It would be prudent to ensure that equivalent coverage, whether newly written or already added by later work, exists elsewhere to prevent a regression.
The task here is to evaluate whether later sharding projects have already added equivalent coverage and, if not, to create additional SERVER tickets to add such coverage before deleting the sharding_continuous_config_stepdown.yml test suite.
Note: The sharding_continuous_config_stepdown.yml test suite also causes the PeriodicShardedIndexConsistencyChecker thread to run more frequently (it is triggered as part of each new config server primary step-up), which has led to other testing-only failures, mainly because $currentOp filters in tests were not specific enough. These cases are not included in the tickets labeled sharding-csrs-stepdown-upkeep.
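To illustrate the $currentOp point, here is a minimal sketch of a narrower filter, assuming the test tags its own operation with a comment. The helper name buildCurrentOpPipeline, the namespace, and the comment string are hypothetical and are not taken from any existing test. Matching on both the namespace and the comment should make it less likely that a background operation, such as one started by a periodic checker thread, satisfies the filter by accident.

```javascript
// Hypothetical helper (not part of the server's test library): builds an
// aggregation pipeline that lists current operations and keeps only those
// on one namespace that carry a test-chosen comment.
function buildCurrentOpPipeline(ns, comment) {
    return [
        // allUsers: true reports operations from all users.
        {$currentOp: {allUsers: true}},
        // Match on the namespace AND the comment attached to the command,
        // rather than on the namespace or operation type alone.
        {$match: {ns: ns, "command.comment": comment}},
    ];
}

// Illustrative values only.
const pipeline = buildCurrentOpPipeline("test.coll", "my_test_find_comment");

// In a jstest this pipeline might then be run against the admin database,
// for example: db.getSiblingDB("admin").aggregate(pipeline).toArray()
```

This only helps if the operation under test is issued with the same comment value; a filter on the namespace alone can still match unrelated operations on that namespace.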
related to:
- SERVER-53343 Tests which write to ConfigServer collections are not safe to run in the sharding_csrs_continuous_config_stepdown suite (Backlog)
- SERVER-58619 Continuous Stepdown's replSetStepDown Is Not Resilient To External Elections (Backlog)
- SERVER-53094 Tests which use {waitForDelete:true} on moveChunk are not safe to run in the sharding_csrs_continuous_config_stepdown suite (Closed)
- SERVER-60375 Blacklist move_chunk_remove_shard.js from sharding_csrs_continuous_config_stepdown (Closed)
- SERVER-60751 move_chunk_critical_section_non_internal_client_abort.js does not consider config server stepdowns (Closed)
- SERVER-62181 JStests including multiple parallel migrations with failpoints shouldn't be run in the config server stepdown suites (Closed)
- SERVER-62419 recover_multiple_migrations_on_stepup.js fails when executed in config server stepdown suite (Closed)
- SERVER-64234 Remove move_chunk_respects_maxtimems.js from sharding_csrs_continuous_config_stepdown suite (Closed)
- SERVER-67733 ShardingTest::awaitBalancerRound() doesn't work in case of CSRS stepdowns (Closed)
- SERVER-72820 Retry disable and enable of balancer in awaitCollectionBalance (Closed)
- SERVER-57626 Investigate disabled move_chunk tests in the sharding_csrs_continous_config_stepdown suite (Backlog)
- SERVER-59890 Exclude migration_coordinator_shutdown_in_critical_section.js test from the config stepdown suite (Closed)