-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Testing Infrastructure
-
None
-
Fully Compatible
-
v3.6
-
TIG 2018-05-07, TIG 2018-05-21, TIG 2018-06-04, TIG 2018-06-18
-
12
-
5
The changes from SERVER-19630 make FSM workloads run as individual test cases in the concurrency_sharded_causal_consistency{,_and_balancer}.yml and concurrency_sharded_replication{,_and_balancer}.yml test suites. The concurrency_sharded_with_stepdowns{,_and_balancer}.yml test suites weren't migrated to the new style because parts of setting up the environment for the FSM workloads aren't prepared to have the primary of the CSRS or of a replica set shard stepped down. Rather than trying to get all of the retry logic correct (e.g. by handling the ManualInterventionRequired error when attempting to shard the collection), we should instead delay resmoke.py's StepdownThread so that it only starts stepping down nodes after the FSM workload has started.
The comments added to the runWorkloads() function in the diff below sketch how the _StepdownThread class and resmoke_runner.js would interact via the filesystem.
diff --git a/jstests/concurrency/fsm_libs/resmoke_runner.js b/jstests/concurrency/fsm_libs/resmoke_runner.js
index d94fd4e31c..af0afca2bb 100644
--- a/jstests/concurrency/fsm_libs/resmoke_runner.js
+++ b/jstests/concurrency/fsm_libs/resmoke_runner.js
@@ -104,6 +104,15 @@
                 cleanup.push(workload);
             });
 
+            // After the $config.setup() function has been called, it is safe for the stepdown
+            // thread to start running. The main thread won't attempt to interact with the cluster
+            // until all of the spawned worker threads have finished.
+            //
+            // TODO: Call writeFile('./stepdown_permitted', '') function to indicate that the
+            // stepdown thread can run. It is unnecessary for the stepdown thread to indicate that
+            // it is going to start running because it will eventually after the worker threads have
+            // started.
+
             // Since the worker threads may be running with causal consistency enabled, we set the
             // initial clusterTime and initial operationTime for the sessions they'll create so that
             // they are guaranteed to observe the effects of the workload's $config.setup() function
@@ -128,17 +137,34 @@
             }
 
             try {
-                // Start this set of worker threads.
-                threadMgr.spawnAll(cluster, executionOptions);
-                // Allow 20% of the threads to fail. This allows the workloads to run on
-                // underpowered test hosts.
-                threadMgr.checkFailed(0.2);
+                try {
+                    // Start this set of worker threads.
+                    threadMgr.spawnAll(cluster, executionOptions);
+                    // Allow 20% of the threads to fail. This allows the workloads to run on
+                    // underpowered test hosts.
+                    threadMgr.checkFailed(0.2);
+                } finally {
+                    // Threads must be joined before destruction, so do this even in the presence of
+                    // exceptions.
+                    errors.push(...threadMgr.joinAll().map(
+                        e => new WorkloadFailure(
+                            e.err, e.stack, e.tid, 'Foreground ' + e.workloads.join(' '))));
+                }
             } finally {
-                // Threads must be joined before destruction, so do this even in the presence of
-                // exceptions.
-                errors.push(...threadMgr.joinAll().map(
-                    e => new WorkloadFailure(
-                        e.err, e.stack, e.tid, 'Foreground ' + e.workloads.join(' '))));
+                // Until we are guaranteed that the stepdown thread isn't running, it isn't safe for
+                // the $config.teardown() function to be called. We should signal to resmoke.py that
+                // the stepdown thread should stop running and wait for the stepdown thread to
+                // signal that it has stopped.
+                //
+                // TODO: Call removeFile('./stepdown_permitted') so the next time the stepdown
+                // thread checks to see if it should keep running that it instead stops stepping
+                // down the cluster and creates a file named "./stepdown_off".
+                //
+                // TODO: Call the ls() function inside of an assert.soon() / assert.soonNoExcept()
+                // and wait for the "./stepdown_off" file to be created. assert.soonNoExcept()
+                // should probably be used so that an I/O-related error from attempting to list the
+                // contents of the directory while the file is being created doesn't lead to a
+                // JavaScript exception that causes the test to fail.
             }
         } finally {
             // Call each workload's teardown function. After all teardowns have completed check if
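The TODO comments in the diff describe a two-file handshake, and the stepdown thread's side of it could look roughly like the Python sketch below. This is an illustration under assumptions and not resmoke.py's actual implementation: StepdownGate, run_stepdown_loop, simulate_runner, and the polling interval are hypothetical names and values, and only the "stepdown_permitted" and "stepdown_off" file names come from the comments above. The simulate_runner() function stands in for what resmoke_runner.js would do with writeFile(), removeFile(), and assert.soon().

```python
import os
import tempfile
import threading
import time


class StepdownGate:
    """Hypothetical helper wrapping the two marker files used for the handshake."""

    def __init__(self, directory, poll_interval=0.05):
        self.permitted_path = os.path.join(directory, "stepdown_permitted")
        self.off_path = os.path.join(directory, "stepdown_off")
        self.poll_interval = poll_interval

    def is_permitted(self):
        return os.path.exists(self.permitted_path)

    def signal_off(self):
        # Creating the file tells the test runner that no more stepdowns will happen.
        with open(self.off_path, "w"):
            pass


def run_stepdown_loop(gate, do_stepdown, stop_event):
    """Stepdown-thread side: wait for permission, step down, then acknowledge the stop."""
    # Do nothing until the runner has finished $config.setup() and created the file.
    while not gate.is_permitted():
        if stop_event.is_set():
            return 0
        time.sleep(gate.poll_interval)

    count = 0
    while gate.is_permitted() and not stop_event.is_set():
        do_stepdown()
        count += 1
        time.sleep(gate.poll_interval)

    # The permitted file is gone (or we were told to stop): acknowledge before exiting.
    gate.signal_off()
    return count


def simulate_runner(directory, workload_seconds=0.5, timeout=5.0):
    """Test-runner side of the handshake; returns how many stepdowns were observed."""
    gate = StepdownGate(directory)
    stepdowns = []
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_stepdown_loop,
        args=(gate, lambda: stepdowns.append(time.monotonic()), stop_event))
    thread.start()
    try:
        # $config.setup() would run here, before any stepdown is allowed.
        with open(gate.permitted_path, "w"):
            pass
        time.sleep(workload_seconds)  # stands in for the worker threads running
        os.remove(gate.permitted_path)

        # Equivalent of assert.soon(): wait for the acknowledgement before teardown.
        deadline = time.monotonic() + timeout
        while not os.path.exists(gate.off_path):
            if time.monotonic() > deadline:
                raise TimeoutError("stepdown thread never created stepdown_off")
            time.sleep(gate.poll_interval)
    finally:
        stop_event.set()
        thread.join()
    return len(stepdowns)


if __name__ == "__main__":
    print(simulate_runner(tempfile.mkdtemp()))
```

Because the stepdown loop only creates "stepdown_off" after its last stepdown has returned, the runner can treat the appearance of that file as the guarantee it needs before calling $config.teardown().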
- causes
-
SERVER-36169 Resmoke: bare raise outside except in the stepdown hook
- Closed
- depends on
-
SERVER-35051 Resmoke should stop the balancer before shutting down sharded clusters
- Closed
- related to
-
SERVER-41096 ContinuousStepdown thread and resmoke runner do not synchronize properly on the "stepdown permitted file" and "stepping down file"
- Closed