Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.0.0-rc3, 4.1.1
Affects Version/s: None
Component/s: Testing Infrastructure
Labels: None
Backwards Compatibility: Fully Compatible
Backport Requested: v3.6
Sprint: TIG 2018-05-07, TIG 2018-05-21, TIG 2018-06-04, TIG 2018-06-18
Linked BF Score: 12
Story Points: 5
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

The changes from ~~SERVER-19630~~ make FSM workloads run as individual test cases in the concurrency_sharded_causal_consistency{,_and_balancer}.yml and concurrency_sharded_replication{,_and_balancer}.yml test suites. The concurrency_sharded_with_stepdowns{,_and_balancer}.yml test suites weren't migrated to the new style because parts of setting up the environment for the FSM workloads aren't prepared to have the primary of the CSRS or of a replica set shard stepped down. Rather than trying to get all of the retry logic correct (e.g. by handling ManualInterventionRequired when attempting to shard the collection), we should instead delay resmoke.py's StepdownThread so that it only starts stepping down nodes after the FSM workload has started.

The comments added to the runWorkloads() function in the diff below sketch how the _StepdownThread class and resmoke_runner.js would coordinate through the filesystem.

diff --git a/jstests/concurrency/fsm_libs/resmoke_runner.js b/jstests/concurrency/fsm_libs/resmoke_runner.js
index d94fd4e31c..af0afca2bb 100644
--- a/jstests/concurrency/fsm_libs/resmoke_runner.js
+++ b/jstests/concurrency/fsm_libs/resmoke_runner.js
@@ -104,6 +104,15 @@
                 cleanup.push(workload);
             });

+            // After the $config.setup() function has been called, it is safe for the stepdown
+            // thread to start running. The main thread won't attempt to interact with the cluster
+            // until all of the spawned worker threads have finished.
+            //
+            // TODO: Call writeFile('./stepdown_permitted', '') function to indicate that the
+            // stepdown thread can run. It is unnecessary for the stepdown thread to indicate that
+            // it is going to start running because it will eventually after the worker threads have
+            // started.
+
             // Since the worker threads may be running with causal consistency enabled, we set the
             // initial clusterTime and initial operationTime for the sessions they'll create so that
             // they are guaranteed to observe the effects of the workload's $config.setup() function
@@ -128,17 +137,34 @@
             }

             try {
-                // Start this set of worker threads.
-                threadMgr.spawnAll(cluster, executionOptions);
-                // Allow 20% of the threads to fail. This allows the workloads to run on
-                // underpowered test hosts.
-                threadMgr.checkFailed(0.2);
+                try {
+                    // Start this set of worker threads.
+                    threadMgr.spawnAll(cluster, executionOptions);
+                    // Allow 20% of the threads to fail. This allows the workloads to run on
+                    // underpowered test hosts.
+                    threadMgr.checkFailed(0.2);
+                } finally {
+                    // Threads must be joined before destruction, so do this even in the presence of
+                    // exceptions.
+                    errors.push(...threadMgr.joinAll().map(
+                        e => new WorkloadFailure(
+                            e.err, e.stack, e.tid, 'Foreground ' + e.workloads.join(' '))));
+                }
             } finally {
-                // Threads must be joined before destruction, so do this even in the presence of
-                // exceptions.
-                errors.push(...threadMgr.joinAll().map(
-                    e => new WorkloadFailure(
-                        e.err, e.stack, e.tid, 'Foreground ' + e.workloads.join(' '))));
+                // Until we are guaranteed that the stepdown thread isn't running, it isn't safe for
+                // the $config.teardown() function to be called. We should signal to resmoke.py that
+                // the stepdown thread should stop running and wait for the stepdown thread to
+                // signal that it has stopped.
+                //
+                // TODO: Call removeFile('./stepdown_permitted') so the next time the stepdown
+                // thread checks to see if it should keep running that it instead stops stepping
+                // down the cluster and creates a file named "./stepdown_off".
+                //
+                // TODO: Call the ls() function inside of an assert.soon() / assert.soonNoExcept()
+                // and wait for the "./stepdown_off" file to be created. assert.soonNoExcept()
+                // should probably be used so that an I/O-related error from attempting to list the
+                // contents of the directory while the file is being created doesn't lead to a
+                // JavaScript exception that causes the test to fail.
             }
         } finally {
             // Call each workload's teardown function. After all teardowns have completed check if
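
The file-based handshake described in the TODO comments above can be illustrated with a small Python sketch. This is only a sketch under stated assumptions, not resmoke.py's actual implementation: StepdownThreadSketch and run_workload_with_stepdowns are hypothetical names, the real stepdown is replaced by a counter, and the runner side (which in the proposal lives in resmoke_runner.js) is simulated in Python so the example is self-contained. Only the two file names, "stepdown_permitted" and "stepdown_off", come from the diff.

```python
import os
import threading
import time


class StepdownThreadSketch(threading.Thread):
    """Hypothetical stand-in for resmoke.py's _StepdownThread (not the real class)."""

    def __init__(self, directory, poll_secs=0.01):
        super().__init__(daemon=True)
        self.permitted_path = os.path.join(directory, "stepdown_permitted")
        self.off_path = os.path.join(directory, "stepdown_off")
        self.poll_secs = poll_secs
        self.stepdowns = 0  # counts simulated stepdowns
        self._seen_permitted = False

    def run(self):
        while True:
            if os.path.exists(self.permitted_path):
                # The runner has finished $config.setup(); stepdowns are allowed.
                self._seen_permitted = True
                self.stepdowns += 1  # placeholder for stepping down a primary
            elif self._seen_permitted:
                # The runner removed the file: stop and acknowledge by creating
                # "stepdown_off" so that teardown can proceed safely.
                with open(self.off_path, "w"):
                    pass
                return
            time.sleep(self.poll_secs)


def run_workload_with_stepdowns(directory, workload, timeout_secs=5.0):
    """Simulates the runner side of the handshake (assumed, not the real code)."""
    permitted_path = os.path.join(directory, "stepdown_permitted")
    off_path = os.path.join(directory, "stepdown_off")

    # Setup is assumed to be done; signal that stepdowns may begin.
    with open(permitted_path, "w"):
        pass
    try:
        workload()
    finally:
        # Ask the stepdown thread to stop, then wait for its acknowledgement
        # before any teardown work would run.
        os.remove(permitted_path)
        deadline = time.monotonic() + timeout_secs
        while not os.path.exists(off_path):
            if time.monotonic() > deadline:
                raise TimeoutError("stepdown thread did not create stepdown_off")
            time.sleep(0.01)
```

To try it, start a StepdownThreadSketch on a temporary directory and pass the same directory and a workload callable to run_workload_with_stepdowns. Note one limitation of this simplified version: the thread only acknowledges after it has observed "stepdown_permitted" at least once, so the workload must outlast one polling interval or the runner will time out.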

causes

SERVER-36169 Resmoke: bare raise outside except in the stepdown hook

Closed

depends on

SERVER-35051 Resmoke should stop the balancer before shutting down sharded clusters

Closed

related to

SERVER-41096 ContinuousStepdown thread and resmoke runner do not synchronize properly on the "stepdown permitted file" and "stepping down file"

Closed

Assignee: Jonathan Abrahams (Inactive)
Reporter: Max Hirschhorn
Participants: Githook User, Jonathan Abrahams, Max Hirschhorn
Votes: 0
Watchers: 4

Created: Apr 18 2018 09:42:35 PM UTC
Updated: Oct 29 2023 10:32:35 PM UTC
Resolved: May 31 2018 05:07:22 PM UTC
Confidence Status Last Update: 02/May/18 7:34 PM
