There is a bug in the script that executes checkpoint-stress-test from our evergreen.yml.
Here's the relevant code:
for i in $(seq ${times|1}); do
  for t in $(seq ${no_of_procs|1}); do
    eval nohup $CMD > nohup.out.$i.$t 2>&1 &
  done
  for t in $(seq ${no_of_procs|1}); do
    ret=0
    wait -n || ret=$?
    if [ $ret -ne 0 ]; then
      # Skip the below lines from nohup output file because they are very verbose and
      # print only the errors to evergreen log file.
      grep -v "Finished verifying" nohup.out.* | grep -v "Finished a checkpoint" | grep -v "thread starting"
    fi
    exit $ret    <<============
  done
done
The script first loops to start multiple instances of the test. It then loops again, waiting for the same number of instances to terminate. The problem is the exit command (marked with <<============). It sits outside the if ... fi block, so it runs on the first iteration of the second loop, regardless of whether the terminated instance succeeded or failed.
So if we run ten concurrent tests, the script exits as soon as the first of the ten completes, and therefore never checks for or reports errors from the other nine.
A simple fix would be to move the exit into the preceding if ... fi block, so that we only exit early if one of the tests fails. A better fix would be to note the failure but allow the other concurrent tests to finish, so that any other failures are reported as well.
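A minimal sketch of the second option, to make the idea concrete. It is not the actual patch. The function name run_stress and its three arguments are hypothetical stand-ins for the ${times|1} and ${no_of_procs|1} evergreen expansions and for $CMD. It also swaps wait -n for waiting on each recorded PID, which is an assumption made here so the sketch does not depend on a recent bash.

```shell
# Hypothetical stand-in for the evergreen step: times, no_of_procs and cmd
# replace ${times|1}, ${no_of_procs|1} and $CMD from evergreen.yml.
run_stress() {
  times=$1; no_of_procs=$2; cmd=$3
  failed=0
  for i in $(seq "$times"); do
    pids=""
    # Start all concurrent instances and remember their PIDs.
    for t in $(seq "$no_of_procs"); do
      eval nohup "$cmd" > "nohup.out.$i.$t" 2>&1 &
      pids="$pids $!"
    done
    # Wait for every instance, even if an earlier one failed.
    for pid in $pids; do
      ret=0
      wait "$pid" || ret=$?
      # Remember the first failure, but keep waiting for the others.
      if [ "$ret" -ne 0 ] && [ "$failed" -eq 0 ]; then
        failed=$ret
      fi
    done
  done
  if [ "$failed" -ne 0 ]; then
    # Same filter as the original script: drop the very verbose lines and
    # print the rest of the output to the log.
    grep -v "Finished verifying" nohup.out.* | grep -v "Finished a checkpoint" | grep -v "thread starting"
  fi
  return "$failed"
}
```

In the real script the function body would run inline and end with exit $failed instead of return. The status reported is that of the first instance found to have failed. The output of all instances is printed once at the end, so failures from several instances show up in the same log.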