WiredTiger / WT-8351

Improve evergreen failure reporting when format exits early due to config bug

    • Type: Improvement
    • Resolution: Gone away
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • 1
    • Storage Engines - 2022-09-05

      Our automated Evergreen tests did not capture sufficient information to easily identify the issue in WT-8349.

      In that ticket, format failed because of an invalid value in its CONFIG file.  Running the config manually using format.sh produces a useful error message and a core file:

       

      $ ./format.sh -c CONFIG.49
      format.sh: starting job in /data/mci/artifacts/test/format/RUNDIR.1 (Thu Nov  4 16:35:55 UTC 2021)
      format.sh:  ./t -c /data/mci/artifacts/test/format/CONFIG.49 -h /data/mci/artifacts/test/format/RUNDIR.1   quiet=1
      ./format.sh: line 584:  7887 Aborted                 (core dumped) nohup setsid $cmd > $log 2>&1
      ...
      format.sh: job in /data/mci/artifacts/test/format/RUNDIR.1 failed
          t: process 7887 running
          t: FAILED: t: cache=1008483: value outside min/max values of 1-102400: Invalid argument
          
          t: run FAILED
          t: process aborting
          WiredTiger Error: aborting WiredTiger library
      format.sh: /data/mci/artifacts/test/format/RUNDIR.1 does not exist, format.sh unable to continue
      ./format.sh: line 358:  7897 Killed                  nohup setsid $cmd > $log 2>&1
      format.sh: 0 successful jobs, 1 failed jobs
      

      In the above, the output of the format log file includes the cause of the problem:

       t: FAILED: t: cache=1008483: value outside min/max values of 1-102400: Invalid argument

       

      The logs from the Evergreen task where this originally occurred are less helpful:

      format.sh: job in /data/mci/67ee547007fad22ac2e27af848423d7c/wiredtiger/test/format/RUNDIR.49 exited with status 127 for an unknown reason
      format.sh: reporting job in /data/mci/67ee547007fad22ac2e27af848423d7c/wiredtiger/test/format/RUNDIR.49 as a failure
      format.sh: job in /data/mci/67ee547007fad22ac2e27af848423d7c/wiredtiger/test/format/RUNDIR.49 failed
          t: process 45945 running
          Killed

      The above doesn't provide any information about why the failure happened, and the artifacts from this run don't include a core file, so I had to run the CONFIG manually to identify the problem.  This wasn't hard, but it would have saved time if the useful error message had been captured and presented in the Evergreen log.

      I haven't tried to identify the exact issue, but the difference between the Evergreen run and my local run appears to be what was written to the RUNDIR.xx.log file and then dumped by format.sh.  
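
      As a rough illustration of the kind of reporting that would have helped, here is a minimal sketch of a helper that prints a failed job's exit status, any FAILED lines from its log, and the tail of the log. The function name report_failure and its arguments are hypothetical and are not part of format.sh; it also assumes the error message actually reaches the log file, which is the part that appears to differ on Evergreen.

```shell
#!/bin/sh
# Hypothetical helper, not part of format.sh: summarize a failed job's log.
#   $1 = path to the job's log file (assumed to be the RUNDIR.xx.log file)
#   $2 = exit status of the job
report_failure() {
    log=$1
    status=$2

    echo "job exited with status $status"

    if [ ! -s "$log" ]; then
        # An empty or missing log is itself worth reporting explicitly.
        echo "log file $log is missing or empty"
        return 0
    fi

    # Surface explicit failure messages first so they are easy to find.
    grep 'FAILED' "$log" | sed 's/^/    /'

    # Then show the end of the log for context.
    echo "last lines of $log:"
    tail -n 20 "$log" | sed 's/^/    /'
}
```

      Reporting a missing or empty log as its own case matters here, since a job that is killed before it writes anything would otherwise produce no diagnostic at all.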

      Note that my manual run was on a spawn host machine from the Evergreen test system.

            Assignee:
            mick.graham@mongodb.com Mick Graham
            Reporter:
            keith.smith@mongodb.com Keith Smith