Triaging and diagnosing hang failures in automated Evergreen tests should be easier. After determining that a test is hung, Evergreen should automatically collect and report data that will help with the initial triage and diagnosis of the problem. Ideally we might collect:
- Which test programs were running at the time of the hang.
- The WiredTiger directory for those tests (I believe we already keep this for all tests).
- Cores of the hung process(es), to help engineers determine why they were hung
- Stack traces from the hung processes, to include in the Evergreen logs to facilitate triage.
Other data would probably be useful as well.
MongoDB's resmoke.py includes a hang analyzer that MongoDB uses for this purpose (buildscripts/resmokelib/commands/hang_analyzer.py). We might be able to use it as the basis for a WT hang analyzer, or simply steal it outright.
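As a rough illustration of the stack-trace and core-dump items above, here is a minimal sketch of what a standalone collector might look like. It is not taken from resmoke's hang analyzer. It assumes a Linux host with gdb and gcore on the PATH, and that something else (for example the Evergreen timeout handler) has already identified the hung PIDs. The function names and the output directory are hypothetical.

```python
import subprocess
from pathlib import Path


def backtrace_cmd(pid):
    """gdb command line that dumps a backtrace of every thread in `pid`."""
    return ["gdb", "--batch", "-p", str(pid), "-ex", "thread apply all bt"]


def core_cmd(pid, out_dir):
    """gcore command line; gcore appends the pid, giving <out_dir>/core.<pid>."""
    return ["gcore", "-o", str(Path(out_dir) / "core"), str(pid)]


def run(cmd, timeout=300):
    """Run a command and return its combined stdout/stderr.

    A missing tool or a timeout is reported as text rather than raised, so
    one failed collection step does not stop the remaining ones.
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True,
                              timeout=timeout)
        return proc.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]} failed: {exc}\n"


def analyze_hang(pids, out_dir, runner=run):
    """Collect stack traces, then a core, for each hung pid.

    Returns the report text so the caller can print it into the task log.
    `runner` is injectable so the flow can be exercised without gdb.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    report = []
    for pid in pids:
        report.append(f"=== stack traces for pid {pid} ===\n")
        report.append(runner(backtrace_cmd(pid)))
        report.append(f"=== core dump for pid {pid} ===\n")
        report.append(runner(core_cmd(pid, out_dir)))
    return "".join(report)
```

A caller would do something like print(analyze_hang([12345], "hang_artifacts")), where the PID and directory are placeholders, and then have Evergreen upload the directory as a task artifact. Stacks are collected before cores so that the logs still get traces if the core dump is slow or fails. Attaching with gdb may also require ptrace permission on the test host.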