- Type: Improvement
- Resolution: Fixed
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
- Server Development Platform
- Fully Compatible
- DAG 2023-10-16, DAG 2023-10-30
- 2
Can the test harness please log an explicit and informative message that it is about to kill the test processes before doing so? The current behavior is not easily recognizable as an intentional kill by the test harness and can easily be mistaken for a crash of the server.
Example from BF-27450:
Writing fatal message message: ExternalRecordStoreTest NamedPipeMultiplePipes4
Writing fatal message message: Got signal: 6 (Abort trap: 6).
mongo::stack_trace_detail::(anonymous namespace)::getStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&)
mongo::printStackTrace()
abruptQuit
_sigtramp
__srefill1
__fread
fread
std::__1::basic_filebuf<char, std::__1::char_traits<char> >::underflow()
std::__1::basic_streambuf<char, std::__1::char_traits<char> >::uflow()
std::__1::basic_streambuf<char, std::__1::char_traits<char> >::xsgetn(char*, long)
std::__1::basic_istream<char, std::__1::char_traits<char> >::read(char*, long)
mongo::NamedPipeInput::doRead(char*, int)
mongo::InputStream<mongo::NamedPipeInput>::readBytes(int, char*)
mongo::MultiBsonStreamCursor::nextFromCurrentStream()
mongo::MultiBsonStreamCursor::next()
mongo::UnitTest_SuiteNameExternalRecordStoreTestTestNameNamedPipeMultiplePipes4::_doTest()
mongo::unittest::Test::run()
I am told that I am not the only person who has been fooled by this. To me it looked as though the test had timed out because the server had crashed, dumped its stack, and therefore stopped making progress. In reality, the server had stopped making progress an hour earlier, and the test harness then sent "kill -6" to abort the test.
A message like the following would help avoid time wasted investigating the wrong thing, and would also make it easier for humans using Parsley to find the failure (currently, timed-out tests do not have the word "timeout" anywhere in the logs):
TEST TIMEOUT FAILURE ABORT: Aborting the test via "kill -6" because it has not made progress for one hour.
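As a rough illustration of the request, the sketch below logs the explanatory line first and only then sends signal 6. It is not the harness's actual code: the function name `abort_stalled_test`, its parameters, and the way the idle time is reported in seconds are all assumptions made for this example, and it assumes a POSIX system where the child is a `subprocess.Popen` object.

```python
import logging
import signal
import subprocess
import sys

# Wording adapted from the message proposed above; the idle time is filled in at call time.
TIMEOUT_MESSAGE = (
    'TEST TIMEOUT FAILURE ABORT: Aborting the test via "kill -6" '
    "because it has not made progress for {idle_seconds} seconds."
)


def abort_stalled_test(proc, idle_seconds, logger):
    """Hypothetical helper: log why the test is being killed, then abort it.

    The log line is emitted before the signal is sent, so it appears ahead
    of the stack trace that the aborted process prints in response.
    """
    logger.error(TIMEOUT_MESSAGE.format(idle_seconds=idle_seconds))
    proc.send_signal(signal.SIGABRT)  # same effect as `kill -6 <pid>`
    return proc.wait()  # negative signal number on POSIX
```

A caller that has detected a stall could invoke it as `abort_stalled_test(proc, 3600, logging.getLogger("harness"))`. Because the message contains the word "TIMEOUT", a log search for that term would then find the failure.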