Reproduce
Apply the patch on top of commit `324bc4e858959bbd77d111bddb4b166048723312`.
Start `mongod` as a single-member replica set node:

```
rm -rf /data/db/* && ./mongod --replSet rs0 --oplogSize 100
```

Once `mongod` is running, run `mongo restart.js`. This reproduces the crash on every run for me.
Observations
On the first call to `restartStorageEngine`, the server crashes with the following error message and backtrace:
```
memory stomping bug: a word after object at 0x7ffff33e99a0 has been corrupted
(gdb) bt
#0  0x00007ffff6cf2428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff6cf402a in __GI_abort () at abort.c:89
#2  0x00005555597c497a in LogPrintf (severity=-4, pat=0x555559979bca "memory stomping bug: a word after object at %p has been corrupted", ap=0x7ffff7ecce00) at src/third_party/gperftools-2.5/src/base/logging.h:209
#3  0x00005555597cf1e8 in RAW_LOG (lvl=-4, pat=0x555559979bca "memory stomping bug: a word after object at %p has been corrupted") at src/third_party/gperftools-2.5/src/base/logging.h:228
#4  0x00005555597eb6ba in MallocBlock::CheckLocked (this=0x7ffff33e9980, type=-271733872) at src/third_party/gperftools-2.5/src/debugallocation.cc:455
#5  0x00005555597ecb56 in MallocBlock::CheckAndClear (this=0x7ffff33e9980, type=-271733872, given_size=0) at src/third_party/gperftools-2.5/src/debugallocation.cc:410
#6  0x00005555597ecaef in MallocBlock::Deallocate (this=0x7ffff33e9980, type=-271733872, given_size=0) at src/third_party/gperftools-2.5/src/debugallocation.cc:564
#7  0x00005555597e238a in DebugDeallocate (ptr=0x7ffff33e99a0, type=-271733872, given_size=0) at src/third_party/gperftools-2.5/src/debugallocation.cc:1039
#8  0x000055555988ab5e in tc_free (ptr=0x7ffff33e99a0) at src/third_party/gperftools-2.5/src/debugallocation.cc:1229
#9  0x00005555572cfcca in __wt_free_int (session=0x7ffff392d020, p_arg=0x7ffff5e662c0) at src/third_party/wiredtiger/src/os_common/os_alloc.c:311
#10 0x0000555557341de7 in __wt_txn_rollback_to_stable (session=0x7ffff392d020, cfg=0x7ffff7ecd0d0) at src/third_party/wiredtiger/src/txn/txn_rollback_to_stable.c:431
#11 0x000055555729b8c1 in __conn_rollback_to_stable (wt_conn=0x7ffff5e5f020, config=0x0) at src/third_party/wiredtiger/src/conn/conn_api.c:1192
#12 0x0000555557249c9f in mongo::WiredTigerKVEngine::recoverToStableTimestamp (this=0x7ffff5ff3da0) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:1117
#13 0x00005555574492d3 in mongo::KVStorageEngine::recoverToStableTimestamp (this=0x7ffff5d0ffe0, opCtx=0x7ffff32e88e0) at src/mongo/db/storage/kv/kv_storage_engine.cpp:445
#14 0x000055555762bd0c in mongo::recoverToStableTimestamp (opCtx=0x7ffff32e88e0) at src/mongo/db/recover_to_timestamp.cpp:58
#15 0x00005555575c4cd5 in mongo::RestartStorageEngineCmd::run (this=0x55555ade94f8 <mongo::restartCmd>, opCtx=0x7ffff32e88e0, db=..., cmdObj=owned BSONObj 50 bytes @ 0x7ffff33e7ea8 = {...}, result=...) at src/mongo/db/commands/restart_store_engine.cpp:75
#16 0x0000555558f7492b in mongo::BasicCommand::enhancedRun (this=0x55555ade94f8 <mongo::restartCmd>, opCtx=0x7ffff32e88e0, request=..., result=...) at src/mongo/db/commands.cpp:390
#17 0x0000555558f72e2d in mongo::Command::publicRun (this=0x55555ade94f8 <mongo::restartCmd>, opCtx=0x7ffff32e88e0, request=..., result=...) at src/mongo/db/commands.cpp:328
#18 0x000055555751c692 in mongo::(anonymous namespace)::runCommandImpl (opCtx=0x7ffff32e88e0, command=0x55555ade94f8 <mongo::restartCmd>, request=..., replyBuilder=0x7ffff30cfb60, startOperationTime=...) at src/mongo/db/service_entry_point_mongod.cpp:458
#19 0x000055555751a73a in mongo::(anonymous namespace)::execCommandDatabase (opCtx=0x7ffff32e88e0, command=0x55555ade94f8 <mongo::restartCmd>, request=..., replyBuilder=0x7ffff30cfb60) at src/mongo/db/service_entry_point_mongod.cpp:705
#20 0x00005555575181a3 in mongo::(anonymous namespace)::runCommands(mongo::OperationContext*, mongo::Message const&)::$_4::operator()() const (this=0x7ffff7ecf950) at src/mongo/db/service_entry_point_mongod.cpp:822
#21 0x0000555557515307 in mongo::(anonymous namespace)::runCommands (opCtx=0x7ffff32e88e0, message=...) at src/mongo/db/service_entry_point_mongod.cpp:778
#22 0x0000555557513f54 in mongo::ServiceEntryPointMongod::handleRequest (this=0x7ffff5ceca60, opCtx=0x7ffff32e88e0, m=...) at src/mongo/db/service_entry_point_mongod.cpp:1091
#23 0x00005555575309fe in mongo::ServiceStateMachine::_processMessage (this=0x7ffff6020b30, guard=...) at src/mongo/transport/service_state_machine.cpp:317
#24 0x000055555752cfff in mongo::ServiceStateMachine::_runNextInGuard (this=0x7ffff6020b30, guard=...) at src/mongo/transport/service_state_machine.cpp:407
#25 0x000055555752d91f in mongo::ServiceStateMachine::runNext (this=0x7ffff6020b30) at src/mongo/transport/service_state_machine.cpp:373
#26 0x000055555752a262 in mongo::ServiceEntryPointImpl::startSession(std::shared_ptr<mongo::transport::Session>)::$_1::operator()() (this=0x7ffff6191420) at src/mongo/transport/service_entry_point_impl.cpp:89
#27 0x0000555557529ffd in std::_Function_handler<void (), mongo::ServiceEntryPointImpl::startSession(std::shared_ptr<mongo::transport::Session>)::$_1>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:1871
#28 0x0000555557441f1e in std::function<void ()>::operator()() const (this=0x7ffff6190d90) at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:2267
#29 0x00005555594b38ec in mongo::(anonymous namespace)::runFunc (ctx=0x7ffff6190d90) at src/mongo/transport/service_entry_point_utils.cpp:55
#30 0x00007ffff708e6ba in start_thread (arg=0x7ffff7ed1700) at pthread_create.c:333
#31 0x00007ffff6dc43dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```
After 100 calls to `setStableTimestamp`, the server stops advancing the stable timestamp, so there is real work for `rollback_to_stable` to do. I tried to reduce `restart.js` to the minimum set of calls needed to reproduce the error, but unfortunately did not get far. The problem started when I added some writes to a collection in the `local` database, i.e., one where WT table logging is engaged and the table should therefore not be rolled back. Changing `let testLocal = true;` to `false` makes the script succeed. Note that in addition to this user collection in `local`, the oplog also has journaling enabled and otherwise exhibits correct behavior on rollback.
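For reference, the shape of the failing script is roughly the following mongo-shell sketch (illustrative only, not the attached `restart.js`; the collection name and write counts are made up, and `restartStorageEngine` is the test-only command introduced by the patch). It requires a running patched `mongod`, so it is not runnable standalone:

```javascript
// Sketch of the repro shape -- assumes the patched mongod from the
// "Reproduce" section is running and this is executed via the mongo shell.
let testLocal = true; // changing this to false made the script succeed

for (let i = 0; i < 200; i++) {
    // Ordinary writes, so rollback_to_stable has real work to do once the
    // stable timestamp stops advancing.
    db.getSiblingDB("test").coll.insert({_id: i});
    if (testLocal) {
        // Writes to a collection in `local` engage WT table logging;
        // a logged table should not be rolled back.
        db.getSiblingDB("local").coll.insert({_id: i});
    }
}

// Trigger recoverToStableTimestamp / rollback_to_stable via the
// patch's test command; this is where the crash occurs.
assert.commandWorked(db.adminCommand({restartStorageEngine: 1}));
```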