- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.4
- Execution Team 2020-05-04, Execution Team 2020-05-18
Found a deadlock while testing a behavior in drain mode: the stepdown command hangs. The log and wait-for graph are attached. The cycle is:
- The OplogApplier-0 thread is waiting for all writers to finish while holding the PBWM (ParallelBatchWriterMode) lock.
- The writers are waiting on the RSTL (ReplicationStateTransitionLock), which is held by the stepdown thread.
- The stepdown thread is waiting on the PBWM lock in order to clear the oplogTruncateAfterPoint, while holding the RSTL in X mode (a minimal sketch of this cycle follows the list).
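To make the cycle concrete, here is a minimal, hypothetical C++ sketch of the same three-way wait. It is not MongoDB code: pbwm and rstl are plain std::timed_mutex stand-ins for the ParallelBatchWriterMode lock and the ReplicationStateTransitionLock, and the thread bodies only mimic the shape of the wait-for graph described above.

// Hypothetical stand-alone sketch of the reported three-way wait; not MongoDB code.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::timed_mutex pbwm;  // stand-in for the PBWM lock, held by the applier for the batch
std::timed_mutex rstl;  // stand-in for the RSTL, taken in X mode by stepdown

int main() {
    using namespace std::chrono_literals;

    std::thread stepdown([] {
        std::lock_guard<std::timed_mutex> rstlLock(rstl);  // RSTL X
        std::this_thread::sleep_for(100ms);
        // Analogue of clearing the oplogTruncateAfterPoint: the write to the
        // 'local' db needs a global/PBWM lock that the applier still holds.
        if (!pbwm.try_lock_for(1s))
            std::cout << "stepdown: blocked on PBWM while holding RSTL X\n";
        else
            pbwm.unlock();
    });

    std::thread applier([] {
        std::lock_guard<std::timed_mutex> pbwmLock(pbwm);  // PBWM for the batch
        std::this_thread::sleep_for(100ms);
        std::thread writer([] {
            // Each writer needs the RSTL, which conflicts with stepdown's X lock.
            if (!rstl.try_lock_for(1s))
                std::cout << "writer: blocked on RSTL held by stepdown\n";
            else
                rstl.unlock();
        });
        writer.join();  // the applier cannot release PBWM until its writers finish
    });

    stepdown.join();
    applier.join();
    return 0;
}

With the real locks there is no timeout, so the three waiters never make progress; the try_lock_for calls exist only so the sketch terminates and prints two of the blocked edges, while the applier's wait on its writer (the join) is the third edge of the cycle.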
Here's the stepdown thread's stacktrace.
Thread 66: "conn20" (Thread 0x7f1731b4d700 (LWP 29264)) #0 0x00007f176410d403 in poll () from /lib64/libc.so.6 #1 0x00007f1769a477e7 in mongo::transport::TransportLayerASIO::BatonASIO::run (this=0x7f1773a25a90, clkSource=0x7f176d570a90) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/stl_vector.h:805 #2 0x00007f1769a44add in mongo::transport::TransportLayerASIO::BatonASIO::run_until (this=this@entry=0x7f1773a25a90, clkSource=0x7f176d570a90, deadline=...) at src/mongo/transport/baton_asio_linux.h:247 #3 0x00007f1769f4da21 in mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}::operator()() const (this=<optimized out>) at src/mongo/util/lockable_adapter.h:57 #4 mongo::stdx::condition_variable::_runWithNotifyable<mongo::Waitable::wait_until<mongo::BasicLockableAdapter>(mongo::Waitable*, mongo::ClockSource*, mongo::stdx::condition_variable&, mongo::BasicLockableAdapter&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}>(mongo::Notifyable&, mongo::BasicLockableAdapter&&) (cb=..., notifyable=..., this=0x7f17739e5178) at src/mongo/stdx/condition_variable.h:162 #5 mongo::Waitable::wait_until<mongo::BasicLockableAdapter> (timeout_time=..., lk=..., cv=..., clkSource=<optimized out>, waitable=0x7f1773a25a90) at src/mongo/util/waitable.h:92 #6 mongo::ClockSource::waitForConditionUntil (this=<optimized out>, cv=..., bla=..., deadline=..., deadline@entry=..., waitable=0x7f1773a25a90) at src/mongo/util/clock_source.cpp:48 #7 0x00007f1769f43600 in mongo::OperationContext::<lambda()>::operator() (__closure=<optimized out>) at src/mongo/util/lockable_adapter.h:50 #8 mongo::OperationContext::waitForConditionOrInterruptNoAssertUntil (this=0x7f1773514380, cv=..., m=..., deadline=...) at src/mongo/db/operation_context.cpp:309 #9 0x00007f1769ef8792 in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator()(mongo::Date_t, mongo::Interruptible::WakeSpeed) const (this=0x7f1773514380, deadline=..., speed=mongo::Interruptible::WakeSpeed::kSlow) at src/mongo/util/lockable_adapter.h:50 #10 0x00007f1769ef8ccf in mongo::Interruptible::<lambda(mongo::Date_t, mongo::Interruptible::WakeSpeed)>::operator() (speed=mongo::Interruptible::WakeSpeed::kSlow, deadline=..., this=<optimized out>) at src/mongo/db/concurrency/lock_state.cpp:246 #11 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., finalDeadline=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:448 #12 mongo::Interruptible::waitForConditionOrInterruptFor<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (waitTimer=0x0, pred=..., ms=..., m=..., cv=..., this=0x7f1773514380) at src/mongo/util/interruptible.h:480 #13 mongo::CondVarLockGrantNotification::wait (this=0x7f17739e5128, opCtx=0x7f1773514380, timeout=...) 
at src/mongo/db/concurrency/lock_state.cpp:245 #14 0x00007f1769efa9e6 in mongo::LockerImpl::_lockComplete (this=0x7f17739e5080, opCtx=0x7f1773514380, resId=..., mode=mongo::MODE_IS, deadline=...) at src/mongo/db/concurrency/lock_state.cpp:945 #15 0x00007f1769eeea4b in mongo::Lock::ResourceLock::lock (this=0x7f1731b491b0, opCtx=<optimized out>, mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/d_concurrency.cpp:317 #16 0x00007f1769eeebd3 in mongo::Lock::GlobalLock::GlobalLock (this=0x7f1731b491a0, opCtx=0x7f1773514380, lockMode=mongo::MODE_IX, deadline=..., behavior=<optimized out>) at src/mongo/db/concurrency/d_concurrency.cpp:152 #17 0x00007f1769eeecbb in mongo::Lock::DBLock::DBLock (this=0x7f1731b49188, opCtx=0x7f1773514380, db="local", mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/lock_manager_defs.h:106 #18 0x00007f1768e0785f in mongo::AutoGetDb::AutoGetDb (this=0x7f1731b49160, opCtx=0x7f1773514380, dbName="local", mode=mongo::MODE_IX, deadline=...) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/ext/new_allocator.h:86 #19 0x00007f1768e07cfe in mongo::AutoGetCollection::AutoGetCollection (this=0x7f1731b49160, opCtx=0x7f1773514380, nsOrUUID=..., modeColl=mongo::MODE_IX, viewMode=mongo::AutoGetCollection::kViewsForbidden, deadline=...) at src/mongo/base/string_data.h:64 #20 0x00007f1768314b4e in mongo::repl::StorageInterfaceImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b493d0) at src/mongo/db/repl/storage_interface_impl.cpp:930 #21 0x00007f17683153cd in mongo::writeConflictRetry<mongo::repl::StorageInterfaceImpl::upsertById(mongo::OperationContext*, const mongo::NamespaceStringOrUUID&, const mongo::BSONElement&, const mongo::BSONObj&)::<lambda()> > (f=..., ns=..., opStr=..., opCtx=0x7f1773514380) at src/mongo/util/fail_point.h:235 #22 mongo::repl::StorageInterfaceImpl::upsertById (this=<optimized out>, opCtx=<optimized out>, nsOrUUID=..., idKey=..., update=...) at src/mongo/db/repl/storage_interface_impl.cpp:974 #23 0x00007f176846fc8b in mongo::repl::ReplicationConsistencyMarkersImpl::_upsertOplogTruncateAfterPointDocument (this=<optimized out>, opCtx=0x7f1773514380, updateSpec=owned BSONObj 49 bytes @ 0x7f1773604598) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/char_traits.h:285 #24 0x00007f176846ff17 in mongo::repl::ReplicationConsistencyMarkersImpl::setOplogTruncateAfterPoint (this=0x7f176db86b80, opCtx=0x7f1773514380, timestamp=...) 
at src/mongo/base/string_data.h:74 #25 0x00007f176832d304 in mongo::repl::ReplicationCoordinatorExternalStateImpl::_stopAsyncUpdatesOfAndClearOplogTruncateAfterPoint (this=0x7f176db98600) at src/mongo/bson/timestamp.h:76 #26 0x00007f176835d5c8 in mongo::repl::ReplicationCoordinatorImpl::_performPostMemberStateUpdateAction (this=0x7f176dc0a800, action=action@entry=mongo::repl::ReplicationCoordinatorImpl::kActionSteppedDown) at /opt/mongodbtoolchain/revisions/toolchain_builder_rhel62_patch_57df61bfe08e9c2805a38d201c5cd6c0ea414778_5daee4e830661503f51984bf_19_10_22_11_16_02/stow/gcc-v3.cZb/include/c++/8.2.0/bits/unique_ptr.h:342 #27 0x00007f176835f91a in mongo::repl::ReplicationCoordinatorImpl::<lambda()>::operator()(void) const (__closure=0x7f1731b499d0) at src/mongo/db/repl/replication_coordinator_impl.cpp:2582 #28 0x00007f17683711f0 in mongo::repl::ReplicationCoordinatorImpl::stepDown (this=0x7f176dc0a800, opCtx=<optimized out>, force=<optimized out>, waitTime=..., stepdownTime=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:2672 #29 0x00007f1768325bc0 in mongo::repl::CmdReplSetStepDown::run (this=<optimized out>, opCtx=0x7f1773514380, cmdObj=..., result=...) at src/mongo/util/duration.h:241 #30 0x00007f1768fa904a in mongo::BasicCommand::runWithReplyBuilder (replyBuilder=0x7f1773603c30, cmdObj=owned BSONObj 242 bytes @ 0x7f1772eca91d, db="admin", opCtx=0x7f1773514380, this=0x7f176b347c40 <mongo::repl::cmdReplSetStepDown>) at src/mongo/db/commands.h:799 #31 mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=0x7f17739e1a20, opCtx=0x7f1773514380, result=0x7f1773603c30) at src/mongo/db/commands.cpp:764 #32 0x00007f1768fa2f4f in mongo::CommandHelpers::runCommandInvocation (opCtx=0x7f1773514380, request=..., invocation=0x7f17739e1a20, response=0x7f1773603c30) at src/mongo/db/commands.cpp:184 #33 0x00007f176872be15 in mongo::(anonymous namespace)::runCommandImpl (opCtx=<optimized out>, invocation=<optimized out>, request=..., replyBuilder=0x7f1773603c30, startOperationTime=..., behaviors=..., extraFieldsBuilder=0x7f1731b4a3b0, sessionOptions=...) at src/mongo/db/service_entry_point_common.cpp:851 #34 0x00007f176872ee19 in mongo::(anonymous namespace)::execCommandDatabase (opCtx=<optimized out>, command=<optimized out>, request=..., replyBuilder=<optimized out>, behaviors=...) at src/mongo/db/service_entry_point_common.cpp:1177
It's unclear why this isn't caught by the existing stepdown-in-drain-mode tests, such as jstests/replsets/step_down_during_draining.js and its friends; probably because the burn-in tests in my patch build run the test many more times than they do on the master branch. My new test timed out 3 times in the burn-in tests. The logic was introduced in SERVER-46984 on 3/20/20, and SERVER-46201 refactored the code after that.
- related to
  - SERVER-46984 Stop async updates to the oplogTruncateAfterPoint during primary server shutdown prior to clearing the oplogTruncateAfterPoint (Closed)
  - SERVER-46201 Implement a generic ReplicaSetStateAwareService (Closed)