-
Type: Improvement
-
Resolution: Gone away
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Storage Execution
-
Execution Team 2021-11-15, Execution Team 2021-11-29, Execution Team 2021-12-13
-
6
SERVER-58311 surfaced a race condition whereby we call into WiredTiger Rollback To Stable (RTS) while some operations that were intended to be killed are found to be still running. This is not expected as WiredTiger usually requires RTS to have exclusive access to a dhandle; a running user operation can't guarantee that. This is observed when building the server with these non-default flags.
E STORAGE 22435 [BackgroundSync] "WiredTiger error","attr":{"error":16,"message":"[1625743985:510865][6205:0x7f6a41a84700], file:collection-XXX.wt, txn rollback_to_stable: __rollback_to_stable_btree_apply, 1522: file:sizeStorer.wt: unable to open handle, error indicates handle is unavailable due to concurrent use: Device or resource busy"}
The problematic operation surviving the killOp seems to be a yielded oplog getmore that waits for more oplog inserts via CappedInsertNotifier. Stack at the time of failure:
#0 0x00007f3adfe33f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00007f3ae16e6705 in __gthread_cond_timedwait (__cond=0x7f3aad53f7b0, __mutex=0x7f3ab54de0b0, __abs_timeout=0x7f3a8da22ce0) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../in clude/c++/8.2.0/x86_64-mongodb-linux/bits/gthr-default.h:871 #2 0x00007f3ae17125e6 in std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/li b/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:178 #3 0x00007f3ae1712535 in std::condition_variable::wait_until<std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x 86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:106 #4 0x00007f3ade9d55fc in std::_V2::condition_variable_any::wait_until<std::unique_lock<mongo::latch_detail::Latch>, std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f3aad53f7b0, __lock=..., __atime=...) at /opt/mongod btoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/condition_variable:286 #5 0x00007f3acac8a2bf in mongo::CappedInsertNotifier::waitUntil (this=0x7f3aad53f7b0, prevVersion=18, deadline=...) at src/mongo/db/catalog/collection.cpp:53 #6 0x00007f3ad436699a in mongo::insert_listener::waitForInserts(mongo::OperationContext*, mongo::PlanYieldPolicy*, mongo::insert_listener::CappedInsertNotifierData*)::$_1::operator()() const (this=0x7f3a8da230c0) at src/mongo/db/query/plan_insert_listener.cpp:108 #7 0x00007f3ad43667fd in std::_Function_handler<void (), mongo::insert_listener::waitForInserts(mongo::OperationContext*, mongo::PlanYieldPolicy*, mongo::insert_listener::CappedInsertNotifierData*)::$_1>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:297 #8 0x00007f3adf2c491e in std::function<void ()>::operator()() const (this=0x7f3a8da230c0) at /opt/mongodbtoolchain/revisions/d4e86835db48288718d967ff002f9945c7243aac/stow/gcc-v3.oGe/lib/gcc/x86_64-mongodb-linux/8.2.0/../../../../include/c++/8.2.0/bits/std_function.h:687 #9 0x00007f3ad084066e in mongo::PlanYieldPolicy::performYield(mongo::OperationContext*, mongo::Yieldable const*, std::function<void ()>) (this=0x7f3a9ee0e380, opCtx=0x7f3aacbbc220, yieldable=0x7f3a8da24a10, whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:180 #10 0x00007f3ad0840134 in mongo::PlanYieldPolicy::yieldOrInterrupt(mongo::OperationContext*, std::function<void ()>) (this=0x7f3a9ee0e380, opCtx=0x7f3aacbbc220, whileYieldingFn=...) at src/mongo/db/query/plan_yield_policy.cpp:108 #11 0x00007f3ad4366450 in mongo::insert_listener::waitForInserts (opCtx=0x7f3aacbbc220, yieldPolicy=0x7f3a9ee0e380, notifierData=0x7f3a8da234b0) at src/mongo/db/query/plan_insert_listener.cpp:106 #12 0x00007f3ad433d3b8 in mongo::PlanExecutorImpl::_getNextImpl (this=0x7f3a94d37020, objOut=0x7f3a8da23578, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:451 #13 0x00007f3ad433c598 in mongo::PlanExecutorImpl::getNextDocument (this=0x7f3a94d37020, objOut=0x7f3a94d370f0, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:311 #14 0x00007f3ad433c37a in mongo::PlanExecutorImpl::getNext (this=0x7f3a94d37020, objOut=0x7f3a8da23900, dlOut=0x0) at src/mongo/db/query/plan_executor_impl.cpp:298 #15 0x00007f3abf2e9b85 in mongo::(anonymous namespace)::GetMoreCmd::Invocation::generateBatch (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, cursor=0x7f3a94d2d620, cmd=..., isTailable=true, nextBatch=0x7f3a8da243d0, numResults=0x7f3a8da23ff0, docUnitsReturned=0x7f3a8da23fd8) at src/mongo/db/commands/getmore_cmd.cpp:335 #16 0x00007f3abf2e7cf2 in mongo::(anonymous namespace)::GetMoreCmd::Invocation::acquireLocksAndIterateCursor (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, reply=0x7f3ab5446ca0, cursorManager=0x7f3ab54053a0, cursorPin=..., curOp=0x7f3a94d1a7b8) at src/mongo/db/commands/getmore_cmd.cpp:641 #17 0x00007f3abf2e5f2f in mongo::(anonymous namespace)::GetMoreCmd::Invocation::run (this=0x7f3aaee32480, opCtx=0x7f3aacbbc220, reply=0x7f3ab5446ca0) at src/mongo/db/commands/getmore_cmd.cpp:745
Leads of investigation:
- Does the logic waiting for operations to get killed wait for lock-free reads, which do not acquire the RSTL lock?
- Similarly, does that logic wait for operations that have yielded and therefore do not hold the RSTL?
- is related to
-
SERVER-74809 Yield call into storage engine after releasing global lock
- Closed
-
SERVER-58311 MongoDB calls WT RTS API while holding an open cursor on _mdb_catalog
- Closed
-
SERVER-63989 Retry rollback_to_stable until all concurrent operations finish
- Closed
- related to
-
WT-8429 Improve error handling when rollback_to_stable fails
- Closed
-
SERVER-88944 Use something better than the GlobalLock to protect storage engine access
- Backlog
-
SERVER-63989 Retry rollback_to_stable until all concurrent operations finish
- Closed