-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Storage, WiredTiger
-
Fully Compatible
-
ALL
Here's an example:
- Checkpoint from K to N
- Drop command at L
- This results in a complete stall from M to N (including the ftdc thread, which explains the lack of ftdc samples from M to N)
Stack traces collected during the stall show why: multiple operations are all stuck in dropAllQueued which in turn is blocked in __wt_session_drop:
00000066`976bf198 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
00000066`976bf1a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
00000066`976bf270 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
00000066`976bf2b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
00000066`976bf2e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
00000066`976bf340 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
00000066`976bf4c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
00000066`976bf520 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
00000066`976bf560 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
00000066`976bf590 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
00000066`976bf5d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
00000066`976bf600 00007ff7`ef812245 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
00000066`976bf640 00007ff7`ef2e57f7 mongod!mongo::MyMessageHandler::process+0xf5
00000066`976bf860 00007ff7`ef016bdd mongod!mongo::PortMessageServer::handleIncomingMsg+0x4a7
The FTDC thread is similarly stuck:
00000063`1823f498 00007ffc`4b93dd39 ntdll!ZwWaitForSingleObject+0xa
00000063`1823f4a0 00007ffc`4b93b704 ntdll!RtlLockHeap+0x169
00000063`1823f570 00007ff7`ef7e839d ntdll!RtlEnterCriticalSection+0xe4
00000063`1823f5b0 00007ff7`ef7e4d07 mongod!__wt_session_drop+0x8d
00000063`1823f5e0 00007ff7`ef18cfd2 mongod!__session_drop+0xe7
00000063`1823f640 00007ff7`ef19f2ad mongod!mongo::WiredTigerKVEngine::dropAllQueued+0x162
00000063`1823f7c0 00007ff7`ef19c507 mongod!mongo::WiredTigerSessionCache::releaseSession+0x19d
00000063`1823f820 00007ff7`ef19c584 mongod!mongo::WiredTigerRecoveryUnit::~WiredTigerRecoveryUnit+0x57
00000063`1823f860 00007ff7`eef1d6c9 mongod!mongo::WiredTigerRecoveryUnit::`scalar deleting destructor'+0x14
00000063`1823f890 00007ff7`eef1d724 mongod!mongo::OperationContextImpl::~OperationContextImpl+0x79
00000063`1823f8d0 00007ff7`ef0fb839 mongod!mongo::OperationContextImpl::`scalar deleting destructor'+0x14
00000063`1823f900 00007ff7`eee7a490 mongod!mongo::ServiceContext::OperationContextDeleter::operator()+0x59
00000063`1823f940 00007ff7`eee7bf1d mongod!mongo::FTDCCollectorCollection::collect+0x300
00000063`1823fb80 00007ff7`ef016bdd mongod!mongo::FTDCController::doLoop+0x37d
I believe the sequence is:
- at L the drop command attempts to drop the collection but gets an EBUSY from __wt_session_drop due to the checkpoint, so it queues the drop command for later execution
- every second from L to M an attempt is made to drop the collection again but again gets an EBUSY from __wt_session_drop so it remains queued and operations continue
- but starting at M __wt_session_drop no longer returns EBUSY but instead gets stuck waiting on the checkpoint, resulting in the stall
It looks like there are two problems:
- due to the logic in dropAllQueued in the integration layer, every completing operation on all threads attempts to execute the queued drop; this is why the stall is a complete stall, instead of affecting only one thread
- starting at M, during the critical phase of the check point (coinciding with the "pinned transactions" stat), WT blocks in __wt_session_drop instead of returning EBUSY