- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Aggregation Framework, Querying
- None
- Fully Compatible
- ALL
- QuInt 8 08/28/15
This is a deadlock that occurs when a shard is trying to receive a chunk while also acting as the merge node for an aggregation sort. Three threads in a single process are involved.
Thread 1 is servicing a getmore request for an aggregation pipeline. A relevant portion of its stack trace is below:
#7 0x000000000159bf52 in mongo::MessagingPort::recv (this=0x6f5fe40, toSend=..., response=...) at src/mongo/util/net/message_port.cpp:227
#8 0x000000000159c2e5 in mongo::MessagingPort::call (this=<optimized out>, toSend=..., response=...) at src/mongo/util/net/message_port.cpp:222
#9 0x0000000000d7921f in mongo::DBClientConnection::call (this=0x3c77c20, toSend=..., response=..., assertOk=<optimized out>, actualServer=<optimized out>) at src/mongo/client/dbclient.cpp:1478
#10 0x0000000000d8b7dc in mongo::DBClientCursor::requestMore (this=this@entry=0x3bcc648) at src/mongo/client/dbclientcursor.cpp:210
#11 0x0000000000d8bcc7 in mongo::DBClientCursor::more (this=0x3bcc648) at src/mongo/client/dbclientcursor.cpp:341
#12 0x00000000010adf73 in mongo::sorter::MergeIterator<mongo::Value, mongo::Document, mongo::DocumentSourceSort::Comparator>::next() ()
#13 0x00000000010aa611 in mongo::DocumentSourceSort::getNext (this=<optimized out>) at src/mongo/db/pipeline/document_source_sort.cpp:71
#14 0x0000000000f59740 in mongo::PipelineProxyStage::getNextBson (this=this@entry=0x3c75840) at src/mongo/db/exec/pipeline_proxy.cpp:131
#15 0x0000000000f5a34f in mongo::PipelineProxyStage::isEOF (this=0x3c75840) at src/mongo/db/exec/pipeline_proxy.cpp:87
#16 0x0000000001107e84 in mongo::shouldSaveCursorGetMore (finalState=<optimized out>, exec=exec@entry=0x3c75920, isTailable=<optimized out>) at src/mongo/db/query/find.cpp:122
#17 0x0000000001108e6d in mongo::getMore (txn=txn@entry=0x7fffe7a02bd0, ns=ns@entry=0x3c7b5e4 "aggShard.server9444", ntoreturn=ntoreturn@entry=0, cursorid=cursorid@entry=35834991707, exhaust=exhaust@entry=0x7fffe7a01fbd, isCursorAuthorized=isCursorAuthorized@entry=0x7fffe7a01fbe) at src/mongo/db/query/find.cpp:414
We can see that thread 1 is blocked because shouldSaveCursorGetMore has asked its pipeline to answer the predicate isEOF, the PipelineProxyStage has turned that into a getNext(), and that in turn causes the sorter to fetch data from one of its remote cursors. At this point, thread 1 holds the database lock for the source of the aggregation in MODE_IS. Notably, the "remote cursor" is actually in the same process; the stack frames near the top are engaging the network layer to communicate with this very process on another thread.
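The hazardous shape in thread 1 is holding the database lock across a synchronous network round trip. Below is a minimal sketch of that shape in standard C++; it does not use MongoDB's actual lock manager or client classes, and fetchNextFromRemote() is a hypothetical stand-in for the DBClientCursor::more()/requestMore() call seen in the trace.

#include <shared_mutex>

struct Document {};

std::shared_mutex dbLock;  // stands in for the per-database lock; a shared_lock plays the role of MODE_IS

// Hypothetical stand-in for a synchronous getmore to a "remote" cursor that may
// in fact live in this very process, serviced by another thread.
Document fetchNextFromRemote() { return Document{}; }

Document getNextSorted() {
    std::shared_lock<std::shared_mutex> lk(dbLock);  // acquired for the whole getmore
    // While the lock is held, the sorter blocks on the network; the "remote" side
    // must take its own lock on the same database before it can answer.
    return fetchNextFromRemote();
}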
At this point, thread 2 starts the migration driver run on the recipient of a new chunk. The other shard in this cluster (not pictured) has started donating a chunk to this node, coincidentally on the same database the aggregation is running against. Thread 2 blocks trying to acquire the database lock in MODE_X, because thread 1 holds it in MODE_IS. Its stack trace is below.
#13 0x0000000000f00cf7 in mongo::Lock::DBLock::DBLock (this=0x7fffd7602c10, locker=0x3c7b9e0, db=..., mode=<optimized out>) at src/mongo/db/concurrency/d_concurrency.cpp:110
#14 0x00000000012970f1 in mongo::MigrationDestinationManager::_migrateDriver (this=this@entry=0x1ffbbc0 <mongo::(anonymous namespace)::migrateDestManager>, txn=txn@entry=0x7fffd7602e10, ns=..., min=..., max=..., shardKeyPattern=..., fromShard=..., epoch=..., writeConcern=...) at src/mongo/db/s/migration_destination_manager.cpp:434
Now, thread 3 starts trying to process the getmore request issued to it by thread 1. It attempts to acquire the same database lock as threads 1 and 2 in MODE_IS. However, because thread 2 is already waiting for MODE_X, no new thread can be granted MODE_IS. We are now deadlocked. Here's thread 3's stack:
#13 0x0000000000f00cf7 in mongo::Lock::DBLock::DBLock (this=0x648b430, locker=0x3bf1e00, db=..., mode=<optimized out>) at src/mongo/db/concurrency/d_concurrency.cpp:110
#14 0x0000000001108e12 in make_unique<mongo::Lock::DBLock, mongo::Locker*, mongo::StringData, mongo::LockMode> () at src/third_party/boost-1.56.0/boost/smart_ptr/make_unique_object.hpp:28
#15 mongo::getMore (txn=txn@entry=0x7ffff2ffbbd0, ns=ns@entry=0x5cf1084 "aggShard.server9444", ntoreturn=ntoreturn@entry=0, cursorid=cursorid@entry=36214372988, exhaust=exhaust@entry=0x7ffff2ffafbd, isCursorAuthorized=isCursorAuthorized@entry=0x7ffff2ffafbe) at src/mongo/db/query/find.cpp:404
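To make the lock interaction concrete, here is a toy model of the grant rule that produces the hang. It is not the server's LockManager; it only encodes the two facts described above: MODE_IS conflicts with MODE_X, and a new request must also wait behind any already-queued conflicting request, so a pending MODE_X keeps further MODE_IS requests from being granted.

#include <cassert>
#include <vector>

enum class Mode { IS, X };

// MODE_IS is compatible with MODE_IS; MODE_X conflicts with everything.
bool conflicts(Mode a, Mode b) {
    return a == Mode::X || b == Mode::X;
}

struct LockHead {
    std::vector<Mode> granted;  // requests currently holding the lock
    std::vector<Mode> pending;  // conflicting requests queued in FIFO order
};

bool canGrant(const LockHead& lock, Mode requested) {
    for (Mode m : lock.granted)
        if (conflicts(requested, m)) return false;
    // Fairness: also wait behind any queued conflicting request, so a pending
    // MODE_X blocks all new MODE_IS grants.
    for (Mode m : lock.pending)
        if (conflicts(requested, m)) return false;
    return true;
}

int main() {
    LockHead dbLock;

    // Thread 1: granted MODE_IS for the aggregation getmore, then blocks on the network.
    assert(canGrant(dbLock, Mode::IS));
    dbLock.granted.push_back(Mode::IS);

    // Thread 2: the migration destination requests MODE_X and must wait.
    assert(!canGrant(dbLock, Mode::X));
    dbLock.pending.push_back(Mode::X);

    // Thread 3: the getmore thread 1 is waiting on requests MODE_IS, but the pending
    // MODE_X blocks it. Thread 1 never finishes, so it never releases MODE_IS: deadlock.
    assert(!canGrant(dbLock, Mode::IS));
}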
Relevant build failures:
https://evergreen.mongodb.com/task/mongodb_mongo_master_solaris_64_bit_aggregation_aa82208d53a4b642441adbadd4bd0595c4fa40ae_15_08_11_21_38_16
is related to:
- SERVER-18203 Aggregation distributed deadlock with sharding (Closed)
- SERVER-17839 Remove PlanStage::isEOF (Backlog)
- SERVER-18841 Make awaitData cursors block rather than busy wait in the OP_GET_MORE path (Closed)