RangeDeleter currently uses the logCursorsWaiting helper to log the list of cursors it waits on. The helper method extracts this information from the cursorsToWait member of the RangeDeleteEntry. The problem is that the blocking method (deleteNow) keeps track of the cursors independently and does not update the cursors in RangeDeleteEntry, so they are not logged properly:
https://github.com/mongodb/mongo/blob/r3.3.3/src/mongo/db/range_deleter.cpp#L320
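To make the mismatch concrete, here is a minimal, self-contained sketch (not the actual code at the link above, which differs in detail): the blocking path collects the open cursors into a local set but never copies them into RangeDeleteEntry::cursorsToWait, so logCursorsWaiting always prints an empty list. The CursorId alias, the openCursors parameter, and the log text are simplified placeholders.

```cpp
#include <cstdint>
#include <iostream>
#include <set>
#include <sstream>

using CursorId = std::int64_t;  // placeholder alias for the sketch

struct RangeDeleteEntry {
    // logCursorsWaiting() only ever reads this member when it builds the
    // "cursor ids: [...]" log line.
    std::set<CursorId> cursorsToWait;
};

// Sketch of the helper: logs only what is stored on the entry itself.
void logCursorsWaiting(const RangeDeleteEntry& entry) {
    std::ostringstream ids;
    for (CursorId id : entry.cursorsToWait)
        ids << id << " ";
    std::cout << "waiting for open cursors before removing range, cursor ids: ["
              << ids.str() << "]\n";
}

// Sketch of the blocking path: the open cursors are tracked in a local set
// and entry.cursorsToWait is never updated, so the helper above always sees
// an empty list.
void deleteNow(RangeDeleteEntry& entry, const std::set<CursorId>& openCursors) {
    std::set<CursorId> cursorsLocal = openCursors;  // tracked independently
    // The fix described above would be to also populate the entry, e.g.:
    //   entry.cursorsToWait = cursorsLocal;
    while (!cursorsLocal.empty()) {
        logCursorsWaiting(entry);  // prints "cursor ids: []"
        break;                     // a real implementation would sleep and re-check
    }
}

int main() {
    RangeDeleteEntry entry;
    deleteNow(entry, {123456789012345});  // one open cursor, yet the log shows none
}
```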
Original title: Chunk range deleter gets stuck with empty id list
Original description:
We are using mongo Balancer to balance chunks evenly to all shards with tag aware sharding.
Since balancer deletes can hit our IO very hard we set waitForDelete to true.
Balancer settings:
{ "_id" : "balancer", "_secondaryThrottle" : true, "_waitForDelete" : true, "stopped" : false, "activeWindow" : { "start" : "19:00", "stop" : "1:00" } }
This time we captured enough log entries which show when the balancer gets stuck on the range delete (attaching the complete log entries).
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] doing delete inline for cleanup of chunk data
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] MigrateFromStatus::done About to acquire global lock to exit critical section
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] waiting for open cursors before removing range [{ source: "AU:2073920479" }, { source: "AU:2073975491" }) in database001.collection1, cursor ids: []
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:31:57.011+0000 I SHARDING [conn702709] waiting for open cursors before removing range [{ source: "AU:2073920479" }, { source: "AU:2073975491" }) in database001.collection1, elapsed secs: 900, cursor ids: []
Interesting here is that the list of cursor ids is empty from the start.
This operation has now been running for more than 6 days and blocks other migrations on this shard and, due to our balancer setting, also the balancer itself.
{ "inprog" : [ {
    "desc" : "conn702709",
    "threadId" : "0x19c4aa40",
    "connectionId" : 702709,
    "opid" : "repset5:-1973023014",
    "active" : true,
    "secs_running" : 574207,
    "microsecs_running" : NumberLong("574207823437"),
    "op" : "query",
    "ns" : "database001.collection1",
    "query" : { "$msg" : "query not recording (too large)" },
    "client_s" : "172.X.X.X:39474",
    "msg" : "step 5 of 6",
    "killPending" : true,
    "numYields" : 527,
    "locks" : { },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : { "acquireCount" : { "r" : NumberLong(1792878), "w" : NumberLong(2), "R" : NumberLong(2) } },
        "MMAPV1Journal" : { "acquireCount" : { "r" : NumberLong(899439), "w" : NumberLong(2) }, "acquireWaitCount" : { "r" : NumberLong(5686) }, "timeAcquiringMicros" : { "r" : NumberLong(357486150) } },
        "Database" : { "acquireCount" : { "r" : NumberLong(896437), "w" : NumberLong(2) } },
        "Collection" : { "acquireCount" : { "R" : NumberLong(896437), "W" : NumberLong(2) }, "acquireWaitCount" : { "R" : NumberLong(3000) }, "timeAcquiringMicros" : { "R" : NumberLong(110598818) } }
    }
} ] }