- Type: Bug
- Resolution: Incomplete
- Priority: Major - P3
- Affects Version/s: 2.6.4
- Component/s: Sharding
- Operating System: ALL
Intro
We are running a sharded cluster of 7 shards. Every shard is a replica set of 3 members.
We pre-split in order to distribute newly inserted documents across the shards in proportion to their amount of RAM. Our database must fit in RAM, otherwise performance drops significantly. The balancer is switched off because it assumes all servers have the same amount of RAM, which would overload the shards that have less RAM.
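For illustration, a minimal sketch of that setup; the namespace mydb.offer, the numeric split points and the shard names are hypothetical, only the pattern matters:

// keep the balancer from redistributing the manually placed chunks
sh.stopBalancer()
// pre-split on _id and pin more ranges to the shards that have more RAM
sh.splitAt("mydb.offer", { _id: NumberLong("1000000000") })
sh.splitAt("mydb.offer", { _id: NumberLong("2000000000") })
sh.moveChunk("mydb.offer", { _id: NumberLong("1000000000") }, "rs1")
sh.moveChunk("mydb.offer", { _id: NumberLong("2000000000") }, "rs2")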
From time to time, e.g. when adding a new shard, we need to rebalance manually. Our script first runs sh.moveChunk and, when it returns ok, deletes the moved documents from the donor shard, because we have observed that sh.moveChunk does not always clean up 100% of them.
In (slightly simplified) mongo shell code it looks like this:

// fromCollection, fromShard and toShard are defined beforehand
var chunkNo = 0;
db.getSiblingDB("config").chunks.find({ ns: fromCollection.getFullName(), shard: fromShard }).forEach(function(chunk) {
    print("move " + (++chunkNo) + " of id: " + chunk.min._id);
    var move = sh.moveChunk(fromCollection.getFullName(), chunk.min, toShard);
    if (move.ok) {
        // moveChunk does not always clean up 100%, so delete the moved range explicitly
        var del = fromCollection.remove(
            { _id: { $gte: chunk.min._id, $lt: chunk.max._id } },
            { writeConcern: { w: "majority" } }
        );
        print("deleted: " + del.nRemoved);
    }
});
Problem
We observed that mongo may wait forever when we add the writeConcern "majority" to the remove command.
In the currentOp we see something like this:
{ "opid" : 135910798, "active" : true, "secs_running" : 2362, "microsecs_running" : NumberLong("2362545153"), "op" : "query", "ns" : "", "query" : { "delete" : "offer", "deletes" : [ { "q" : { "_id" : { "$gte" : NumberLong("2577460009"), "$lt" : NumberLong("2577494734") } }, "limit" : 0 } ], "ordered" : true, "writeConcern" : { "w" : "majority" } }, "client" : "172.30.2.130:51426", "desc" : "conn4708952", "threadId" : "0x7f2035ca5700", "connectionId" : 4708952, "waitingForLock" : false, "msg" : "waiting for write concern", "numYields" : 0, "lockStats" : { "timeLockedMicros" : { }, "timeAcquiringMicros" : { } } }
While mongo was waiting, we counted the documents of this chunk's range on all replica set members of the donor shard. The count was 0 everywhere! So all documents of this chunk had already been removed from the donor shard, yet mongo was still waiting. Why?
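A minimal sketch of such a per-member check, using the range from the currentOp output above; the database name "mydb" is a placeholder:

// run against each member of the donor replica set, secondaries included
rs.slaveOk()  // allow reads on secondaries
db.getSiblingDB("mydb").offer.count({ _id: { $gte: NumberLong("2577460009"), $lt: NumberLong("2577494734") } })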
Even stranger, the output of the script above looked like this:
move 1 of id: 11605025
deleted: 5623
move 2 of id: 183729058
deleted: 7152
move 3 of id: 2577460009
deleted: 3561
move 4 of id: 2977458123
Move 4 hung forever.
As you can see, mongo was endlessly waiting on the delete for the range starting at $gte: 2577460009, i.e. chunk #3. However, that delete had already returned to the script, since it printed the number of deleted documents (3561). It was only the following chunk move #4 that got stuck. Why?
Also, as you can see, our script always had to delete documents that sh.moveChunk should already have removed itself. Shouldn't it?
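As an aside, on 2.6 such leftover (orphaned) documents can also be removed by the shard itself with the cleanupOrphaned admin command, run directly against the primary of the donor shard. A hedged sketch, assuming the namespace is mydb.offer:

// loop over all orphaned ranges of the namespace until none are left
var nextKey = {};
while (nextKey != null) {
    var res = db.adminCommand({ cleanupOrphaned: "mydb.offer", startingFromKey: nextKey });
    if (res.ok != 1) {
        printjson(res);  // failure or timeout, stop here
        break;
    }
    nextKey = res.stoppedAtKey;
}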