Type: Bug
Resolution: Incomplete
Priority: Critical - P2
Fix Version/s: None
Affects Version/s: 2.0.7
Component/s: None
Labels: None
Operating System: Linux

I started draining the last of four shards in a live sharded MongoDB cluster (v2.0.7), where each shard is a 3-node replica set. Draining went fine until 16 chunks remained, and it has now been stuck there for more than four hours.

mongos> db.runCommand({removeShard: "mongo-live-d"})
{
    "msg" : "draining ongoing",
    "state" : "ongoing",
    "remaining" : { "chunks" : NumberLong(16), "dbs" : NumberLong(0) },
    "ok" : 1
}
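To put a number on "stuck", the `remaining.chunks` value from repeated `removeShard` calls can be compared over time. The following is a minimal sketch, not part of the original report: `drainStallMinutes` is a hypothetical helper, the sample readings are illustrative values rather than measurements from this cluster, and it assumes the chunk counts have already been converted from `NumberLong` to plain numbers.

```javascript
// Hypothetical helper: given removeShard readings sampled over time,
// return how many minutes "remaining.chunks" has stayed at its latest value.
// Each sample is { atMinutes: <minutes since first sample>, chunks: <number> }.
function drainStallMinutes(samples) {
  if (samples.length === 0) {
    return 0;
  }
  var last = samples[samples.length - 1];
  var since = last.atMinutes;
  // Walk backwards while the chunk count is unchanged.
  for (var i = samples.length - 2; i >= 0; i--) {
    if (samples[i].chunks !== last.chunks) {
      break;
    }
    since = samples[i].atMinutes;
  }
  return last.atMinutes - since;
}

// Illustrative readings only (not taken from this cluster).
var samples = [
  { atMinutes: 0, chunks: 18 },
  { atMinutes: 10, chunks: 16 },
  { atMinutes: 130, chunks: 16 },
  { atMinutes: 250, chunks: 16 }
];
var stalledFor = drainStallMinutes(samples);
```

With these made-up readings the count has been flat from minute 10 to minute 250, so `stalledFor` is 240.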

The mongos log shows this:

Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' acquired, ts : 50adc1d2538fcedc6aa3cf93
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' unlocked.

When I check writeBacksQueued, the total number of queued ops never goes down; it keeps increasing over time:

PRIMARY> db.adminCommand("writeBacksQueued")
{
    "hasOpsQueued" : true,
    "totalOpsQueued" : 603910,
    "queues" : {
        "50787cba376f032868ac165e" : { "n" : 0, "minutesSinceLastCall" : 2 },
        "50787cba4a4a812e093429a5" : { "n" : 341466, "minutesSinceLastCall" : 0 },
        "50787cba5df1e05fedab56ff" : { "n" : 1, "minutesSinceLastCall" : 40 },
        "50787cbadc8a4a2ee5bab98f" : { "n" : 262443, "minutesSinceLastCall" : 0 },
        "50787cbafb83be34cb49a885" : { "n" : 0, "minutesSinceLastCall" : 1 }
    },
    "ok" : 1
}

The "totalOpsQueued" and various "n" values keep going up. I don't see anything interesting in the troublesome shard's mongod log. I'd try restarting everything but I'm worried that this queued data would be lost.
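To compare successive `writeBacksQueued` results more easily, the per-queue counts can be summarized. This is a minimal sketch that works on a plain object shaped like the output above: `summarizeWriteBacks` and the 30-minute staleness threshold are hypothetical choices, and the reading that a non-empty queue with a large `minutesSinceLastCall` has not been polled recently is an assumption about the field's meaning.

```javascript
// Hypothetical helper: summarize a writeBacksQueued result object.
// Written with var/for-in so it should also run in an older mongo shell.
function summarizeWriteBacks(result, staleMinutes) {
  var sum = 0;
  var busy = [];   // non-empty queues that were polled recently
  var stale = [];  // non-empty queues not polled for >= staleMinutes
  for (var id in result.queues) {
    if (!result.queues.hasOwnProperty(id)) {
      continue;
    }
    var q = result.queues[id];
    sum += q.n;
    if (q.n > 0) {
      if (q.minutesSinceLastCall >= staleMinutes) {
        stale.push(id);
      } else {
        busy.push(id);
      }
    }
  }
  return {
    sumOfQueues: sum,
    matchesTotal: sum === result.totalOpsQueued,
    busy: busy,
    stale: stale
  };
}

// The values reported above, as a plain object.
var sample = {
  hasOpsQueued: true,
  totalOpsQueued: 603910,
  queues: {
    "50787cba376f032868ac165e": { n: 0, minutesSinceLastCall: 2 },
    "50787cba4a4a812e093429a5": { n: 341466, minutesSinceLastCall: 0 },
    "50787cba5df1e05fedab56ff": { n: 1, minutesSinceLastCall: 40 },
    "50787cbadc8a4a2ee5bab98f": { n: 262443, minutesSinceLastCall: 0 },
    "50787cbafb83be34cb49a885": { n: 0, minutesSinceLastCall: 1 }
  },
  ok: 1
};
var summary = summarizeWriteBacks(sample, 30);
```

For the output above, the per-queue counts add up to 603910, matching `totalOpsQueued`. Nearly all of it sits in the two queues with `minutesSinceLastCall` of 0, while the queue ending in `56ff` holds a single op and was last called 40 minutes earlier.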

Assignee: Barrie Segal
Reporter: Justin Patrin
Participants: Barrie Segal, Eliot Horowitz, Justin Patrin
Votes: 0
Watchers: 6
Created: Nov 26 2012 09:21:10 PM UTC
Updated: Mar 08 2013 03:56:10 PM UTC
Resolved: Feb 19 2013 07:02:16 PM UTC
