- Type: Improvement
- Resolution: Won't Do
- Priority: Major - P3
- None
- Affects Version/s: 2.6.6
- Component/s: Sharding
- Sharding EMEA
If a shard is currently unable to do w:majority writes, it should not be considered a target for balancer moves. Such migrations are highly likely to stall for an extended period (up to 10 hours) at pre-commit, causing orphans and wasted time (during which no other migrations can occur).
While the shard could become healthy and able to support w:majority during the migration, it is usually better to wait until that is known before starting the migration. If the problematic shard is avoided, the balancer can instead balance chunks among the remaining shards, since those migrations have a much better chance of completing successfully.
This applies only to chunk moves initiated by the balancer. If a user knows that the target shard will be able to catch a majority up within 10 hours, they should still be able to run a manual moveChunk. The balancer, by contrast, has no way of knowing this, so in the absence of that knowledge it should assume the state of the system won't change, i.e. that the migration won't succeed.
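For illustration only (a sketch, not actual balancer code), the kind of health check being proposed could be expressed in the mongo shell as a short w:majority probe write against each shard, skipping any shard that cannot acknowledge it. The probe database and collection names below are made up for the example:

// Run against mongos. For each shard, attempt a small w:majority write with a
// short timeout; a shard that cannot acknowledge it is not a good migration target.
var shards = db.getSiblingDB("config").shards.find().toArray();
shards.forEach(function (shard) {
    // shard.host looks like "shard02/genique:33337,genique:33338"
    var conn = new Mongo(shard.host);
    var res = conn.getDB("majority_probe").runCommand({
        insert: "probe",                                  // throwaway namespace, made up for this example
        documents: [ { _id: ObjectId(), ts: new Date() } ],
        writeConcern: { w: "majority", wtimeout: 5000 }
    });
    var healthy = (res.ok === 1) && !res.writeConcernError;
    print(shard._id + ": " + (healthy
        ? "can satisfy w:majority, acceptable migration target"
        : "cannot satisfy w:majority, should be skipped for balancing"));
});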
To reproduce this (on 2.6.6):
$ mlaunch init --replicaset --nodes 2 --arbiter --smallfiles --oplogSize 100 --sharded 2 --config 1 --mongos 1 --port 33333
launching: mongod on port 33334
launching: mongod on port 33335
launching: mongod on port 33336
launching: mongod on port 33337
launching: mongod on port 33338
launching: mongod on port 33339
launching: config server on port 33340
replica set 'shard01' initialized.
replica set 'shard02' initialized.
launching: mongos on port 33333
adding shards. can take up to 30 seconds...

$ mlaunch stop shard02 secondary
1 node stopped.

$ mlaunch list
PROCESS          PORT     STATUS     PID
mongos           33333    running    5919
config server    33340    running    5826
shard01
    primary      33334    running    5463
    secondary    33335    running    5483
    arbiter      33336    running    5530
shard02
    primary      33337    running    5582
    mongod       33338    down       -
    arbiter      33339    running    5757

$ mongo --port 33333
MongoDB shell version: 2.6.6
connecting to: 127.0.0.1:33333/test
> sh.enableSharding("test")
{ "ok" : 1 }
> sh.shardCollection("test.test", {_id:1})
{ "collectionsharded" : "test.test", "ok" : 1 }
> db.test.insert({_id:-1})
WriteResult({ "nInserted" : 1 })
> db.test.insert({_id:1})
WriteResult({ "nInserted" : 1 })
> sh.splitAt("test.test", {_id:0})
{ "ok" : 1 }
> sleep(60*1000)
null
> db.getSiblingDB("config").locks.find()
{ "_id" : "configUpgrade", "state" : 0, "who" : "genique:33333:1421043589:1804289383:mongosMain:846930886", "ts" : ObjectId("54b3678501891d25c2a0de29"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T06:19:49.271Z"), "why" : "upgrading config database to new format v5" }
{ "_id" : "balancer", "state" : 2, "who" : "genique:33333:1421043589:1804289383:Balancer:1681692777", "ts" : ObjectId("54b3743d01891d25c2a0e051"), "process" : "genique:33333:1421043589:1804289383", "when" : ISODate("2015-01-12T07:14:05.817Z"), "why" : "doing balance round" }
{ "_id" : "test.test", "state" : 2, "who" : "genique:33334:1421046843:292657564:conn8:1901975678", "ts" : ObjectId("54b3743d26b588f243d1e4a7"), "process" : "genique:33334:1421046843:292657564", "when" : ISODate("2015-01-12T07:14:05.830Z"), "why" : "migrate-{ _id: MinKey }" }
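At this point shard02 has only its primary and arbiter up, so it cannot satisfy w:majority for data writes. As an illustrative sanity check (assuming direct access to the shard02 primary on port 33337 from the mlaunch output above):

// Connect directly to the shard02 primary and inspect replica set member health.
var shard02 = new Mongo("genique:33337");   // substitute your hostname
var status = shard02.getDB("admin").runCommand({ replSetGetStatus: 1 });
status.members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  health=" + m.health);
});
// With the secondary on 33338 stopped, only the PRIMARY and the ARBITER report
// health 1, so just one data-bearing member can acknowledge writes and
// w:majority cannot be satisfied.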
The logfile on the shard02 primary confirms the stalled migration:
$ grep migrateThread data/shard02/rs1/mongod.log | head -10
2015-01-12T18:14:05.843+1100 [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:33334,genique:33335 at epoch 54b370c601891d25c2a0dfbc
2015-01-12T18:14:05.843+1100 [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:33334,genique:33335
2015-01-12T18:14:05.845+1100 [migrateThread] allocating new ns file /home/kev/testing/balancer-majority/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T18:14:05.966+1100 [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T18:14:05.967+1100 [migrateThread] added index to empty collection
2015-01-12T18:14:05.978+1100 [migrateThread] Helpers::removeRangeUnlocked time spent waiting for replication: 0ms
2015-01-12T18:15:05.979+1100 [migrateThread] warning: secondaryThrottle on, but doc insert timed out after 60 seconds, continuing
2015-01-12T18:15:05.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:06.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T18:15:07.980+1100 [migrateThread] Waiting for replication to catch up before entering critical section
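The same stall can be demonstrated with an illustrative canary write directly against the shard02 primary (not part of the repro itself): the write is applied locally, but the majority acknowledgement times out, which is exactly what the migration's pre-commit wait is blocked on.

// Illustrative canary write against the shard02 primary on port 33337.
var primary = new Mongo("genique:33337");    // substitute your hostname
var res = primary.getDB("test").runCommand({
    insert: "canary",                        // throwaway collection, made up for this example
    documents: [ { _id: ObjectId(), ts: new Date() } ],
    writeConcern: { w: "majority", wtimeout: 10000 }
});
printjson(res.writeConcernError);
// Expected to report a wtimeout, e.g. something like:
// { "code" : 64, "errInfo" : { "wtimeout" : true }, "errmsg" : "waiting for replication timed out" }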
In 2.8.0-rc4 the only difference is a 1-hour timeout for the initial rangeDeleter, after which the receiving end of the migration fails and is retried (I also disabled secondaryThrottle in this test):
$ grep migrateThread data/shard02/rs1/mongod.log | head -20
2015-01-12T21:31:51.871+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T21:31:51.871+1100 I NETWORK [migrateThread] starting new replica set monitor for replica set shard01 with seeds genique:23334,genique:23335
2015-01-12T21:31:51.874+1100 I INDEX [migrateThread] allocating new ns file /home/kev/testing/balancer-majority-2.8/data/shard02/rs1/db/test.ns, filling with zeroes...
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] build index on: test.test properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.test" }
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] building index using bulk method
2015-01-12T21:31:51.993+1100 I INDEX [migrateThread] build index done. scanned 0 total records. 0 secs
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 749
2015-01-12T21:31:51.993+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:31:51.995+1100 I SHARDING [migrateThread] rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.995+1100 W SHARDING [migrateThread] Failed to queue delete for migrate abort: rangeDeleter timed out after 0 seconds while waiting for deletions to be replicated to majority nodes
2015-01-12T22:31:51.996+1100 I NETWORK [migrateThread] scoped connection to shard01/genique:23334,genique:23335 not being returned to the pool
2015-01-12T22:31:51.996+1100 I SHARDING [migrateThread] about to log metadata event: { _id: "genique-2015-01-12T11:31:51-54b3b0a75e62b2a409dceac3", server: "genique", clientAddr: ":27017", time: new Date(1421062311996), what: "moveChunk.to", ns: "test.test", details: { min: { _id: MinKey }, max: { _id: 0.0 }, step 1 of 5: 122, note: "aborted" } }
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:31:52.001+1100 W SHARDING [migrateThread] cannot remove pending chunk [{ _id: MinKey }, { _id: 0.0 }), this shard does not contain the chunk
2015-01-12T22:32:03.032+1100 I SHARDING [migrateThread] starting receiving-end of migration of chunk { _id: MinKey } -> { _id: 0.0 } for collection test.test from shard01/genique:23334,genique:23335 at epoch 54b3a02aaea4b148f49e0882
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] Deleter starting delete for: test.test from { _id: MinKey } -> { _id: 0.0 }, with opId: 6925
2015-01-12T22:32:03.033+1100 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for test.test from { _id: MinKey } -> { _id: 0.0 }
2015-01-12T22:32:03.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:04.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
2015-01-12T22:32:05.034+1100 I SHARDING [migrateThread] Waiting for replication to catch up before entering critical section
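As long as the balancer does not avoid such shards automatically, one possible operator workaround is simply to keep it from scheduling migrations while a shard cannot do w:majority writes, and to re-enable it once the shard recovers, e.g.:

// Run via mongos.
sh.stopBalancer();                   // stop all balancing rounds, or...
sh.disableBalancing("test.test");    // ...disable balancing for a single collection
sh.getBalancerState();               // verify the balancer setting

// Once shard02 again has a majority of data-bearing members available:
sh.enableBalancing("test.test");
sh.startBalancer();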