The balancer started to take 100% CPU after we upgraded from 3.2.9 to 3.4.0 and enabled it. This is a cluster with 4 shards (rs1, rs2, rs3 and rs4); before upgrading we removed a fifth shard (rs5), waiting until the drain completed.
In the log I can see warnings like these for several collections:
2016-12-13T09:52:11.211+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:52:13.087+0000 W SHARDING [Balancer] Unable to enforce tag range policy for collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
2016-12-13T09:53:38.583+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161008 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161008
2016-12-13T09:53:40.360+0000 W SHARDING [Balancer] Unable to balance collection eplus.wifiCollection_20161009 :: caused by :: Location10181: not sharded:eplus.wifiCollection_20161009
These collections are created and dropped after a few days, and indeed the ones in the warnings were dropped and no longer show up in "db.getCollectionNames()".
I investigated a bit and found entries for those collections in the config DB:
{ "_id" : "eplus.wifiCollection_20161008", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-18T04:00:13.108Z"), "dropped" : true }
{ "_id" : "eplus.wifiCollection_20161009", "lastmodEpoch" : ObjectId("000000000000000000000000"), "lastmod" : ISODate("2016-10-19T04:00:48.158Z"), "dropped" : true }
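The stale entries can be listed in the mongo shell with something like `db.getSiblingDB("config").collections.find({ dropped: true })`. The filtering step the balancer appears to be missing can be sketched in plain JavaScript; the first two documents below are trimmed copies of the real entries, and "eplus.wifiCollection_20161201" is a hypothetical still-live collection added for contrast:

```javascript
// Illustration only: skip config.collections entries marked "dropped"
// instead of retrying them forever. The first two documents mirror the
// real entries above; the third is a hypothetical live collection.
const configCollections = [
  { _id: "eplus.wifiCollection_20161008", dropped: true },
  { _id: "eplus.wifiCollection_20161009", dropped: true },
  { _id: "eplus.wifiCollection_20161201", dropped: false },
];

// Only collections that are still sharded should be balancing candidates.
const balanceable = configCollections
  .filter((coll) => !coll.dropped)
  .map((coll) => coll._id);

console.log(balanceable); // [ 'eplus.wifiCollection_20161201' ]
```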
And there are locks for many collections related to the removed shard (rs5):
{ "_id" : "eplus.wifiCollection_20160908", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d384"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.235Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160908" }
{ "_id" : "eplus.wifiCollection_20160909", "state" : 0, "ts" : ObjectId("5837ee01c839440f1e70d38b"), "who" : "wifi-db-05a:27018:1475838481:-1701389523:conn104", "process" : "wifi-db-05a:27018:1475838481:-1701389523", "when" : ISODate("2016-11-25T07:53:37.296Z"), "why" : "migrating chunk [{ lineId: 8915926302292949940 }, { lineId: MaxKey }) in eplus.wifiCollection_20160909" }
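To see how many of these locks trace back to the removed shard, the lock documents can be grouped by the host part of their `process` field (assuming, as the entries above suggest, that `wifi-db-05a` belonged to rs5). A minimal sketch, using trimmed copies of the documents above:

```javascript
// Sketch: count config.locks entries per owning host, to spot locks still
// attributed to the drained shard. Documents are trimmed copies of the
// entries shown above.
const locks = [
  { _id: "eplus.wifiCollection_20160908", state: 0,
    process: "wifi-db-05a:27018:1475838481:-1701389523" },
  { _id: "eplus.wifiCollection_20160909", state: 0,
    process: "wifi-db-05a:27018:1475838481:-1701389523" },
];

// Group lock documents by the hostname prefix of the process string.
const byHost = {};
for (const lock of locks) {
  const host = lock.process.split(":")[0];
  byHost[host] = (byHost[host] || 0) + 1;
}

console.log(byHost); // { 'wifi-db-05a': 2 }
```

The equivalent query in the mongo shell would be along the lines of `db.getSiblingDB("config").locks.find({ process: /^wifi-db-05a/ })`.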
There are locks not only for dropped collections but also for existing ones. Our guess is that this causes the balancer to loop continuously over all of these collections, pegging the CPU at 100%, but we are not sure how to work around it.
is related to:
- SERVER-27474 Eliminate "dropped" collections from config server list of collections (Closed)
- SERVER-27475 mongos should request an update only of the collection not found (Closed)