If the (primary) config server goes down while the balancer is running, the result can be stale locks that are not released even after the config server is brought back up. This can cause failures when starting or stopping the balancer and an inability to run moveChunk commands. Manual intervention is then required to verify that the locks are stale and to release them.
There appear to be a few scenarios:
- In rare cases the balancer lock may get stuck, resulting in an inability to start or stop the balancer, i.e. sh.stopBalancer() will time out and display an error:
> sh.stopBalancer()
Updated 1 new record(s) in 7ms
Waiting for active hosts...
Waiting for the balancer lock...
assert.soon failed: function (){
    var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
    if( state == false ) return ! lock || lock.state == 0
    if( state == true ) return lock && lock.state == 2
    if( state == undefined ) return (beginTS == undefined && lock) || (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
    else return lock && lock.state == state
}, msg:Waited too long for lock balancer to unlock
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:110:60)
    at Function.sh.waitForDLock (src/mongo/shell/utils_sh.js:156:12)
    at Function.sh.waitForBalancerOff (src/mongo/shell/utils_sh.js:224:12)
    at Function.sh.waitForBalancer (src/mongo/shell/utils_sh.js:254:12)
    at Function.sh.stopBalancer (src/mongo/shell/utils_sh.js:126:8)
    at (shell):1:4
Balancer still may be active, you must manually verify this is not the case using the config.changelog collection.
Thu Aug 21 23:01:13.142 assert.soon failed: function (){
    var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
    if( state == false ) return ! lock || lock.state == 0
    if( state == true ) return lock && lock.state == 2
    if( state == undefined ) return (beginTS == undefined && lock) || (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
    else return lock && lock.state == state
}, msg:Waited too long for lock balancer to unlock at src/mongo/shell/utils_sh.js:228
I also see the following:
> use config
switched to db config
> db.locks.find({state:2}, {_id:1, why:1})
{ "_id": "balancer", "why": "doing balance round" }
- A collection lock may get stuck, resulting in the balancer failing to move chunks. This leads to errors like the following in the mongod.log files:
Thu Aug 21 23:26:35.362 [conn5] about to log metadata event: { _id: "hal.local-2014-08-21T22:26:35-53f6721b510fb7b6210e5146", server: "hal.local", clientAddr: "192.168.1.129:53350", time: new Date(1408659995362), what: "moveChunk.from", ns: "foo.bar", details: { min: { _id: 2.0 }, max: { _id: 3.0 }, step1 of 6: 0, note: "aborted" } }
Similarly, the changelog collection shows aborted messages:
> db.changelog.find({}, {"details.note":1}).sort({time:-1}).limit(1)
{ "_id": "hal.local-2014-08-21T22:37:18-53f6749e510fb7b6210e51b1", "details": { "note": "aborted" }, "time": ISODate("2014-08-21T22:37:18.137Z") }
The locks collection also shows entries similar to:
> db.locks.find({state:2}, {_id:1, why:1})
{ "_id": "foo.bar", "why": "migrate-{ _id: 2.0 }" }
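Before treating these as stale I verify, per the suggestion in the stopBalancer error above, that nothing is actually running. Roughly the following checks, run against the config database (a sketch only; the collections and fields are the ones shown in the output above, and foo.bar is just the namespace from my repro):

> use config
> db.locks.findOne({_id: "balancer"})                               // full lock doc: "who"/"process"/"ts" identify the mongos holding it
> db.locks.findOne({_id: "foo.bar"})                                // same check for the collection lock
> db.changelog.find({what: /^moveChunk/}).sort({time: -1}).limit(5) // no recent moveChunk entries => no migration in flight

If the mongos named in the lock documents is idle (or has since been restarted) and the changelog shows nothing recent, the locks appear to be stale.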
The behavior in 2.4.x vs 2.6.x is a little different:
- 2.4.x seems quite susceptible to the problem; it can be triggered easily using the repro steps provided.
- 2.6.x usually recovers from stale locks shortly after the config server is restarted, but on rare occasions it has also retained a lock after the restart. When there are stale locks in 2.6.x, it is typically just the balancer lock.
While my reproduction steps involve taking down the config server, it is also likely that the same problem could occur if there are network issues between cluster members and the config server.
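For reference, the manual release amounts to forcing the stuck config.locks entries back to state 0, something like the following (a sketch only; it should only be run after confirming, as above, that no balancing round or migration is actually in progress, and foo.bar is just the namespace from my repro):

> use config
> db.locks.update({_id: "balancer", state: 2}, {$set: {state: 0}})  // release the stale balancer lock
> db.locks.update({_id: "foo.bar", state: 2}, {$set: {state: 0}})   // release the stale collection (migration) lock

Matching on state: 2 makes the update a no-op if the lock has already been released in the meantime.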
Is related to: SERVER-7260 Balancer lock is not relinquished (Closed)