- Type: New Feature
- Resolution: Won't Fix
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Admin, Replication
Replication
At Foursquare we implemented a way to mark a member of a replica set as unhealthy. Here's how it works:
The mongod watches for the presence of a kill file. If the file is present, the mongod makes itself ineligible to be primary of its replica set, stepping down if it is already primary. It also reports its kill-file status via serverStatus so the mongoS can refuse to send queries to a killed secondary.
Some more details:
- mongod returns an additional healthStatus object from serverStatus. That object contains an "ok" boolean, a descriptive "msg" string, and a boolean named "killfile" indicating whether a kill file is present.
- mongoS's existing replica set polling thread now polls the mongod's serverStatus instead of isMaster. If healthStatus.ok is false or serverStatus times out N times in a row, the mongoS stops sending requests to that secondary. If the host is a primary, nothing happens at the mongoS level.
- every second, a KillFileWatcher thread in the mongod checks for the presence of a kill file. If the file is present, three things happen:
  - the mongod's serverStatus returns a healthStatus block with ok=false; msg=a message indicating the presence of the kill file and its contents, if any; and killfile=true.
  - the mongod marks itself as not electable by forcing ReplSetImpl::iAmPotentiallyHot to return false.
  - if the mongod is the primary of a replica set, it issues a stepdown(60) so another member can take over.
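The watcher logic above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from our C++ patch: the function names `health_status` and `on_tick`, the message wording, and the empty `msg` for a healthy node are all assumptions made for the example.

```python
import os


def health_status(kill_file_path):
    """Build a healthStatus-like dict from the presence of a kill file.

    Field names follow the ticket ("ok", "msg", "killfile"); the message
    text and the empty msg for a healthy node are assumptions.
    """
    if not os.path.exists(kill_file_path):
        return {"ok": True, "msg": "", "killfile": False}
    try:
        with open(kill_file_path, "r", encoding="utf-8", errors="replace") as f:
            contents = f.read().strip()
    except OSError:
        # The path exists but could not be read; still treat the node as killed.
        contents = ""
    msg = "kill file present: " + kill_file_path
    if contents:
        msg += " (" + contents + ")"
    return {"ok": False, "msg": msg, "killfile": True}


def on_tick(kill_file_path, is_primary):
    """One watcher iteration (hypothetical helper).

    Returns (status, electable, should_step_down), mirroring the three
    effects listed above.
    """
    status = health_status(kill_file_path)
    killed = status["killfile"]
    electable = not killed
    should_step_down = killed and is_primary
    return status, electable, should_step_down
```

In the real patch the loop runs once per second inside the mongod; here a caller would invoke `on_tick` on whatever schedule it likes and act on the returned flags.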
Additional details:
- the health status is returned from mongod to mongos via the db.adminCommand("connPoolStats"), and there is now a flag that adjusts the polling frequency of this command on mongos.
- since secondary querying is affected, we'd have to hack the driver to make this work for non-sharded clusters.
- there is the case where all replicas report a failing health status. When this happens, the primary steps down and no new primary is elected. We assume this is a catastrophic case that deserves attention, so we're living with it.
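For illustration, the mongoS-side bookkeeping could look something like the Python sketch below. The class and method names are hypothetical, and it makes one assumption the description leaves open: a poll that returns ok=false excludes the secondary immediately, while timeouts only exclude it after N in a row.

```python
class SecondaryHealthTracker:
    """Tracks which secondaries should receive queries (hypothetical sketch)."""

    def __init__(self, max_timeouts):
        self.max_timeouts = max_timeouts  # the "N" from the description
        self.timeouts = {}                # host -> consecutive timeouts
        self.excluded = set()             # secondaries currently skipped

    def record_poll(self, host, is_primary, health_ok):
        """Record one poll result.

        health_ok is True or False for a completed poll, or None when the
        poll timed out.
        """
        if is_primary:
            # Nothing happens at the mongoS level for a primary.
            return
        if health_ok is None:
            self.timeouts[host] = self.timeouts.get(host, 0) + 1
            if self.timeouts[host] >= self.max_timeouts:
                self.excluded.add(host)
            return
        # A completed poll resets the consecutive-timeout counter.
        self.timeouts[host] = 0
        if health_ok:
            self.excluded.discard(host)
        else:
            self.excluded.add(host)

    def eligible_secondaries(self, hosts):
        """Return the hosts that may still receive secondary queries."""
        return [h for h in hosts if h not in self.excluded]
```

If every secondary ends up excluded, `eligible_secondaries` returns an empty list; what the router does in that case is outside this sketch.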
We think it would be generally useful to build this functionality into the mainline code. Our customizations are here:
https://github.com/foursquare/mongo/commit/6ff5bd021d98f25406b74b1cb89d276d0b403ce2
- is related to: SERVER-14983 Ability to immediately mark the node as unable to service user queries (Open)
- related to: SERVER-14576 mongod automatic shutdown on stdin close (Backlog)