Type: New Feature
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Admin, Replication
Labels: None
Assigned Teams: Replication
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

At Foursquare we implemented a way to mark a member of a replica set as unhealthy. Here's how it works:

The mongod monitors for the presence of a kill file. If the file is present, the mongod makes itself ineligible to be primary of a replica set, and steps down if it is already primary. It also reports its kill-file status via serverStatus so the mongoS can refuse to send queries to a killed secondary.

Some more details:

mongod returns an additional healthStatus object from serverStatus. That object contains an "ok" boolean, a descriptive "msg" message, and a boolean named "killfile" indicating whether a kill file is present.
The mongoS's existing replica set polling thread now polls the mongod's serverStatus instead of isMaster. If healthStatus.ok is false, or serverStatus times out N times in a row, the mongoS stops sending requests to that secondary. If the host is a primary, nothing happens at the mongoS level.
Every second, a KillFileWatcher thread in the mongod checks for the presence of a kill file. If the file is present, three things happen:
1. The mongod's serverStatus returns a healthStatus block with ok=false; msg=a message indicating the presence of the kill file and its contents, if any; and killfile=true.
2. The mongod marks itself as not electable by forcing ReplSetImpl::iAmPotentiallyHot to return false.
3. If the mongod is a primary in a replica set, it issues a stepdown(60) so another member can take over.
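
The kill-file check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the linked commit: the function names, the action labels, and the empty "msg" for a healthy node are all assumptions made for the example. Only the three healthStatus fields and the two consequences (not electable, step down if primary) come from the description.

```python
import os


def health_status(kill_file_path):
    """Build a healthStatus-like document from the kill file's presence."""
    if not os.path.exists(kill_file_path):
        # Assumption: the ticket does not say what "msg" holds when healthy.
        return {"ok": True, "msg": "", "killfile": False}
    # The message mentions the kill file and includes its contents, if any.
    with open(kill_file_path, "r", encoding="utf-8", errors="replace") as f:
        contents = f.read().strip()
    msg = "kill file present: " + kill_file_path
    if contents:
        msg += " (" + contents + ")"
    return {"ok": False, "msg": msg, "killfile": True}


def actions_for(status, is_primary):
    """Return the actions a node would take (labels are hypothetical)."""
    if status["ok"]:
        return []
    actions = ["mark_unelectable"]  # stands in for iAmPotentiallyHot -> false
    if is_primary:
        actions.append("step_down_60s")  # stands in for stepdown(60)
    return actions


def watch(kill_file_path, stop_event, on_status, interval_secs=1.0):
    """Poll once per interval until stop_event (a threading.Event) is set."""
    while not stop_event.is_set():
        on_status(health_status(kill_file_path))
        stop_event.wait(interval_secs)
```

Writing a short reason into the kill file (for example "maintenance") makes that reason show up in "msg", which is the behaviour item 1 describes.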

Additional details:

The health status is returned from mongod to mongos via db.adminCommand("connPoolStats"), and there is now a flag that adjusts the polling frequency of this command on mongos.
Since secondary querying is affected, we'd have to hack the driver to make this work for non-sharded clusters.
There is the case where all replicas report a failing health status. When this happens, the primary steps down and no primary is elected. We treat this as a catastrophic case that deserves attention, so we're living with it.
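
The mongoS-side routing rule (stop sending requests to a secondary when healthStatus.ok is false or after N consecutive timeouts, and leave primaries alone) could look something like the sketch below. The class and method names are hypothetical, and the recovery behaviour (a later healthy response makes the secondary routable again) is an assumption; the ticket does not say how a host is re-admitted.

```python
class SecondaryHealthTracker:
    """Tracks poll results per host and decides whether to route to it."""

    def __init__(self, max_consecutive_timeouts=3):
        # "N" from the description; the default of 3 is an arbitrary choice.
        self.max_consecutive_timeouts = max_consecutive_timeouts
        self._timeouts = {}
        self._unhealthy = set()

    def record_poll(self, host, health_status):
        """Record one poll. Pass None for health_status when the poll timed out."""
        if health_status is None:
            self._timeouts[host] = self._timeouts.get(host, 0) + 1
            if self._timeouts[host] >= self.max_consecutive_timeouts:
                self._unhealthy.add(host)
            return
        # Any response breaks the run of consecutive timeouts.
        self._timeouts[host] = 0
        if health_status["ok"]:
            # Assumption: a healthy response re-admits the host.
            self._unhealthy.discard(host)
        else:
            self._unhealthy.add(host)

    def can_route_to(self, host, is_primary):
        """Primaries are never excluded at this level; secondaries may be."""
        if is_primary:
            return True
        return host not in self._unhealthy
```

Because primaries are always routable here, a kill file on the primary only takes effect through the mongod's own stepdown, which matches the statement that nothing happens at the mongoS level for a primary.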

We think it would be generally useful to build this functionality into the mainline code. Our customizations are here:

https://github.com/foursquare/mongo/commit/6ff5bd021d98f25406b74b1cb89d276d0b403ce2

is related to: SERVER-14983 Ability to immediately mark the node as unable to service user queries (Open)
related to: SERVER-14576 mongod automatic shutdown on stdin close (Backlog)

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Jon Hoffman
Participants: [DO NOT USE] Backlog - Replication Team, Jon Hoffman, Spencer Brody
Votes: 2
Watchers: 8

Created: May 24 2012 05:30:28 PM UTC
Updated: Dec 06 2022 05:33:06 AM UTC
Resolved: Jun 14 2018 07:41:48 PM UTC
