- Type: Bug
- Resolution: Done
- Priority: Trivial - P5
- Affects Version/s: 2.0.4
- Component/s: Replication
- None
- Environment: Ubuntu 10.04 in Amazon EC2
We have been testing replica set reliability under a few different failure scenarios. One scenario that failed involved misconfigured network routing to a mongod primary: we blocked all inbound traffic to port 27017 on that node but still allowed it to make outbound connections. The replica set was a 3-node set in which the primary (node A) had a higher priority than the other two members (node B and node C).
When we blocked inbound traffic to port 27017 on node A, node B assumed the primary role, as expected. However, node A then made an outbound connection to node B and, because A had the higher priority, told B to step down as primary, which B did. Since neither B nor C could open a connection to node A, they both eventually voted for node B to become primary again. A then connected to B again, and the whole cycle repeated indefinitely.
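The cycle described above can be reproduced in a toy model. The following is only a sketch for illustration and is not MongoDB's election code: the priority values, the `can_connect` rule, and the simplified voting rule in `votes_for` are all assumptions chosen to mirror the scenario in this report.

```python
# Toy model of the flapping cycle. Everything here is hypothetical and
# simplified; it is not how mongod actually implements elections.
MEMBERS = ("A", "B", "C")
PRIORITY = {"A": 2, "B": 1, "C": 1}  # assumed values; the report only says A is higher
MAJORITY = len(MEMBERS) // 2 + 1


def can_connect(src, dst):
    """Inbound traffic to A is blocked; every other connection works."""
    return dst != "A"


def votes_for(candidate):
    """Count members that would back `candidate` under a simplified rule:
    a voter backs the candidate unless the voter is itself, or can reach,
    a higher-priority member."""
    count = 0
    for voter in MEMBERS:
        better = [
            m for m in MEMBERS
            if PRIORITY[m] > PRIORITY[candidate]
            and (m == voter or can_connect(voter, m))
        ]
        if not better and (voter == candidate or can_connect(voter, candidate)):
            count += 1
    return count


def simulate(rounds):
    """Return the sequence of events over `rounds` election cycles."""
    events = []
    primary = None
    for _ in range(rounds):
        # B and C cannot reach A, so together they form a majority for B.
        if primary is None and votes_for("B") >= MAJORITY:
            primary = "B"
            events.append("B elected primary")
        # A can still dial out, sees a lower-priority primary, and
        # asks it to step down, which leaves the set without a primary.
        if (primary is not None and can_connect("A", primary)
                and PRIORITY["A"] > PRIORITY[primary]):
            events.append(f"A tells {primary} to step down")
            primary = None
    return events


if __name__ == "__main__":
    for event in simulate(3):
        print(event)
```

In this model the set never settles: each round ends with no primary, so the next round starts another election.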
This is not at all a typical failure scenario, but I think node A should not have been able to tell B to step down as primary in this situation.
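One possible shape for such a restriction is sketched below. It assumes, hypothetically, that each member can report whether it is able to open a connection to the node requesting the step-down. The function name `may_request_stepdown` and the `inbound_ok` mapping are made up for this example and do not correspond to any MongoDB API or to the fix that was eventually shipped.

```python
def may_request_stepdown(requester, primary, priority, inbound_ok):
    """Decide whether `requester` should ask `primary` to step down.

    priority   -- dict of member name -> priority (hypothetical values)
    inbound_ok -- dict of member name -> True if that member reports it
                  can connect TO the requester (an assumed signal)
    """
    # Only a strictly higher-priority member has a reason to ask.
    if priority[requester] <= priority[primary]:
        return False
    # Count the requester itself plus every member that can reach it.
    reachable_by = 1 + sum(
        1 for member, ok in inbound_ok.items() if member != requester and ok
    )
    majority = len(priority) // 2 + 1
    # If a majority cannot reach the requester, it could not win or hold
    # the primary role, so forcing a step-down would only cause flapping.
    return reachable_by >= majority


if __name__ == "__main__":
    priority = {"A": 2, "B": 1, "C": 1}
    # Scenario from this report: nobody can connect to A.
    print(may_request_stepdown("A", "B", priority, {"B": False, "C": False}))
    # Healthy network: both B and C can connect to A.
    print(may_request_stepdown("A", "B", priority, {"B": True, "C": True}))
```

With the connectivity from this report, the check would refuse the step-down request, and B would remain primary until A becomes reachable again.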
Here are the relevant log entries from node A:
Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] trying reconnect to graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] reconnect graphdb-4-2.strcst.net:27017 ok
Mon Jun 4 15:23:31 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state PRIMARY
Mon Jun 4 15:23:32 [rsSync] replSet syncing to: graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsMgr] stepping down graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: graphdb-4-2.strcst.net:27017
Mon Jun 4 15:23:37 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state SECONDARY
And here are the corresponding log entries from node B:
Mon Jun 4 15:23:25 [initandlisten] connection accepted from 10.209.29.204:56081 #419426
Mon Jun 4 15:23:30 [rsMgr] replSet info electSelf 1
Mon Jun 4 15:23:30 [rsMgr] replSet PRIMARY
Mon Jun 4 15:23:37 [conn419426] replSet info stepping down as primary secs=1
Mon Jun 4 15:23:37 [conn419426] replSet relinquishing primary state
Mon Jun 4 15:23:37 [conn419426] replSet SECONDARY
Mon Jun 4 15:23:37 [conn419426] replSet closing client sockets after reqlinquishing primary
- related to
- SERVER-1929 handle replica set flapping
- Closed