Core Server / SERVER-6037

RS members should report on heartbeat if they cannot reach the node hb-ing them

    • Type: Bug
    • Resolution: Done
    • Priority: Trivial - P5
    • Fix Version/s: 2.3.2
    • Affects Version/s: 2.0.4
    • Component/s: Replication
    • Labels: None
    • Environment:
      Ubuntu 10.04 in Amazon EC2

      We have been testing replica set reliability under a few different failure scenarios. One scenario that failed was a misconfigured network route to a mongod primary: we blocked all inbound traffic to port 27017 on that node, but still allowed it to make outbound connections. The replica set was a 3-node set where the primary (node A) had a higher priority than the other two (node B and node C).

      When we blocked port 27017 on node A, node B assumed the primary role, as expected. However, node A then made an outbound connection to node B and, because A had the higher priority, told B to step down as primary, which it did. Since neither B nor C could connect to node A, they both eventually voted for node B to become primary again. A then connected to B again, and the whole process repeated indefinitely.

      This is not at all a typical failure scenario, but I think node A should not have been able to tell B to step down as primary in this situation.
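
The flapping described above, and the check suggested in the issue title, can be illustrated with a toy model. This is only a sketch and not MongoDB's election code: `Node`, `should_step_down`, `simulate`, and the `check_reverse` flag are hypothetical names, and the priorities 2 and 1 are assumed values (the report only says that A's priority was higher).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    priority: int
    # Names of the peers this node can open a connection to.
    can_reach: set = field(default_factory=set)


def should_step_down(requester: Node, primary: Node, check_reverse: bool) -> bool:
    """Decide whether `primary` yields to a step-down request from `requester`."""
    if requester.priority <= primary.priority:
        return False
    if primary.name not in requester.can_reach:
        return False  # the requester cannot even deliver the request
    if check_reverse and requester.name not in primary.can_reach:
        # Hypothetical rule: the primary cannot reach the requester, so it refuses.
        return False
    return True


def simulate(rounds: int, check_reverse: bool) -> list:
    # A can connect outbound to B and C; inbound traffic to A is blocked,
    # so B (and C, which is not modeled as an object here) cannot reach A.
    a = Node("A", priority=2, can_reach={"B", "C"})
    b = Node("B", priority=1, can_reach={"C"})
    events = []
    primary = None
    for _ in range(rounds):
        if primary is None:
            # B and C see each other but not A, so they elect B.
            primary = b
            events.append("B elected")
        if should_step_down(a, primary, check_reverse):
            events.append("B stepped down")
            primary = None
    return events


print(simulate(2, check_reverse=False))
print(simulate(4, check_reverse=True))
```

Without the reverse-reachability check, the model alternates between electing B and stepping it down on every round; with the check, B is elected once and stays primary.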

      Here are the relevant log entries from node A:
      Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] trying reconnect to graphdb-4-2.strcst.net:27017
      Mon Jun 4 15:23:29 [ReplicaSetMonitorWatcher] reconnect graphdb-4-2.strcst.net:27017 ok
      Mon Jun 4 15:23:31 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state PRIMARY
      Mon Jun 4 15:23:32 [rsSync] replSet syncing to: graphdb-4-2.strcst.net:27017
      Mon Jun 4 15:23:37 [rsMgr] stepping down graphdb-4-2.strcst.net:27017
      Mon Jun 4 15:23:37 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: graphdb-4-2.strcst.net:27017
      Mon Jun 4 15:23:37 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state SECONDARY

      And here are the corresponding log entries from node B:
      Mon Jun 4 15:23:25 [initandlisten] connection accepted from 10.209.29.204:56081 #419426
      Mon Jun 4 15:23:30 [rsMgr] replSet info electSelf 1
      Mon Jun 4 15:23:30 [rsMgr] replSet PRIMARY
      Mon Jun 4 15:23:37 [conn419426] replSet info stepping down as primary secs=1
      Mon Jun 4 15:23:37 [conn419426] replSet relinquishing primary state
      Mon Jun 4 15:23:37 [conn419426] replSet SECONDARY
      Mon Jun 4 15:23:37 [conn419426] replSet closing client sockets after reqlinquishing primary
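
To read the two excerpts as a single sequence, the entries can be interleaved by timestamp. The sketch below uses a subset of the log lines quoted above; `parse` and `merge_timeline` are hypothetical helper names, and the sort relies on the assumption that all entries fall on the same day, so the clock time alone orders them.

```python
NODE_A_LOG = """\
Mon Jun 4 15:23:31 [rsHealthPoll] replSet member graphdb-4-2.strcst.net:27017 is now in state PRIMARY
Mon Jun 4 15:23:37 [rsMgr] stepping down graphdb-4-2.strcst.net:27017
"""

NODE_B_LOG = """\
Mon Jun 4 15:23:25 [initandlisten] connection accepted from 10.209.29.204:56081 #419426
Mon Jun 4 15:23:30 [rsMgr] replSet PRIMARY
Mon Jun 4 15:23:37 [conn419426] replSet info stepping down as primary secs=1
"""


def parse(node, text):
    """Split each line into (clock, node, message); the date fields are dropped."""
    entries = []
    for line in text.strip().splitlines():
        _weekday, _month, _day, clock, message = line.split(" ", 4)
        entries.append((clock, node, message))
    return entries


def merge_timeline(*logs):
    """Interleave parsed logs by clock time; ties keep their input order."""
    merged = [entry for log in logs for entry in log]
    return sorted(merged, key=lambda entry: entry[0])


timeline = merge_timeline(parse("A", NODE_A_LOG), parse("B", NODE_B_LOG))
for clock, node, message in timeline:
    print(clock, node, message)
```

In the merged view, B becomes primary at 15:23:30, A observes that at 15:23:31, and both nodes log the step-down at 15:23:37.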

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Mike Hobbs (mhobbs)
            Votes: 1
            Watchers: 6
