Core Server / SERVER-6300

Node 2 is master but attempts to replicate from Node 3

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.0.6
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      I have been running some replication tests and found something quite puzzling regarding failovers.

      Looking at a packet capture off the bridge interface (attached) and log files (snippets below): after the network connection of Node 1 (which was master; its last successful communications are around packet 2414) failed, Node 2 became master (packet 2566 appears to be the first heartbeat with Node 2 as master). The client eventually timed out, as expected. When the client attempted to perform the next update on Node 2 (packets 3271 and 3273), the write timed out. Node 2 appears to think it is master, but instead of Node 3 replicating from Node 2, Node 2 is asking for replication updates from Node 3 (e.g. packet 3288). I think this is a bug.
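
      (For reference, a rough sketch of how to ask each node which member it is syncing from, via the replSetGetStatus admin command. This assumes a modern pymongo; the syncingTo field is not reported by every server version and state and may be absent on 2.0.x, in which case '?' is printed. Hostnames are from the logs below.)

      # Poll replSetGetStatus on each node to see who claims to be PRIMARY
      # and, where reported, which sync source each member is using.
      from pymongo import MongoClient

      NODES = ["test1:27017", "test2:27017", "test3:27017"]

      for node in NODES:
          host, port = node.split(":")
          try:
              client = MongoClient(host, int(port), directConnection=True,
                                   serverSelectionTimeoutMS=2000)
              status = client.admin.command("replSetGetStatus")
          except Exception as exc:
              print(f"{node}: unreachable ({exc})")
              continue
          for member in status["members"]:
              # 'syncingTo' may be missing on older servers; print '?' then
              print(f"as seen from {node}: {member['name']} "
                    f"state={member.get('stateStr', '?')} "
                    f"syncingTo={member.get('syncingTo', '?')}")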

      The replication setup is a standard 3-node replica set. The nodes run in separate VirtualBox VMs with bridged networking. The VMs' network interfaces are connected to TAP interfaces on the host, which are joined together with a bridge device. In the test in question, a single client running on the host updates numbers in the collection 'test' on the cluster, one at a time. A failure is introduced by killing the network connection of one of the VMs: packet loss on its TAP interface is set to 100% with netem.
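
      (For concreteness, roughly what the client loop and failure injection look like; a minimal sketch, not the exact test script. Assumptions: a modern pymongo, passwordless sudo for tc, and the interface name 'tap1' and database name 'test' are placeholders. The replica set name "test" is visible in the heartbeat document below, and the collection 'test' is as described above.)

      # Update a number on the cluster once per second; inject the failure
      # by setting 100% packet loss on a VM's TAP interface with netem.
      import subprocess
      import time
      from pymongo import MongoClient
      from pymongo.errors import PyMongoError

      client = MongoClient("mongodb://test1:27017,test2:27017,test3:27017"
                           "/?replicaSet=test")
      coll = client.test.test  # database name is a guess; collection is 'test'

      def kill_network(tap_iface):
          # Equivalent to: sudo tc qdisc add dev <tap> root netem loss 100%
          # (invoked manually mid-test, e.g. kill_network("tap1"))
          subprocess.check_call(["sudo", "tc", "qdisc", "add", "dev",
                                 tap_iface, "root", "netem", "loss", "100%"])

      n = 0
      while True:
          try:
              coll.update_one({"_id": 1}, {"$set": {"n": n}}, upsert=True)
              n += 1
          except PyMongoError as exc:
              print(f"update {n} failed: {exc}")  # expected around failover
          time.sleep(1)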

      Snippet from mongodb.log of Node 2 when it became master:
      Wed Jul 4 16:44:37 [rsHealthPoll] DBClientCursor::init call() failed
      Wed Jul 4 16:44:37 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): DBClientBase::findN: transport error: test1:27017 query: { replSetHeartbeat: "test", v: 1, pv: 1, checkEmpty: false, from: "test2:27017" }
      Wed Jul 4 16:44:37 [rsHealthPoll] replSet member test1:27017 is now in state DOWN
      Wed Jul 4 16:44:37 [rsMgr] replSet info electSelf 2
      Wed Jul 4 16:44:37 [rsMgr] replSet PRIMARY

      Snippet from mongodb.log of Node 3 when it formed a new cluster with Node 2:
      Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)
      Wed Jul 4 16:44:37 [rsHealthPoll] couldn't connect to test1:27017: couldn't connect to server test1:27017
      Wed Jul 4 16:44:38 [rsHealthPoll] replSet member test2:27017 is now in state PRIMARY
      Wed Jul 4 16:44:38 [rsMgr] replSet info two primaries (transiently)
      Wed Jul 4 16:44:44 [rsMgr] replSet info two primaries (transiently)
      Wed Jul 4 16:44:47 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): socket exception
      Wed Jul 4 16:44:47 [rsHealthPoll] replSet member test1:27017 is now in state DOWN

        1. br0.pcap (675 kB, attached by Tuure Laurinolli)

            Assignee: Kristina Chodorow (Inactive)
            Reporter: Tuure Laurinolli
            Votes: 0
            Watchers: 2
