Core Server / SERVER-9765

Two primaries should cause the earlier-elected one to step down

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.6.0-rc0, 2.7.0
    • Affects Version/s: 2.4.3
    • Component/s: Replication
    • Environment:
      AMD Turion(tm) II Dual-Core Mobile M540, 4GB RAM, Ubuntu 13.04, kernel 3.5.0-28-generic
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Steps to Reproduce:
      1. Create a replica set with 3 nodes (1 primary, 2 secondaries).
      2. Insert some data to advance the optime.
      3. On the primary, simulate a network problem by blocking connections on the firewall:
        iptables -I INPUT -p tcp --dport 27017 -j DROP
      4. Wait for a new primary to be elected.
      5. Insert some new data.
      6. On the old primary, simulate the network coming back up by allowing connections on the firewall:
        iptables -I INPUT -p tcp --dport 27017 -j ACCEPT
      7. Wait until the election settles and the optime is synced on all nodes.
      8. Check whether the data from step 5 is available.
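      Steps 4 and 7 above wait for the set to converge. A minimal, hypothetical helper (the function names are my own, not from any driver) that decides convergence from the `members` array of `rs.status()`, treating an optime as the ordered pair `(t, i)`:

```python
def optime_key(optime):
    # Optimes order first by timestamp seconds "t", then by increment "i".
    return (optime["t"], optime["i"])

def all_synced(members):
    # The set has converged once every member reports the same optime
    # and exactly one member is primary.
    optimes = {optime_key(m["optime"]) for m in members}
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    return len(optimes) == 1 and len(primaries) == 1
```

      One could poll `rs.status()` and call `all_synced(status["members"])` until it returns True before proceeding to the next step.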

      Hello,
      I hope this is not a duplicate issue.

      Let's assume there are 3 nodes in a replica set (#1, #2, #3).

      > rs.status()
      {
              "set" : "set0",
              "date" : ISODate("2013-05-23T12:48:07Z"),
              "myState" : 2,
              "syncingTo" : "krogon:27001",
              "members" : [
                      {
                              "_id" : 0,
                              "name" : "krogon:27001",
                              "health" : 1,
                              "state" : 1,
                              "stateStr" : "PRIMARY",
                              "uptime" : 29,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:48:06Z"),
                              "lastHeartbeatRecv" : ISODate("2013-05-23T12:48:07Z"),
                              "pingMs" : 0
                      },
                      {
                              "_id" : 1,
                              "name" : "krogon:27002",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 29,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:48:06Z"),
                              "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                              "pingMs" : 0,
                              "lastHeartbeatMessage" : "syncing to: krogon:27001",
                              "syncingTo" : "krogon:27001"
                      },
                      {
                              "_id" : 2,
                              "name" : "krogon:27003",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 31,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "errmsg" : "syncing to: krogon:27001",
                              "self" : true
                      }
              ],
              "ok" : 1
      }
      

      When there is a network connectivity problem reaching the primary node (#1), an election takes place and a new primary (#3) is designated. Inserting new data then advances the optime on #2 and #3. The old primary is alive but cannot communicate with the other nodes.

      > rs.status()
      {
              "set" : "set0",
              "date" : ISODate("2013-05-23T12:55:44Z"),
              "myState" : 1,
              "members" : [
                      {
                              "_id" : 0,
                              "name" : "krogon:27001",
                              "health" : 0,
                              "state" : 8,
                              "stateStr" : "(not reachable/healthy)",
                              "uptime" : 0,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:55:13Z"),
                              "lastHeartbeatRecv" : ISODate("2013-05-23T12:55:43Z"),
                              "pingMs" : 0
                      },
                      {
                              "_id" : 1,
                              "name" : "krogon:27002",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 486,
                              "optime" : {
                                      "t" : 1369313742,
                                      "i" : 1
                              },
                              "optimeDate" : ISODate("2013-05-23T12:55:42Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:55:43Z"),
                              "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                              "pingMs" : 0,
                              "syncingTo" : "krogon:27003"
                      },
                      {
                              "_id" : 2,
                              "name" : "krogon:27003",
                              "health" : 1,
                              "state" : 1,
                              "stateStr" : "PRIMARY",
                              "uptime" : 488,
                              "optime" : {
                                      "t" : 1369313742,
                                      "i" : 1
                              },
                              "optimeDate" : ISODate("2013-05-23T12:55:42Z"),
                              "self" : true
                      }
              ],
              "ok" : 1
      }
      

      Log from #1:

      [rsHealthPoll] replset info krogon:27002 thinks that we are down
      [rsHealthPoll] replset info krogon:27003 thinks that we are down
      

      Log from #2:

      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [rsHealthPoll] replSet info krogon:27001 is down (or slow to respond): 
      [rsHealthPoll] replSet member krogon:27001 is now in state DOWN
      [rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
      [rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
      [rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
      [rsBackgroundSync] Socket recv() timeout  127.0.1.1:27001
      [rsBackgroundSync] SocketException: remote: 127.0.1.1:27001 error: 9001 socket exception [3] server [127.0.1.1:27001] 
      [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: krogon:27001
      [conn9] replSet info voting yea for krogon:27003 (2)
      [rsMgr] replSet not trying to elect self as responded yea to someone else recently
      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [rsMgr] not electing self, krogon:27003 would veto with 'I am already primary, krogon:27002 can try again once I've stepped down'
      [rsHealthPoll] replSet member krogon:27003 is now in state PRIMARY
      [rsBackgroundSync] replSet syncing to: krogon:27003
      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [rsSyncNotifier] replset setting oplog notifier to krogon:27003
      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
      

      Logs from #3:

      [rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
      [rsHealthPoll] replSet info krogon:27001 is down (or slow to respond): 
      [rsHealthPoll] replSet member krogon:27001 is now in state DOWN
      [rsBackgroundSync] Socket recv() timeout  127.0.1.1:27001
      [rsBackgroundSync] SocketException: remote: 127.0.1.1:27001 error: 9001 socket exception [3] server [127.0.1.1:27001] 
      [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: krogon:27001
      [rsMgr] replSet info electSelf 2
      [rsMgr] replSet PRIMARY
      [rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
      

      Now the connectivity is back and another election takes place. The old primary (#1) should roll back the data that was not replicated and become a secondary. After becoming a secondary, it should fetch the data from the new primary (#3), which has the higher optime.

      However, during the election the old primary (#1) becomes primary again. The new primary (#3), after switching to secondary, syncs from the old primary (#1) with the older optime, which causes data loss.
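      The ticket title states the expected rule: with two simultaneous primaries, the earlier-elected (staler) one should step down. A minimal sketch of that rule (my own illustration, not the actual election code), again ordering optimes as `(t, i)`:

```python
def should_step_down(a, b):
    # Of two conflicting primaries, the one with the older optime should
    # step down; return that member's name. Each argument is a member
    # document carrying "name" and "optime" as in rs.status().
    key = lambda m: (m["optime"]["t"], m["optime"]["i"])
    return min((a, b), key=key)["name"]

# With the optimes from this report, #1 is stale and should step down:
old_primary = {"name": "krogon:27001", "optime": {"t": 1369313180, "i": 2}}
new_primary = {"name": "krogon:27003", "optime": {"t": 1369313742, "i": 1}}
print(should_step_down(old_primary, new_primary))  # krogon:27001
```

      The observed behavior is the opposite: #1 keeps its primary role and #3 relinquishes, despite #3 being strictly fresher.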

      > rs.status()
      {
              "set" : "set0",
              "date" : ISODate("2013-05-23T12:59:37Z"),
              "myState" : 2,
              "syncingTo" : "krogon:27001",
              "members" : [
                      {
                              "_id" : 0,
                              "name" : "krogon:27001",
                              "health" : 1,
                              "state" : 1,
                              "stateStr" : "PRIMARY",
                              "uptime" : 59,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:59:36Z"),
                              "lastHeartbeatRecv" : ISODate("2013-05-23T12:59:37Z"),
                              "pingMs" : 0
                      },
                      {
                              "_id" : 1,
                              "name" : "krogon:27002",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 719,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "lastHeartbeat" : ISODate("2013-05-23T12:59:37Z"),
                              "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
                              "pingMs" : 0,
                              "syncingTo" : "krogon:27001"
                      },
                      {
                              "_id" : 2,
                              "name" : "krogon:27003",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 721,
                              "optime" : {
                                      "t" : 1369313180,
                                      "i" : 2
                              },
                              "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
                              "errmsg" : "syncing to: krogon:27001",
                              "self" : true
                      }
              ],
              "ok" : 1
      }
      

      Logs from #3:

      [rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
      [rsHealthPoll] replSet member krogon:27001 is up
      [rsHealthPoll] replSet member krogon:27001 is now in state PRIMARY
      [rsMgr] replSet relinquishing primary state
      [rsMgr] replSet SECONDARY
      [rsMgr] replSet closing client sockets after relinquishing primary
      [rsBackgroundSync] replSet syncing to: krogon:27001
      [rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
      [conn14] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:53226] 
      [rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
      [rsBackgroundSync] replSet rollback 0
      [rsBackgroundSync] replSet ROLLBACK
      [rsBackgroundSync] replSet rollback 1
      [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      [rsBackgroundSync] replSet info rollback our last optime:   May 23 14:56:36:1
      [rsBackgroundSync] replSet info rollback their last optime: May 23 14:46:20:2
      [rsBackgroundSync] replSet info rollback diff in end of log times: 616 seconds
      [rsBackgroundSync] replSet rollback found matching events at May 23 14:46:20:2
      [rsBackgroundSync] replSet rollback findcommonpoint scanned : 142
      [rsBackgroundSync] replSet replSet rollback 3 fixup
      [rsBackgroundSync] replSet rollback 3.5
      [rsBackgroundSync] replSet rollback 4 n:100
      [rsBackgroundSync] replSet minvalid=May 23 14:46:20 519e0f9c:2
      [rsBackgroundSync] replSet rollback 4.6
      [rsBackgroundSync] replSet rollback 4.7
      [rsBackgroundSync] replSet rollback 5 d:0 u:100
      [rsBackgroundSync] replSet rollback 6
      [rsBackgroundSync] replSet rollback 7
      [rsBackgroundSync] replSet rollback done
      [rsBackgroundSync] replSet RECOVERING
      [rsBackgroundSync] replSet syncing to: krogon:27001
      [rsSync] replSet SECONDARY
      [rsBackgroundSync] replSet syncing to: krogon:27001
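      As a sanity check, the rollback log's "diff in end of log times: 616 seconds" is consistent with the two optimes it prints (14:56:36 vs 14:46:20, local time):

```python
from datetime import datetime

our_last = datetime(2013, 5, 23, 14, 56, 36)    # "our last optime"   (node #3)
their_last = datetime(2013, 5, 23, 14, 46, 20)  # "their last optime" (node #1)

# Difference between the two ends of the oplogs, in whole seconds.
diff = int((our_last - their_last).total_seconds())
print(diff)  # 616, matching the log line
```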
      

      Attachments:
        1. node1.log (54 kB)
        2. node2.log (12 kB)
        3. node3.log (11 kB)

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: krogon_tsi
            Votes: 0
            Watchers: 13