Core Server / SERVER-13573

Retry rollback FindCommonPoint before failing (and fasserting)

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.4.10, 2.6.0
    • Component/s: Replication

      We observed in production an example of a replica set node going into a FATAL state as a result of a failed oplog query against an inaccessible primary node during the rollback 2 FindCommonPoint phase.

      There may be other replication failure paths that result in a FATAL state, but this is one instance we have observed in production.

      FATAL node logs:

      [rsBackgroundSync] replSet rollback 2 FindCommonPoint
      [rsBackgroundSync] DBClientCursor::init call() failed
      [rsBackgroundSync] replSet remote oplog empty or unreadable
      [rsBackgroundSync] replSet error fatal, stopping replication
      

      Primary replica set node relinquishing its PRIMARY status:

      [rsMgr] replSet relinquishing primary state
      [rsMgr] replSet SECONDARY
      [rsMgr] replSet closing client sockets after relinquishing primary
      (fatal node tries unsuccessfully to query this node's oplog while primary is closing client connections)
      

      Health Poll logs on non-FATAL node in same replica set:

      [rsHealthPoll] replSet member (fatal node hostname:port) is now in state FATAL
      

      If there is a way to handle this case more gracefully, it might be possible to avoid going into a FATAL state.
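
      One graceful alternative, per the ticket title, is to retry the FindCommonPoint oplog query a bounded number of times before falling back to the fatal path. A minimal sketch of that idea follows; all names here (`queryRemoteOplog`, `findCommonPointWithRetry`, `kMaxAttempts`) are illustrative placeholders, not actual server symbols:

      ```cpp
      #include <chrono>
      #include <iostream>
      #include <thread>

      // Stand-in for the remote oplog query; simulates transient failures
      // while the former primary is closing client sockets during stepdown
      // (the first two attempts fail, the third succeeds).
      bool queryRemoteOplog(int attempt) {
          return attempt >= 3;
      }

      // Returns true if a common point was found within kMaxAttempts tries.
      // Only after exhausting all retries would the caller take the
      // "error fatal, stopping replication" path.
      bool findCommonPointWithRetry() {
          const int kMaxAttempts = 5;
          for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
              if (queryRemoteOplog(attempt)) {
                  std::cout << "replSet rollback: common point found on attempt "
                            << attempt << "\n";
                  return true;
              }
              std::cerr << "replSet rollback: oplog query failed (attempt "
                        << attempt << "), retrying\n";
              // Brief backoff so a stepping-down primary has time to finish
              // closing sockets and another node can be elected or reconnected.
              std::this_thread::sleep_for(std::chrono::milliseconds(10));
          }
          return false;
      }
      ```

      For a transient condition like the stepdown window shown in the logs above, a bounded retry with backoff would ride out the closed connections instead of immediately declaring the node FATAL.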

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: Benety Goh (benety.goh@mongodb.com)
            Votes: 1
            Watchers: 8
