Core Server / SERVER-32542

MongoDB member repeatedly ends up in unrecoverable rollback state

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: 3.4.9
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      We have a small three-member replica set running MongoDB 3.4.9 (a configuration sketch follows the member list):

      • DB-1. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
      • DB-2. Amazon EC2, Windows Server 2016, r4.2xlarge, 61 GB RAM, 8 vCPUs. Hosted in Germany.
      • Arbiter. Tiny cloud-based Linux instance.
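
      For illustration, here is a minimal pymongo sketch of how this layout can be read back from the set itself. The hostnames and the set name "rs" are taken from the log excerpts below, so treat them as assumptions rather than our exact values:

      # Sketch: confirm the PSA (primary / secondary / arbiter) layout by
      # reading the replica set configuration. Hostnames and the set name
      # "rs" are assumptions based on the log excerpts in this ticket.
      from pymongo import MongoClient

      client = MongoClient("mongodb://proddb1.xxxx.com:27017/?replicaSet=rs")
      config = client.admin.command("replSetGetConfig")["config"]
      for member in config["members"]:
          role = "arbiter" if member.get("arbiterOnly") else "data-bearing"
          print(member["_id"], member["host"], role)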

      DB-1 runs as the primary member, and has ended up in a ROLLBACK state roughly once per month ever since we set the replica set up. A few times the rollback has been successful, but usually it is unrecoverable and we have to manually resync the data. We have now had 3 unrecoverable rollbacks in 8 days, and have reached a point where we need to address this.

      Below are the logs from 2 of our last 3 rollbacks (grepped for REPL and a few other relevant keywords). During the Dec 28 rollback, we shut DB-1 down manually after 1.5 hours because the rollback didn't progress. During the Jan 4 rollback, the DB terminated itself. The replication lag at the time of the Dec 28 incident might have been as high as ~180 seconds, while it was only ~3 seconds during the Jan 4 incident.
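
      For reference, a rough pymongo sketch of how the lag can be estimated from replSetGetStatus (host and port are assumptions; the arbiter carries no data and is skipped):

      # Sketch: estimate replication lag by comparing each secondary's
      # optimeDate against the primary's. Host and port are assumptions.
      from pymongo import MongoClient

      client = MongoClient("proddb1.xxxx.com", 27017)
      status = client.admin.command("replSetGetStatus")

      primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
      for m in status["members"]:
          if m["stateStr"] == "SECONDARY":
              lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
              print(f'{m["name"]} is {lag:.0f} s behind the primary')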

      The rollback/ folder on DB-1 contains 436 files totalling 55 MB, but none of them are from the recent outages (they are all older than 2 months).
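      For completeness, a rough sketch (using the bson package that ships with pymongo) of how the rollback files can be inspected to confirm none of them belong to the recent incidents. The dbpath below is a placeholder, not our real path:

      # Sketch: list the rollback .bson files, count the documents they hold,
      # and show when each file was last written. The dbpath is a placeholder.
      import datetime
      import glob
      import os

      from bson import decode_file_iter  # ships with pymongo

      for path in sorted(glob.glob(r"C:\data\db\rollback\*.bson")):
          mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
          with open(path, "rb") as f:
              docs = sum(1 for _ in decode_file_iter(f))
          print(f"{path}: {docs} docs, last modified {mtime:%Y-%m-%d %H:%M}")
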

      Any idea why we'd end up in ROLLBACK to begin with, and perhaps more importantly, why the rollback almost never succeeds?

      Note: Timestamps between servers are in sync to within ±1 second.

      December 28 incident:

      DB-1:

      2017-12-28T11:07:29.988 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
      2017-12-28T11:07:29.989 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
      2017-12-28T11:07:29.989 I REPL     [replExecDBWorker-0] transition to SECONDARY
      2017-12-28T11:07:31.251 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2017-12-28T11:07:37.232 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
      2017-12-28T11:07:37.628 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1514459257000|3547, t: 129 }. source's GTE: { ts: Timestamp 1514459258000|1, t: 130 } hashes: (-2389994255667895449/3792703541455281876)
      2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] rollback 0
      2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] transition to ROLLBACK
      2017-12-28T11:07:37.640 I REPL     [rsBackgroundSync] beginning rollback
      2017-12-28T11:07:37.658 I REPL     [rsBackgroundSync] rollback 1
      2017-12-28T11:07:38.308 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2017-12-28T11:07:38.341 I REPL     [rsBackgroundSync] rollback our last optime:   Dec 28 11:07:37:ddb
      2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback their last optime: Dec 28 11:07:38:94
      2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
      2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1514459067000|8080, t: 129 }
      2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback 3 fixup
      2017-12-28T12:38:24.833 I COMMAND  [conn693343] terminating, shutdown command received
      2017-12-28T12:38:24.834 I REPL     [conn693343] shutting down replication subsystems
      2017-12-28T12:38:24.834 I REPL     [conn693343] Stopping replication reporter thread
      2017-12-28T12:38:24.834 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb2.xxxx.com:27017: CallbackCanceled: Reporter no longer valid
      2017-12-28T12:38:24.835 I REPL     [conn693343] Stopping replication fetcher thread
      2017-12-28T12:38:24.858 I REPL     [conn693343] Stopping replication applier thread
      

      DB-2:

      2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] VoteRequester(term 129 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 129, voteGranted: true, reason: "", ok: 1.0 }
      2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] VoteRequester(term 130) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 130, voteGranted: true, reason: "", ok: 1.0 }
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 130
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] transition to PRIMARY
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
      2017-12-28T11:07:29.884 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1514459257000|1022, t: 129 }
      2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
      2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
      2017-12-28T11:07:31.924 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2017-12-28T11:07:32.632 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      2017-12-28T11:07:32.831 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2017-12-28T11:07:33.373 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
      2017-12-28T11:07:38.329 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:05:11.681 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 443437739 -- target:proddb1.xxxx.com:27017 db:admin expDate:2017-12-28T12:05:11.680 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb2.xxxx.com:27017", fromId: 3, term: 130 }
      2017-12-28T12:05:13.930 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:46:23.193 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; HostUnreachable: An existing connection was forcibly closed by the remote host.
      

      Arbiter:

      2017-12-28T11:07:28.769 I COMMAND  [conn173] command local.replset.election command: replSetRequestVotes { replSetRequestVotes: 1, setName: "rs", dryRun: true, term: 129, candidateIndex: 2, configVersion: 66175, lastCommittedOp: { ts: Timestamp 1514459067000|8080, t: 129 } } numYields:0 reslen:63 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_command 156ms
      2017-12-28T11:07:30.871 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2017-12-28T11:07:33.358 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2017-12-28T11:07:38.403 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:46:22.078 I -        [conn165] end connection 158.174.52.213:50060 (3 connections now open)
      2017-12-28T12:46:24.473 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to proddb1.xxxx.com:27017
      2017-12-28T12:46:24.474 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
      2017-12-28T12:46:24.491 I ASIO     [ReplicationExecutor] Failed to close stream: Transport endpoint is not connected
      2017-12-28T12:46:24.491 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to proddb1.xxxx.com:27017
      2017-12-28T12:46:34.473 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

      January 4 incident:

      DB-1:

      2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
      2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
      2018-01-04T11:47:15.969 I REPL     [replExecDBWorker-2] transition to SECONDARY
      2018-01-04T11:47:16.551 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2018-01-04T11:47:22.565 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Received an empty batch from sync source.
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] rollback 0
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] transition to ROLLBACK
      2018-01-04T11:47:22.873 I REPL     [rsBackgroundSync] beginning rollback
      2018-01-04T11:47:22.892 I REPL     [rsBackgroundSync] rollback 1
      2018-01-04T11:47:23.102 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback our last optime:   Jan  4 11:47:23:516
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback their last optime: Jan  4 11:47:23:53
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback diff in end of log times: 0 seconds
      2018-01-04T11:47:42.169 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb-arbiter.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 1522187 -- target:proddb-arbiter.xxxx.com:27017 db:admin expDate:2018-01-04T11:47:50.168 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb1.xxxx.com:27017", fromId: 2, term: 136 }
      2018-01-04T11:47:44.259 I REPL     [ReplicationExecutor] Member proddb-arbiter.xxxx.com:27017 is now in state ARBITER
      2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1515066330000|1306, t: 135 }
      2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback 3 fixup
      2018-01-04T11:59:36.995 I REPL     [rsBackgroundSync] rollback finished
      2018-01-04T11:59:36.995 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back @ 18753
      2018-01-04T11:59:36.996 I -        [rsBackgroundSync] Fatal Assertion 28723 at src\mongo\db\repl\rs_rollback.cpp 915
      

      DB-2:

      2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] VoteRequester(term 135 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 135, voteGranted: true, reason: "", ok: 1.0 }
      2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] VoteRequester(term 136) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 136, voteGranted: true, reason: "", ok: 1.0 }
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 136
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] transition to PRIMARY
      2018-01-04T11:47:15.240 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
      2018-01-04T11:47:15.423 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1515066442000|1854, t: 135 }
      2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
      2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
      2018-01-04T11:47:17.241 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2018-01-04T11:47:17.455 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2018-01-04T11:47:18.078 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      2018-01-04T11:47:18.916 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
      2018-01-04T11:47:23.557 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2018-01-04T11:59:49.050 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

      Arbiter:

      2018-01-04T11:47:15.993 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2018-01-04T11:47:18.037 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2018-01-04T11:47:25.998 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2018-01-04T11:59:47.341 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

            Assignee: Judah Schvimer
            Reporter: Max Bennedich
            Votes: 0
            Watchers: 8
