Core Server / SERVER-32542

MongoDB member repeatedly ends up in unrecoverable rollback state

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: 3.4.9
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      We have a small three-member replica set running MongoDB 3.4.9 (a configuration sketch follows the member list):

      • DB-1. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
      • DB-2. Amazon EC2, Windows Server 2016, r4.2xlarge, 61 GB RAM, 8 vCPUs. Hosted in Germany.
      • Arbiter. Tiny cloud-based Linux instance.
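
      For illustration, here is a minimal pymongo sketch of how this layout can be read back from the set itself. The hostnames and the set name "rs" are taken from the log excerpts below, so treat them as assumptions rather than our exact values:

      # Sketch: confirm the PSA (primary / secondary / arbiter) layout by
      # reading the replica set configuration. Hostnames and the set name
      # "rs" are assumptions based on the log excerpts in this ticket.
      from pymongo import MongoClient

      client = MongoClient("mongodb://proddb1.xxxx.com:27017/?replicaSet=rs")
      config = client.admin.command("replSetGetConfig")["config"]
      for member in config["members"]:
          role = "arbiter" if member.get("arbiterOnly") else "data-bearing"
          print(member["_id"], member["host"], role)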

      DB-1 runs as the primary member, and has ended up in a ROLLBACK state roughly once per month ever since we set the replica set up. A few times the rollback has been successful, but usually it is unrecoverable and we have to manually resync the data. We have now had 3 unrecoverable rollbacks in 8 days, and have reached a point where we need to address this.

      Below are the logs from 2 of our last 3 rollbacks (grepped for REPL and a few other relevant keywords). During the Dec 28 rollback, we shut DB-1 down manually after 1.5 hours because the rollback didn't progress. During the Jan 4 rollback, the DB terminated itself. The replication lag at the time of the Dec 28 incident might have been as high as ~180 seconds, while it was only ~3 seconds during the Jan 4 incident.
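
      For reference, a rough pymongo sketch of how the lag can be estimated from replSetGetStatus (host and port are assumptions; the arbiter carries no data and is skipped):

      # Sketch: estimate replication lag by comparing each secondary's
      # optimeDate against the primary's. Host and port are assumptions.
      from pymongo import MongoClient

      client = MongoClient("proddb1.xxxx.com", 27017)
      status = client.admin.command("replSetGetStatus")

      primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
      for m in status["members"]:
          if m["stateStr"] == "SECONDARY":
              lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
              print(f'{m["name"]} is {lag:.0f} s behind the primary')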

      The rollback/ folder on DB-1 contains 436 files totalling 55 MB, but none of them are from the recent outages (they are all older than 2 months).
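      For completeness, a rough sketch (using the bson package that ships with pymongo) of how the rollback files can be inspected to confirm none of them belong to the recent incidents. The dbpath below is a placeholder, not our real path:

      # Sketch: list the rollback .bson files, count the documents they hold,
      # and show when each file was last written. The dbpath is a placeholder.
      import datetime
      import glob
      import os

      from bson import decode_file_iter  # ships with pymongo

      for path in sorted(glob.glob(r"C:\data\db\rollback\*.bson")):
          mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
          with open(path, "rb") as f:
              docs = sum(1 for _ in decode_file_iter(f))
          print(f"{path}: {docs} docs, last modified {mtime:%Y-%m-%d %H:%M}")
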

      Any idea why we'd end up in ROLLBACK to begin with, and perhaps more importantly, why the rollback almost never succeeds?

      Note: Timestamps between servers are in sync to within ±1 second.

      December 28 incident:

      DB-1:

      2017-12-28T11:07:29.988 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
      2017-12-28T11:07:29.989 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
      2017-12-28T11:07:29.989 I REPL     [replExecDBWorker-0] transition to SECONDARY
      2017-12-28T11:07:31.251 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2017-12-28T11:07:37.232 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
      2017-12-28T11:07:37.628 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1514459257000|3547, t: 129 }. source's GTE: { ts: Timestamp 1514459258000|1, t: 130 } hashes: (-2389994255667895449/3792703541455281876)
      2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] rollback 0
      2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] transition to ROLLBACK
      2017-12-28T11:07:37.640 I REPL     [rsBackgroundSync] beginning rollback
      2017-12-28T11:07:37.658 I REPL     [rsBackgroundSync] rollback 1
      2017-12-28T11:07:38.308 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2017-12-28T11:07:38.341 I REPL     [rsBackgroundSync] rollback our last optime:   Dec 28 11:07:37:ddb
      2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback their last optime: Dec 28 11:07:38:94
      2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
      2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1514459067000|8080, t: 129 }
      2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback 3 fixup
      2017-12-28T12:38:24.833 I COMMAND  [conn693343] terminating, shutdown command received
      2017-12-28T12:38:24.834 I REPL     [conn693343] shutting down replication subsystems
      2017-12-28T12:38:24.834 I REPL     [conn693343] Stopping replication reporter thread
      2017-12-28T12:38:24.834 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb2.xxxx.com:27017: CallbackCanceled: Reporter no longer valid
      2017-12-28T12:38:24.835 I REPL     [conn693343] Stopping replication fetcher thread
      2017-12-28T12:38:24.858 I REPL     [conn693343] Stopping replication applier thread
      

      DB-2:

      2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] VoteRequester(term 129 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 129, voteGranted: true, reason: "", ok: 1.0 }
      2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] VoteRequester(term 130) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 130, voteGranted: true, reason: "", ok: 1.0 }
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 130
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] transition to PRIMARY
      2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
      2017-12-28T11:07:29.884 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1514459257000|1022, t: 129 }
      2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
      2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
      2017-12-28T11:07:31.924 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2017-12-28T11:07:32.632 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      2017-12-28T11:07:32.831 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2017-12-28T11:07:33.373 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
      2017-12-28T11:07:38.329 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:05:11.681 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 443437739 -- target:proddb1.xxxx.com:27017 db:admin expDate:2017-12-28T12:05:11.680 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb2.xxxx.com:27017", fromId: 3, term: 130 }
      2017-12-28T12:05:13.930 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:46:23.193 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; HostUnreachable: An existing connection was forcibly closed by the remote host.
      

      Arbiter:

      2017-12-28T11:07:28.769 I COMMAND  [conn173] command local.replset.election command: replSetRequestVotes { replSetRequestVotes: 1, setName: "rs", dryRun: true, term: 129, candidateIndex: 2, configVersion: 66175, lastCommittedOp: { ts: Timestamp 1514459067000|8080, t: 129 } } numYields:0 reslen:63 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_command 156ms
      2017-12-28T11:07:30.871 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2017-12-28T11:07:33.358 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2017-12-28T11:07:38.403 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2017-12-28T12:46:22.078 I -        [conn165] end connection 158.174.52.213:50060 (3 connections now open)
      2017-12-28T12:46:24.473 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to proddb1.xxxx.com:27017
      2017-12-28T12:46:24.474 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
      2017-12-28T12:46:24.491 I ASIO     [ReplicationExecutor] Failed to close stream: Transport endpoint is not connected
      2017-12-28T12:46:24.491 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to proddb1.xxxx.com:27017
      2017-12-28T12:46:34.473 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

      January 4 incident:

      DB-1:

      2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
      2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
      2018-01-04T11:47:15.969 I REPL     [replExecDBWorker-2] transition to SECONDARY
      2018-01-04T11:47:16.551 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2018-01-04T11:47:22.565 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Received an empty batch from sync source.
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] rollback 0
      2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] transition to ROLLBACK
      2018-01-04T11:47:22.873 I REPL     [rsBackgroundSync] beginning rollback
      2018-01-04T11:47:22.892 I REPL     [rsBackgroundSync] rollback 1
      2018-01-04T11:47:23.102 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback our last optime:   Jan  4 11:47:23:516
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback their last optime: Jan  4 11:47:23:53
      2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback diff in end of log times: 0 seconds
      2018-01-04T11:47:42.169 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb-arbiter.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 1522187 -- target:proddb-arbiter.xxxx.com:27017 db:admin expDate:2018-01-04T11:47:50.168 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb1.xxxx.com:27017", fromId: 2, term: 136 }
      2018-01-04T11:47:44.259 I REPL     [ReplicationExecutor] Member proddb-arbiter.xxxx.com:27017 is now in state ARBITER
      2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1515066330000|1306, t: 135 }
      2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback 3 fixup
      2018-01-04T11:59:36.995 I REPL     [rsBackgroundSync] rollback finished
      2018-01-04T11:59:36.995 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back @ 18753
      2018-01-04T11:59:36.996 I -        [rsBackgroundSync] Fatal Assertion 28723 at src\mongo\db\repl\rs_rollback.cpp 915
      

      DB-2:

      2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] VoteRequester(term 135 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 135, voteGranted: true, reason: "", ok: 1.0 }
      2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] VoteRequester(term 136) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 136, voteGranted: true, reason: "", ok: 1.0 }
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 136
      2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] transition to PRIMARY
      2018-01-04T11:47:15.240 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
      2018-01-04T11:47:15.423 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1515066442000|1854, t: 135 }
      2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
      2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
      2018-01-04T11:47:17.241 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
      2018-01-04T11:47:17.455 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2018-01-04T11:47:18.078 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      2018-01-04T11:47:18.916 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
      2018-01-04T11:47:23.557 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2018-01-04T11:59:49.050 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

      Arbiter:

      2018-01-04T11:47:15.993 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
      2018-01-04T11:47:18.037 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
      2018-01-04T11:47:25.998 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
      2018-01-04T11:59:47.341 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      

            Assignee: Judah Schvimer
            Reporter: Max Bennedich
            Votes: 0
            Watchers: 8
