- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- Affects Version/s: 3.4.9
- Component/s: Replication
- Backwards Compatibility: Fully Compatible
- Operating System: ALL
We have a small three-member replica set running MongoDB 3.4.9 (a sketch of the corresponding configuration follows the member list):
- DB-1. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
- DB-2. Amazon EC2, Windows Server 2016, r4.2xlarge, 61 GB RAM, 8 vCPUs. Hosted in Germany.
- Arbiter. Tiny cloud-based Linux instance.
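For reference, the replica set configuration looks roughly like the sketch below. The set name "rs" and the member _ids for DB-1 and DB-2 are taken from the log excerpts further down; the arbiter's _id, the priorities, and any other options are assumptions for illustration only:

// Rough sketch of the replica set configuration (mongo shell).
// Set name and member _ids 2/3 appear in the logs below; the arbiter _id,
// priorities and all other settings are assumed, not copied from the live config.
cfg = {
  _id: "rs",
  members: [
    { _id: 2, host: "proddb1.xxxx.com:27017", priority: 2 },              // DB-1 (higher priority assumed)
    { _id: 3, host: "proddb2.xxxx.com:27017", priority: 1 },              // DB-2
    { _id: 4, host: "proddb-arbiter.xxxx.com:27017", arbiterOnly: true }  // arbiter (_id assumed)
  ]
};
// rs.conf() on any member shows the real, current configuration.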
DB-1 runs as the primary member and has ended up in a ROLLBACK state roughly once per month ever since we set the replica set up. A few times the rollback has succeeded, but usually it is unrecoverable and we have to resync the data manually. We have now had three unrecoverable rollbacks in eight days and have reached the point where we need to address this.
Below are the logs from two of our last three rollbacks (grepped for REPL and a few other relevant keywords). During the Dec 28 rollback we shut DB-1 down manually after 1.5 hours because the rollback wasn't progressing; during the Jan 4 rollback the server terminated itself. Replication lag at the time of the Dec 28 incident may have been as high as ~180 seconds, while it was ~3 seconds during the Jan 4 incident.
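For context, this is how replication lag and the oplog window can be checked from the mongo shell; a generic sketch, not output from our systems:

// Run in the mongo shell against the primary (or any data-bearing member).
rs.printReplicationInfo()        // configured oplog size and the time span it currently covers
rs.printSlaveReplicationDelay()  // per-secondary lag behind the primary (3.4 shell helper)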
The rollback/ folder on DB-1 contains 436 files totalling 55 MB, but none of them are from the recent outages; they are all more than 2 months old.
Any idea why we'd end up in ROLLBACK to begin with, and perhaps more importantly, why the rollback almost never succeeds?
Note: Timestamps between servers are in sync within ±1 second.
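For completeness, a rough way to spot-check the clock offset between the members from a single mongo shell (a sketch only; it ignores network round-trip time):

// Print each member's local server time so offsets can be eyeballed.
["proddb1.xxxx.com:27017", "proddb2.xxxx.com:27017", "proddb-arbiter.xxxx.com:27017"].forEach(function (host) {
  var t = new Mongo(host).getDB("admin").serverStatus().localTime;
  print(host + "  " + t.toISOString());
});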
December 28 incident:
DB-1:
2017-12-28T11:07:29.988 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
2017-12-28T11:07:29.989 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
2017-12-28T11:07:29.989 I REPL [replExecDBWorker-0] transition to SECONDARY
2017-12-28T11:07:31.251 I REPL [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2017-12-28T11:07:37.232 I REPL [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
2017-12-28T11:07:37.628 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1514459257000|3547, t: 129 }. source's GTE: { ts: Timestamp 1514459258000|1, t: 130 } hashes: (-2389994255667895449/3792703541455281876)
2017-12-28T11:07:37.629 I REPL [rsBackgroundSync] rollback 0
2017-12-28T11:07:37.629 I REPL [rsBackgroundSync] transition to ROLLBACK
2017-12-28T11:07:37.640 I REPL [rsBackgroundSync] beginning rollback
2017-12-28T11:07:37.658 I REPL [rsBackgroundSync] rollback 1
2017-12-28T11:07:38.308 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-12-28T11:07:38.341 I REPL [rsBackgroundSync] rollback our last optime: Dec 28 11:07:37:ddb
2017-12-28T11:07:38.342 I REPL [rsBackgroundSync] rollback their last optime: Dec 28 11:07:38:94
2017-12-28T11:07:38.342 I REPL [rsBackgroundSync] rollback diff in end of log times: -1 seconds
2017-12-28T11:07:56.813 I REPL [rsBackgroundSync] rollback common point is { ts: Timestamp 1514459067000|8080, t: 129 }
2017-12-28T11:07:56.813 I REPL [rsBackgroundSync] rollback 3 fixup
2017-12-28T12:38:24.833 I COMMAND [conn693343] terminating, shutdown command received
2017-12-28T12:38:24.834 I REPL [conn693343] shutting down replication subsystems
2017-12-28T12:38:24.834 I REPL [conn693343] Stopping replication reporter thread
2017-12-28T12:38:24.834 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb2.xxxx.com:27017: CallbackCanceled: Reporter no longer valid
2017-12-28T12:38:24.835 I REPL [conn693343] Stopping replication fetcher thread
2017-12-28T12:38:24.858 I REPL [conn693343] Stopping replication applier thread
DB-2:
2017-12-28T11:07:29.584 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-12-28T11:07:29.584 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-12-28T11:07:29.771 I REPL [ReplicationExecutor] VoteRequester(term 129 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 129, voteGranted: true, reason: "", ok: 1.0 }
2017-12-28T11:07:29.771 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2017-12-28T11:07:29.803 I REPL [ReplicationExecutor] VoteRequester(term 130) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 130, voteGranted: true, reason: "", ok: 1.0 }
2017-12-28T11:07:29.803 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 130
2017-12-28T11:07:29.803 I REPL [ReplicationExecutor] transition to PRIMARY
2017-12-28T11:07:29.803 I REPL [ReplicationExecutor] Entering primary catch-up mode.
2017-12-28T11:07:29.884 I REPL [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1514459257000|1022, t: 129 }
2017-12-28T11:07:31.804 I REPL [ReplicationExecutor] Catchup timed out after becoming primary.
2017-12-28T11:07:31.804 I REPL [ReplicationExecutor] Exited primary catch-up mode.
2017-12-28T11:07:31.924 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2017-12-28T11:07:32.632 I REPL [rsSync] transition to primary complete; database writes are now permitted
2017-12-28T11:07:32.831 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2017-12-28T11:07:33.373 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
2017-12-28T11:07:38.329 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:05:11.681 I REPL [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 443437739 -- target:proddb1.xxxx.com:27017 db:admin expDate:2017-12-28T12:05:11.680 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb2.xxxx.com:27017", fromId: 3, term: 130 }
2017-12-28T12:05:13.930 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:46:23.193 I REPL [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; HostUnreachable: An existing connection was forcibly closed by the remote host.
Arbiter:
2017-12-28T11:07:28.769 I COMMAND [conn173] command local.replset.election command: replSetRequestVotes { replSetRequestVotes: 1, setName: "rs", dryRun: true, term: 129, candidateIndex: 2, configVersion: 66175, lastCommittedOp: { ts: Timestamp 1514459067000|8080, t: 129 } } numYields:0 reslen:63 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_command 156ms
2017-12-28T11:07:30.871 I REPL [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2017-12-28T11:07:33.358 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2017-12-28T11:07:38.403 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:46:22.078 I - [conn165] end connection 158.174.52.213:50060 (3 connections now open)
2017-12-28T12:46:24.473 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to proddb1.xxxx.com:27017
2017-12-28T12:46:24.474 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2017-12-28T12:46:24.491 I ASIO [ReplicationExecutor] Failed to close stream: Transport endpoint is not connected
2017-12-28T12:46:24.491 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to proddb1.xxxx.com:27017
2017-12-28T12:46:34.473 I REPL [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
January 4 incident:
DB-1:
2018-01-04T11:47:15.969 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
2018-01-04T11:47:15.969 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
2018-01-04T11:47:15.969 I REPL [replExecDBWorker-2] transition to SECONDARY
2018-01-04T11:47:16.551 I REPL [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2018-01-04T11:47:22.565 I REPL [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
2018-01-04T11:47:22.863 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Received an empty batch from sync source.
2018-01-04T11:47:22.863 I REPL [rsBackgroundSync] rollback 0
2018-01-04T11:47:22.863 I REPL [rsBackgroundSync] transition to ROLLBACK
2018-01-04T11:47:22.873 I REPL [rsBackgroundSync] beginning rollback
2018-01-04T11:47:22.892 I REPL [rsBackgroundSync] rollback 1
2018-01-04T11:47:23.102 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2018-01-04T11:47:23.135 I REPL [rsBackgroundSync] rollback our last optime: Jan 4 11:47:23:516
2018-01-04T11:47:23.135 I REPL [rsBackgroundSync] rollback their last optime: Jan 4 11:47:23:53
2018-01-04T11:47:23.135 I REPL [rsBackgroundSync] rollback diff in end of log times: 0 seconds
2018-01-04T11:47:42.169 I REPL [ReplicationExecutor] Error in heartbeat request to proddb-arbiter.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 1522187 -- target:proddb-arbiter.xxxx.com:27017 db:admin expDate:2018-01-04T11:47:50.168 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb1.xxxx.com:27017", fromId: 2, term: 136 }
2018-01-04T11:47:44.259 I REPL [ReplicationExecutor] Member proddb-arbiter.xxxx.com:27017 is now in state ARBITER
2018-01-04T11:47:45.785 I REPL [rsBackgroundSync] rollback common point is { ts: Timestamp 1515066330000|1306, t: 135 }
2018-01-04T11:47:45.785 I REPL [rsBackgroundSync] rollback 3 fixup
2018-01-04T11:59:36.995 I REPL [rsBackgroundSync] rollback finished
2018-01-04T11:59:36.995 F REPL [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back @ 18753
2018-01-04T11:59:36.996 I - [rsBackgroundSync] Fatal Assertion 28723 at src\mongo\db\repl\rs_rollback.cpp 915
DB-2:
2018-01-04T11:47:15.179 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-01-04T11:47:15.179 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-01-04T11:47:15.208 I REPL [ReplicationExecutor] VoteRequester(term 135 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 135, voteGranted: true, reason: "", ok: 1.0 }
2018-01-04T11:47:15.208 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2018-01-04T11:47:15.239 I REPL [ReplicationExecutor] VoteRequester(term 136) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 136, voteGranted: true, reason: "", ok: 1.0 }
2018-01-04T11:47:15.239 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 136
2018-01-04T11:47:15.239 I REPL [ReplicationExecutor] transition to PRIMARY
2018-01-04T11:47:15.240 I REPL [ReplicationExecutor] Entering primary catch-up mode.
2018-01-04T11:47:15.423 I REPL [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1515066442000|1854, t: 135 }
2018-01-04T11:47:17.240 I REPL [ReplicationExecutor] Catchup timed out after becoming primary.
2018-01-04T11:47:17.240 I REPL [ReplicationExecutor] Exited primary catch-up mode.
2018-01-04T11:47:17.241 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-01-04T11:47:17.455 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2018-01-04T11:47:18.078 I REPL [rsSync] transition to primary complete; database writes are now permitted
2018-01-04T11:47:18.916 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
2018-01-04T11:47:23.557 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2018-01-04T11:59:49.050 I REPL [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
Arbiter:
2018-01-04T11:47:15.993 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2018-01-04T11:47:18.037 I REPL [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2018-01-04T11:47:25.998 I REPL [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2018-01-04T11:59:47.341 I REPL [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit