-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.2.8
-
Component/s: Replication
-
None
-
ALL
-
After rebooting a system, the database which was currently Primary lost an election to the system which is now the primary. During that process, and in the seconds that followed, the system encountered an error, got stuck in "RECOVERING" and then crashed.
DAY:15:38.702-0500 I REPL [ReplicationExecutor] stepping down from primary, because a new term has begun: 21 DAY:15:38.703-0500 I REPL [replExecDBWorker-1] transition to SECONDARY DAY:15:38.710-0500 I NETWORK [conn244] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [5.5.5.12:58244] DAY:15:39.248-0500 I REPL [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY DAY:15:39.694-0500 I REPL [ReplicationExecutor] syncing from: mongo01.domain.us:27017 DAY:15:39.696-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017 DAY:15:39.697-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017 DAY:15:39.698-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1ce). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1ce) hashes: (7825714044400008130/-1471538608383373916) DAY:15:39.698-0500 I REPL [rsBackgroundSync] beginning rollback DAY:15:39.698-0500 I REPL [rsBackgroundSync] rollback 0 DAY:15:39.698-0500 I REPL [ReplicationExecutor] transition to ROLLBACK DAY:15:39.699-0500 I REPL [rsBackgroundSync] rollback 1 DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback our last optime: Jul 12 22:15:38:1ce DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback their last optime: Jul 12 22:15:39:167 DAY:15:39.700-0500 I REPL [rsBackgroundSync] rollback diff in end of log times: -1 seconds DAY:15:39.707-0500 I QUERY [conn545] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{} DAY:15:39.707-0500 I QUERY [conn545] ntoskip:0 ntoreturn:1 DAY:15:39.809-0500 I QUERY [conn547] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{} DAY:15:39.809-0500 I QUERY [conn547] ntoskip:0 ntoreturn:1 DAY:15:39.907-0500 I REPL [rsBackgroundSync] rollback 3 fixup DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 3.5 DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4 n:2 DAY:15:39.908-0500 I REPL [rsBackgroundSync] minvalid=(term: 21, timestamp: Jul 12 22:15:39:1df) DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4.6 DAY:15:39.908-0500 I REPL [rsBackgroundSync] rollback 4.7 DAY:15:39.909-0500 I REPL [rsBackgroundSync] rollback 5 d:1 u:2 DAY:15:39.909-0500 I REPL [rsBackgroundSync] rollback 6 DAY:15:39.910-0500 I REPL [rsBackgroundSync] rollback done DAY:15:39.910-0500 I REPL [rsBackgroundSync] rollback finished DAY:15:39.910-0500 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 5.5.6.11:27017 DAY:15:39.910-0500 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 5.5.6.11:27017 DAY:15:39.910-0500 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 5.5.6.11:27017, will retry DAY:15:39.910-0500 I QUERY [conn548] assertion 13436 cannot read from oplog collection while in rollback ns:local.oplog.rs query:{} DAY:15:39.910-0500 I QUERY [conn548] ntoskip:0 ntoreturn:1 DAY:15:39.920-0500 I REPL [ReplicationExecutor] transition to RECOVERING DAY:15:39.920-0500 I REPL [ReplicationExecutor] syncing from: mongo03.domain.us:27017 DAY:15:39.921-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo03.domain.us:27017 DAY:15:39.922-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo03.domain.us:27017 DAY:16:09.931-0500 I REPL [ReplicationExecutor] re-evaluating sync source because our current sync source's most recent OpTime is (term: 20, timestamp: Jul 12 22:15:38:1cc) which is more than 30s behind member mongo01.domain.us:27017 whose most recent OpTime is (term: 21, timestamp: Jul 12 22:16:09:a8) DAY:16:09.932-0500 I REPL [ReplicationExecutor] syncing from: mongo01.domain.us:27017 DAY:16:09.933-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017 DAY:16:09.934-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830) DAY:16:09.935-0500 I - [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc) DAY:16:09.935-0500 I - [rsBackgroundSync] ***aborting after fassert() failure
Followed by this on restart:
DAY::21:36.465-0500 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/mongodb/data/wt/diagnostic.data' DAY::21:36.465-0500 I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker DAY::21:36.467-0500 I NETWORK [initandlisten] waiting for connections on port 27017 DAY::21:36.470-0500 I REPL [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 8, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } } DAY::21:36.470-0500 I REPL [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config DAY::21:36.470-0500 I REPL [ReplicationExecutor] transition to STARTUP2 DAY::21:36.470-0500 I REPL [ReplicationExecutor] Starting replication applier threads DAY::21:36.470-0500 I REPL [ReplicationExecutor] transition to RECOVERING DAY::21:36.471-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo01.domain.us:27017 DAY::21:36.472-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo03.domain.us:27017 DAY::21:36.476-0500 I REPL [ReplicationExecutor] New replica set config in use: { _id: "OnspringCavern01", version: 9, protocolVersion: 1, members: [ { _id: 0, host: "mongo01.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "mongo02.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "mongo03.domain.us:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: false, heartbeatIntervalMillis: 1000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 6000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('57857435bb2a73b423c7eb4c') } } DAY::21:36.476-0500 I REPL [ReplicationExecutor] This node is mongo02.domain.us:27017 in the config DAY::21:36.477-0500 I REPL [ReplicationExecutor] Member mongo03.domain.us:27017 is now in state SECONDARY DAY::21:36.477-0500 I REPL [ReplicationExecutor] Member mongo01.domain.us:27017 is now in state PRIMARY DAY::21:37.002-0500 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK DAY::21:37.471-0500 I REPL [ReplicationExecutor] syncing from primary: mongo01.domain.us:27017 DAY::21:37.475-0500 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongo01.domain.us:27017 DAY::21:37.476-0500 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongo01.domain.us:27017 DAY::21:37.477-0500 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 20, timestamp: Jul 12 22:15:38:1cc). source's GTE: (term: 21, timestamp: Jul 12 22:15:38:1cd) hashes: (1087207337565601823/-6762428419615138830) DAY::21:37.477-0500 I - [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 21, timestamp: Jul 12 22:15:39:1df) > our last optime: (term: 20, timestamp: Jul 12 22:15:38:1cc) DAY::21:37.477-0500 I - [rsBackgroundSync] ***aborting after fassert() failure
- duplicates
-
SERVER-25145 During rollback (or w/minvalid invalid) select sync source based on end OpTime
- Closed
-
SERVER-7200 use oplog as op buffer on secondaries
- Closed
- related to
-
SERVER-25848 Enforce single sync source during rollback
- Closed