-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.2.9
-
Component/s: None
-
None
-
ALL
Hello,
Last night the primary node stepped down for reason I don't know and a secondary became a primary. The former primary immediately displayed this error:
2016-09-23T04:10:09.012+0200 I NETWORK [conn4831] end connection 10.0.0.10:40780 (3 connections now open) 2016-09-23T04:10:09.349+0200 I REPL [ReplicationExecutor] syncing from: mongodb-replica2:27017 2016-09-23T04:10:09.353+0200 I REPL [SyncSourceFeedback] setting syncSourceFeedback to mongodb-replica2:27017 2016-09-23T04:10:09.355+0200 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 238, timestamp: Sep 23 04:10:05:b7). source's GTE: (term: 239, timestamp: Sep 23 04:10:06:1) hashes: (298476498444083566/5675657933368738701) 2016-09-23T04:10:09.356+0200 I - [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 239, timestamp: Sep 23 04:10:06:1) > our last optime: (term: 238, timestamp: Sep 23 04:10:05:b7) 2016-09-23T04:10:09.356+0200 I - [rsBackgroundSync] ***aborting after fassert() failure
I tried this morning to start mongod on former primary but the same error appeared after few seconds.
I had this problem before but it was due to ungraceful shutdown and bad RAID array. But in this replica set, each node is on a RAID 10 SSD with lots of memory and cpu. Journaling is enabled on both mongod and file system. And again there was no ungraceful shutdown it just happened during the election even though the primary had the highest priority.
Also, I just finished a 2-day mongod --repair on the primary before this incident and I did rsync the disk image to other secondaries. So they were all identical with no corruption.
I attached logs, diagnostics and WiredTiger file for the former primary and a new primary. The incident happened at "2016-09-23T04:10".
Please let me know if you need anything else.
rs0:PRIMARY> rs.conf() { "_id" : "rs0", "version" : 468974, "protocolVersion" : NumberLong(1), "members" : [ { "_id" : 4, "host" : "mongodb-replica1:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 3, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 5, "host" : "mongodb-replica2:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 }, { "_id" : 6, "host" : "mongodb-replica3:27017", "arbiterOnly" : false, "buildIndexes" : true, "hidden" : false, "priority" : 1, "tags" : { }, "slaveDelay" : NumberLong(0), "votes" : 1 } ], "settings" : { "chainingAllowed" : true, "heartbeatIntervalMillis" : 2000, "heartbeatTimeoutSecs" : 10, "electionTimeoutMillis" : 10000, "getLastErrorModes" : { }, "getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 } } }
Log from new primary:
2016-09-23T04:10:05.189+0200 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms 2016-09-23T04:10:05.189+0200 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected 2016-09-23T04:10:05.189+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017 2016-09-23T04:10:05.189+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017 2016-09-23T04:10:05.601+0200 I REPL [ReplicationExecutor] could not find member to sync from 2016-09-23T04:10:05.601+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017 2016-09-23T04:10:05.601+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017
- duplicates
-
SERVER-7200 use oplog as op buffer on secondaries
- Closed
- related to
-
SERVER-25848 Enforce single sync source during rollback
- Closed