Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.2.9
Component/s: None
Labels:
None

Operating System:
ALL
Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Hello,

Last night the primary node stepped down for reason I don't know and a secondary became a primary. The former primary immediately displayed this error:

2016-09-23T04:10:09.012+0200 I NETWORK  [conn4831] end connection 10.0.0.10:40780 (3 connections now open)
2016-09-23T04:10:09.349+0200 I REPL     [ReplicationExecutor] syncing from: mongodb-replica2:27017
2016-09-23T04:10:09.353+0200 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to mongodb-replica2:27017
2016-09-23T04:10:09.355+0200 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 238, timestamp: Sep 23 04:10:05:b7). source's GTE: (term: 239, timestamp: Sep 23 04:10:06:1) hashes: (298476498444083566/5675657933368738701)
2016-09-23T04:10:09.356+0200 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 239, timestamp: Sep 23 04:10:06:1) > our last optime: (term: 238, timestamp: Sep 23 04:10:05:b7)
2016-09-23T04:10:09.356+0200 I -        [rsBackgroundSync]

***aborting after fassert() failure

I tried this morning to start mongod on former primary but the same error appeared after few seconds.

I had this problem before but it was due to ungraceful shutdown and bad RAID array. But in this replica set, each node is on a RAID 10 SSD with lots of memory and cpu. Journaling is enabled on both mongod and file system. And again there was no ungraceful shutdown it just happened during the election even though the primary had the highest priority.

Also, I just finished a 2-day mongod --repair on the primary before this incident and I did rsync the disk image to other secondaries. So they were all identical with no corruption.

I attached logs, diagnostics and WiredTiger file for the former primary and a new primary. The incident happened at "2016-09-23T04:10".

Please let me know if you need anything else.

rs0:PRIMARY> rs.conf()
{
        "_id" : "rs0",
        "version" : 468974,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 4,
                        "host" : "mongodb-replica1:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 3,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 5,
                        "host" : "mongodb-replica2:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 6,
                        "host" : "mongodb-replica3:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {

                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {

                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                }
        }
}

Log from new primary:

2016-09-23T04:10:05.189+0200 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-09-23T04:10:05.189+0200 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-23T04:10:05.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017
2016-09-23T04:10:05.189+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017
2016-09-23T04:10:05.601+0200 I REPL     [ReplicationExecutor] could not find member to sync from
2016-09-23T04:10:05.601+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica1:27017
2016-09-23T04:10:05.601+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongodb-replica3:27017

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

mongo-replica1.zip
93.09 MB
Sep 23 2016 09:06:53 AM UTC
mongo-replica2.zip
103.19 MB
Sep 23 2016 09:06:53 AM UTC

duplicates

SERVER-7200 use oplog as op buffer on secondaries

Closed

related to

SERVER-25848 Enforce single sync source during rollback

Closed

Assignee:: Kelsey Schubert

Reporter:: Maziyar Panahi

Participants:: Kelsey Schubert, Maziyar Panahi

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Created:: Sep 23 2016 09:06:53 AM UTC

Updated:: Sep 30 2016 10:22:41 AM UTC

Resolved:: Sep 27 2016 03:53:19 PM UTC

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates