- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Operating System: ALL
Problem Description
With the MongoDB 4.0 WiredTiger storage engine, we are frequently experiencing UnrecoverableRollbackError crashes.
Setup Details
Mongo Server Version: v4.0.27
git version: d47b151b55f286546e7c7c98888ae0577856ca20
OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
distmod: rhel80
Mongod Options:
options: { net: { bindIpAll: true, ipv6: true, port: 27717 }, operationProfiling: { slowOpThresholdMs: 500 }, processManagement: { fo7.pid" }, replication: { enableMajorityReadConcern: false, oplogSizeMB: 5120, replSet: "set01f" }, security: { keyFile: "/root/.dbkey" }, storage: { dbPath: "/var/data/sessions.1/f", enginecheSizeGB: 22.0 }} }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb-27717.log", quiet: true } }
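For readability, the flattened options above would correspond roughly to the following mongod YAML configuration. This is a sketch reconstructed from the visible fragments: the processManagement section and the storage engine keys were truncated in the ticket, so those values are assumptions, not taken from the original.

```yaml
# Sketch reconstructed from the flattened options above.
# pidFilePath and the storage engine keys are hypothetical fill-ins;
# the ticket's text for those fields is truncated/garbled.
net:
  bindIpAll: true
  ipv6: true
  port: 27717
operationProfiling:
  slowOpThresholdMs: 500
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb-27717.pid   # hypothetical; truncated ("fo7.pid") in the ticket
replication:
  enableMajorityReadConcern: false
  oplogSizeMB: 5120
  replSet: set01f
security:
  keyFile: /root/.dbkey
storage:
  dbPath: /var/data/sessions.1/f
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 22.0   # the garbled "enginecheSizeGB: 22.0" presumably maps here
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb-27717.log
  quiet: true
```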
ReplicaSet with 4 data bearing members and 1 arbiter
RepSet Settings
"protocolVersion" : NumberLong(1),
"writeConcernMajorityJournalDefault" : false,
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 1,
"electionTimeoutMillis" : 10000,
"catchUpTimeoutMillis" : -1,
"catchUpTakeoverDelayMillis" : 30000,
"getLastErrorModes" : { },
"getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
Test Case During Which the Crash Was Reported
We recently migrated from 3.6.17 with the MMAPv1 storage engine to MongoDB 4.0.27 with the WiredTiger storage engine. The 4.0.27 upgrade is required in order for us to migrate to version 4.2. This is the first time we have run our application on the WiredTiger storage engine. We configured a WiredTiger cache size of 22 GB and an oplog of 5 GB (we have been using this oplog size for more than 7 years).
The test case runs deletes of records at 700 TPS per replica set. While this test is running, we see a few replica set members going into RECOVERING state. When we analyzed the logs, we saw:
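As a rough sanity check (a back-of-the-envelope sketch, not a measurement from this ticket), a 5 GB oplog retains only a limited time window under a sustained 700 TPS workload; the bytes-per-entry figure below is an illustrative assumption:

```python
# Back-of-the-envelope estimate of the oplog time window under a sustained
# workload. The bytes-per-entry figure is an illustrative assumption, not
# a measurement from this ticket.

def oplog_window_hours(oplog_mb: float, ops_per_sec: float, bytes_per_entry: float) -> float:
    """Hours of history a capped oplog of `oplog_mb` MB retains when
    entries of `bytes_per_entry` bytes arrive at `ops_per_sec` ops/sec."""
    bytes_per_hour = ops_per_sec * bytes_per_entry * 3600
    return oplog_mb * 1024 * 1024 / bytes_per_hour

# 5 GB oplog (oplogSizeMB: 5120), 700 deletes/sec, assumed ~200 bytes/entry:
window = oplog_window_hours(5120, 700, 200)
print(f"approx. oplog window: {window:.1f} hours")  # -> approx. oplog window: 10.7 hours
```

A secondary that falls behind by more than this window falls off the back of its sync source's oplog, which is consistent with the CappedPositionLost / OplogStartMissing sequence in the log lines below.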
[replication-3928] Restarting oplog query due to error: CappedPositionLost: error in fetcher batch callback :: caused by :: CollectionScan died due tast seen record id: RecordId(7161504155264090918). Last fetched optime (with hash): { ts: Timestamp(1667417622, 805), t: 104 }[5387250820607207456]. Restarts remaining: 1
2022-11-02T19:48:13.531+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1667417622, 805), t: 104 }. source's GTE: { (5387250820607207456/5740888219355370057)
2022-11-02T19:49:31.633+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine cochingDocument: reached beginning of remote oplog: {them: nd2bwa3psm11vb:27717, theirTime: Timestamp(1667417690, 868)}
2022-11-02T19:49:31.633+0000 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1567
2022-11-02T19:49:31.634+0000 F - [rsBackgroundSync] ***aborting after fassert() failure
2022-11-02T19:49:34.853+0000 I CONTROL [main] ***** SERVER RESTARTED *****
When the server restarts, it recovers from an unstable checkpoint and fails to sync up with any member (each reports that it is too stale), finally going into maintenance mode.
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1667417622, 805), t: 104 }, appliedThrough: { ts: Timestamp(166
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1667417622, 805), t: 104 }, through the top of the oplog:
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.
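For reference, the "too stale" condition the restarted member hits can be sketched as a simple optime comparison. This is my own illustration of the concept, not MongoDB's actual implementation: a member cannot resync from a source whose oldest remaining oplog entry is newer than the member's last applied optime.

```python
# Illustration (not MongoDB source code) of the "too stale to catch up"
# check: a member can only sync from a source whose oplog still contains
# the member's last applied optime.
from collections import namedtuple

OpTime = namedtuple("OpTime", ["ts", "t"])  # (timestamp seconds, term)

def too_stale(last_applied: OpTime, source_oldest: OpTime) -> bool:
    """True if the sync source's oplog no longer reaches back to
    `last_applied`, so a full resync (initial sync) is required."""
    return (last_applied.ts, last_applied.t) < (source_oldest.ts, source_oldest.t)

# The restarted member stopped at ts 1667417622, but the source's oplog
# now begins after ts 1667417690 (timestamps taken from the log lines above):
member = OpTime(ts=1667417622, t=104)
source_start = OpTime(ts=1667417690, t=104)
print(too_stale(member, source_start))  # -> True
```

With every candidate sync source reporting this condition, the member has no way to catch up from the oplog alone, which matches the "too stale" messages and the fall into maintenance mode described above.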
Server logs and diagnostic.data are attached. nd2bwa3psm12vb and nd2bwa3psm11va are the affected members, but logs for all four replica set members are attached for reference.
Related to:
- SERVER-28068 Do not go into rollback due to falling off the back of your sync source's oplog (Closed)
- SERVER-33878 Update OplogFetcher to go into SyncSource selection on CappedPositionLost error (Closed)
- SERVER-23392 Increase Replication Rollback (Data) Limit (Closed)