- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Operating System: ALL
Problem Description
With the MongoDB 4.0 WiredTiger storage engine, we are frequently experiencing UnrecoverableRollbackError crashes.
Setup Details
Mongo Server Version: v4.0.27
git version: d47b151b55f286546e7c7c98888ae0577856ca20
OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
distmod: rhel80
Mongod Options:
options: { net: { bindIpAll: true, ipv6: true, port: 27717 }, operationProfiling: { slowOpThresholdMs: 500 }, processManagement: { fo7.pid" }, replication: { enableMajorityReadConcern: false, oplogSizeMB: 5120, replSet: "set01f" }, security: { keyFile: "/root/.dbkey" }, storage: { dbPath: "/var/data/sessions.1/f", enginecheSizeGB: 22.0 }} }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb-27717.log", quiet: true } }
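For readability, the flattened options above would correspond roughly to the following mongod YAML configuration. This is a sketch reconstructed from the visible fragments: the processManagement section and the storage engine keys were truncated in the ticket, so those values are assumptions, not taken from the original.

```yaml
# Sketch reconstructed from the flattened options above.
# pidFilePath and the storage engine keys are hypothetical fill-ins;
# the ticket's text for those fields is truncated/garbled.
net:
  bindIpAll: true
  ipv6: true
  port: 27717
operationProfiling:
  slowOpThresholdMs: 500
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb-27717.pid   # hypothetical; truncated ("fo7.pid") in the ticket
replication:
  enableMajorityReadConcern: false
  oplogSizeMB: 5120
  replSet: set01f
security:
  keyFile: /root/.dbkey
storage:
  dbPath: /var/data/sessions.1/f
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 22.0   # the garbled "enginecheSizeGB: 22.0" presumably maps here
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb-27717.log
  quiet: true
```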
ReplicaSet with 4 data bearing members and 1 arbiter
RepSet Settings
"protocolVersion" : NumberLong(1),
"writeConcernMajorityJournalDefault" : false,
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 1,
"electionTimeoutMillis" : 10000,
"catchUpTimeoutMillis" : -1,
"catchUpTakeoverDelayMillis" : 30000,
"getLastErrorModes" : { },
"getLastErrorDefaults" : { "w" : 1, "wtimeout" : 0 },
Test Case During Which the Crash Was Reported
We recently migrated from 3.6.17 with the MMAPv1 storage engine to MongoDB 4.0.27 with the WiredTiger storage engine. The 4.0.27 upgrade is required in order for us to migrate to version 4.2. This is the first time we have run our application on the WiredTiger storage engine. We configured a WiredTiger cache size of 22 GB and an oplog of 5 GB (we have been using this oplog size for more than 7 years).
The test case runs deletes of records at 700 TPS per replica set. While this test is running, we see a few replica set members going into RECOVERING state. When we analyzed the logs, we saw:
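As a rough sanity check (a back-of-the-envelope sketch, not a measurement from this ticket), a 5 GB oplog retains only a limited time window under a sustained 700 TPS workload; the bytes-per-entry figure below is an illustrative assumption:

```python
# Back-of-the-envelope estimate of the oplog time window under a sustained
# workload. The bytes-per-entry figure is an illustrative assumption, not
# a measurement from this ticket.

def oplog_window_hours(oplog_mb: float, ops_per_sec: float, bytes_per_entry: float) -> float:
    """Hours of history a capped oplog of `oplog_mb` MB retains when
    entries of `bytes_per_entry` bytes arrive at `ops_per_sec` ops/sec."""
    bytes_per_hour = ops_per_sec * bytes_per_entry * 3600
    return oplog_mb * 1024 * 1024 / bytes_per_hour

# 5 GB oplog (oplogSizeMB: 5120), 700 deletes/sec, assumed ~200 bytes/entry:
window = oplog_window_hours(5120, 700, 200)
print(f"approx. oplog window: {window:.1f} hours")  # -> approx. oplog window: 10.7 hours
```

A secondary that falls behind by more than this window falls off the back of its sync source's oplog, which is consistent with the CappedPositionLost / OplogStartMissing sequence in the log lines below.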
[replication-3928] Restarting oplog query due to error: CappedPositionLost: error in fetcher batch callback :: caused by :: CollectionScan died due tast seen record id: RecordId(7161504155264090918). Last fetched optime (with hash): { ts: Timestamp(1667417622, 805), t: 104 }[5387250820607207456]. Restarts remaining: 1
2022-11-02T19:48:13.531+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1667417622, 805), t: 104 }. source's GTE: { (5387250820607207456/5740888219355370057)
2022-11-02T19:49:31.633+0000 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine cochingDocument: reached beginning of remote oplog: {them: nd2bwa3psm11vb:27717, theirTime: Timestamp(1667417690, 868)}
2022-11-02T19:49:31.633+0000 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1567
2022-11-02T19:49:31.634+0000 F - [rsBackgroundSync] ***aborting after fassert() failure
2022-11-02T19:49:34.853+0000 I CONTROL [main] ***** SERVER RESTARTED *****
When the server restarts, it recovers from an unstable checkpoint and fails to sync up with any member (each reports that it is too stale), finally going into maintenance mode.
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1667417622, 805), t: 104 }, appliedThrough: { ts: Timestamp(166
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1667417622, 805), t: 104 }, through the top of the oplog:
2022-11-02T19:49:41.746+0000 I REPL [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.
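For reference, the "too stale" condition the restarted member hits can be sketched as a simple optime comparison. This is my own illustration of the concept, not MongoDB's actual implementation: a member cannot resync from a source whose oldest remaining oplog entry is newer than the member's last applied optime.

```python
# Illustration (not MongoDB source code) of the "too stale to catch up"
# check: a member can only sync from a source whose oplog still contains
# the member's last applied optime.
from collections import namedtuple

OpTime = namedtuple("OpTime", ["ts", "t"])  # (timestamp seconds, term)

def too_stale(last_applied: OpTime, source_oldest: OpTime) -> bool:
    """True if the sync source's oplog no longer reaches back to
    `last_applied`, so a full resync (initial sync) is required."""
    return (last_applied.ts, last_applied.t) < (source_oldest.ts, source_oldest.t)

# The restarted member stopped at ts 1667417622, but the source's oplog
# now begins after ts 1667417690 (timestamps taken from the log lines above):
member = OpTime(ts=1667417622, t=104)
source_start = OpTime(ts=1667417690, t=104)
print(too_stale(member, source_start))  # -> True
```

With every candidate sync source reporting this condition, the member has no way to catch up from the oplog alone, which matches the "too stale" messages and the fall into maintenance mode described above.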
Server logs and diagnostic.data are attached. nd2bwa3psm12vb and nd2bwa3psm11va are the affected members, but logs for all four replica set members are attached for reference.
Related to:
- SERVER-28068 Do not go into rollback due to falling off the back of your sync source's oplog (Closed)
- SERVER-33878 Update OplogFetcher to go into SyncSource selection on CappedPositionLost error (Closed)
- SERVER-23392 Increase Replication Rollback (Data) Limit (Closed)