-
Type: Bug
-
Resolution: Cannot Reproduce
-
Priority: Critical - P2
-
None
-
Affects Version/s: 4.0.1, 4.0.4
-
Component/s: Replication
-
ALL
-
A series of weird happenings have ended up in a situation triggering Fatal Error 40292 in mongo/db/repl/replication_recovery.cpp line 133.
We have a PSA configuration. The entire issue started when we our DB approached (or hit) the disk limit. We resized the disk about mid October. From there on we've not been able to have a stable system. The servers keep going down.
I got involved two days ago, so I don't have the entire history.
I found that our system was triggering WT-4335, with version 4.0.1. I upgraded to 4.0.4 yesterday. I am guessing now the fatal error is triggering at the root cause. (The WT-4335 was triggering after all the oplog replay, now the error is at the start itself.)
2018-11-12T05:41:22.612+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/mongo/rs0_8/diagnostic.data'
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Rollback ID is 12
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Recovering from stable timestamp: Timestamp(1539794142, 116) (top of oplog: { ts: Timestamp(1541830881, 1), t: 85 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1539794142, 116)
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Replaying stored operations from { : Timestamp(1539794142, 116) } (exclusive) to { : Timestamp(1541830881, 1) } (inclusive).
{{2018-11-12T05:41:22.617+0000 F REPL [initandlisten] Oplog entry at { : Timestamp(1539794142, 116) } is missing; actual entry found is { : Timestamp(1539794616, 312) }}}
2018-11-12T05:41:22.617+0000 F - [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_recovery.cpp 134
{{2018-11-12T05:41:22.617+0000 F - [initandlisten] }}
For reference, here is log with 4.0.1 (prior to the above error).
2018-11-11T10:23:54.182+0000 I ROLLBACK [rsBackgroundSync] Rolling back to the stable timestamp. StableTimestamp: Timestamp(1539794142, 116) Initial Data Timestamp: Timestamp(1539794142, 116)2018-11-11T10:23:54.275+0000 E STORAGE [rsBackgroundSync] WiredTiger error (22) [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument Raw: [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument2018-11-11T10:23:54.275+0000 F ROLLBACK [rsBackgroundSync] RecoverToStableTimestamp failed. :: caused by :: UnrecoverableRollbackError: Error rolling back to stable. Err: Invalid argument
- related to
-
SERVER-36495 Cache pressure issues during recovery oplog application
- Closed