- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- None
- Affects Version/s: 4.4.2, 4.4.3
- Component/s: Stability
- None
- Service Arch
- ALL
- Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17
Our primary (called "archi" in the logs, IP ending in .29) crashed with a segfault at night, during a low-traffic period.
One secondary (called "sonic" in the logs, IP ending in .31) became primary to take over, and immediately crashed too, with a different stack trace.
Finally, after we manually restarted servers to give the cluster enough voting members (our voting members were slightly misconfigured at that point), another secondary (called "loquy", IP ending in .34) tried to become primary and crashed too (stack trace identical to the first secondary's). After a final restart of all of them, they recovered.
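For context on why the restart was needed: a replica set can only elect a primary while a strict majority of its voting members is reachable. The sketch below illustrates that rule only; the member counts are hypothetical and do not describe this cluster's actual configuration, and the function names are made up for illustration.

```python
def votes_needed(voting_members: int) -> int:
    """Smallest strict majority of the configured voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable_voters: int) -> bool:
    """True if enough voting members are reachable to win an election."""
    return reachable_voters >= votes_needed(voting_members)


if __name__ == "__main__":
    # Hypothetical example: 5 configured voters, 2 of them down or crashed.
    print(votes_needed(5))
    print(can_elect_primary(5, 3))
    print(can_elect_primary(5, 2))
```

Under this rule, losing voters one after another can leave the remaining members unable to elect anyone until enough of them are brought back.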
I have attached all 3 log excerpts and stack traces. The "diagnostic.data" files for the day total ~160MB, and I wasn't sure it was good form to plop that in a ticket. They are available here: https://database.lichess.org/mongo-crash/
The primary is running 4.4.2 and the secondaries are on 4.4.3 (I was waiting for a maintenance window to upgrade the primary and didn't dare do it mid-incident).
The only recent admin operation on that cluster was setting the minimum oplog window to 25h (via CLI), and restarting a few secondaries (not affected by those crashes) with the matching config file setting.
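For reference, the oplog change described above can be sketched as follows. This assumes that "via CLI" means the `replSetResizeOplog` admin command with its `minRetentionHours` field (available since MongoDB 4.4) and that the matching config file setting is `storage.oplogMinRetentionHours`; the report names neither explicitly, and the helper names below are hypothetical.

```python
def min_oplog_window_command(hours: float) -> dict:
    """Build the admin command document that sets the minimum oplog retention."""
    if hours < 0:
        raise ValueError("retention hours must be non-negative")
    # The command name must be the first key of the document.
    return {"replSetResizeOplog": 1, "minRetentionHours": float(hours)}


def min_oplog_window_config(hours: float) -> str:
    """Render the assumed equivalent mongod.conf (YAML) fragment."""
    if hours < 0:
        raise ValueError("retention hours must be non-negative")
    return f"storage:\n  oplogMinRetentionHours: {float(hours)}\n"


if __name__ == "__main__":
    # 25 hours, as in the report.
    print(min_oplog_window_command(25))
    print(min_oplog_window_config(25))
```

With a driver such as PyMongo, the command document would typically be passed to `client.admin.command(...)`. The command applies only to the member it is run against, which would be consistent with the secondaries being restarted with the config file setting instead.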
- duplicates
  - SERVER-53566 Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant (Closed)
- is related to
  - SERVER-49468 Invalidate previous OperationContext when a new OperationContext is created (Closed)
  - SERVER-53566 Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant (Closed)