Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.4.2, 4.4.3
Component/s: Stability
Labels: None
Assigned Teams: Service Arch
Operating System: ALL
Sprint: Query Execution 2021-02-22, Query Execution 2021-03-08, Query Execution 2021-03-22, Query Execution 2021-04-05, Query Execution 2021-04-19, Query Execution 2021-05-03, Query Execution 2021-05-17
Confidence Status: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

Our primary (called "archi" in the logs, IP ending in .29) crashed with a segfault overnight, during a low-traffic period.

One secondary (called "sonic" in the logs, IP ending in .31) became primary to take over and immediately crashed as well, with a different stack trace.

Finally, after we manually restarted servers so that the cluster had enough voting members (our voting members were slightly misconfigured at that point), another secondary (called "loquy" in the logs, IP ending in .34) tried to become primary and crashed too, with a stack trace identical to that of the first secondary. After one last restart of all three, they recovered.
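
As background for the voting-member remark above: a replica set can only elect a primary while a strict majority of its configured voting members is reachable. The following is a minimal Python sketch of that arithmetic only. The function names and the member counts are hypothetical, since this ticket does not state the cluster's actual voting configuration.

```python
def votes_needed(voting_members: int) -> int:
    """Votes required for a strict majority of the configured voting members."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def can_elect_primary(voting_members: int, reachable_voters: int) -> bool:
    """True if the reachable voters form a strict majority of the configured voters."""
    return reachable_voters >= votes_needed(voting_members)


if __name__ == "__main__":
    # Hypothetical (configured, reachable) pairs, not taken from the ticket.
    for total, up in [(3, 2), (5, 2), (7, 4)]:
        print(f"configured={total} reachable={up} electable={can_elect_primary(total, up)}")
```

Under this rule, crashed voting members still count toward the configured total, which is one reason a manual restart can be needed before a new primary can be elected.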

I have attached all three log excerpts and stack traces. The "diagnostic.data" files for the day total ~160MB, and I wasn't sure it was good form to attach that much to a ticket, so they are available here: https://database.lichess.org/mongo-crash/

The primary is running 4.4.2 and the secondaries are on 4.4.3 (I was waiting for a maintenance window to upgrade the primary and didn't dare do it mid-incident).

The only recent admin operations on that cluster were setting the minimum oplog window to 25h (via CLI) and restarting a few secondaries (not affected by these crashes) with the matching config file setting.

Attachments:
archi.log.bz2 (14 kB), uploaded Jan 17 2021 09:40:26 AM UTC by Lucas Bonnet
loquy.log.bz2 (65 kB), uploaded Jan 17 2021 09:40:34 AM UTC by Lucas Bonnet
mongod.conf (0.4 kB), uploaded Jan 18 2021 04:24:07 PM UTC by Lucas Bonnet
sonic.log.bz2 (13 kB), uploaded Jan 17 2021 09:40:37 AM UTC by Lucas Bonnet
trace-2-archi.txt (14 kB), uploaded Feb 16 2021 09:28:18 AM UTC by Lucas Bonnet
trace-2-higgs.txt (33 kB), uploaded Feb 16 2021 09:28:17 AM UTC by Lucas Bonnet
trace-2-loquy.txt (35 kB), uploaded Feb 16 2021 09:28:17 AM UTC by Lucas Bonnet
trace-archi.txt (46 kB), uploaded Jan 17 2021 09:40:41 AM UTC by Lucas Bonnet
trace-sonic.txt (8 kB), uploaded Jan 17 2021 09:40:43 AM UTC by Lucas Bonnet

Issue Links:
duplicates: SERVER-53566 Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant (Closed)
is related to: SERVER-49468 Invalidate previous OperationContext when a new OperationContext is created (Closed)
is related to: SERVER-53566 Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant (Closed)

Assignee: [DO NOT USE] Backlog - Service Architecture
Reporter: Lucas Bonnet
Participants: [DO NOT USE] Backlog - Service Architecture, Billy Donahue, Dmitry Agranat, Kyle Suarez, Lucas Bonnet
Votes: 2
Watchers: 13
Created: Jan 17 2021 09:50:29 AM UTC
Updated: Dec 06 2022 01:38:30 AM UTC
Resolved: May 25 2021 04:13:55 PM UTC
GA Target Date: None
Public Preview Target Date: None
Private Preview Target Date: None
Experiment Target Date: None