- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Environment: Ubuntu 18.04.6 LTS; XFS; kernel 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 GNU/Linux; Transparent Huge Pages disabled; AWS m5.2xlarge; 450 GB gp3 SSD
- Server Triage
- ALL
Hi!
At some point, on one of our shards, the memory consumption of the primary replica started to grow rapidly, along with high CPU usage. That replica then became unresponsive, and consequently another replica became the primary. Right after that, the same thing happened to the new primary.
The incident timeline:
- 10/24/23 7:40 - beginning (peak in CPU and memory consumption)
- 10/24/23 8:20-8:26 (can't say the exact time) - the primary (replica-1) becomes unresponsive, another replica (replica-2) becomes the new primary, and we see a peak in CPU and memory consumption again
- 10/24/23 8:38 - the new primary (replica-2) becomes unresponsive, another replica (replica-1) becomes the new primary
- 10/24/23 8:43 - the replica (replica-3) that never appeared to assume the primary role starts experiencing the same CPU and memory problems
- 10/24/23 9:20 - we manually restart replica-3 and the incident ends
Unfortunately, we couldn't get to the core of the problem, but here are some things we could observe:
- we noticed that the number of open cursors jumped to 500 on the replicas mentioned above (we use Change Streams, so it might be related)
- on replica-3 there were dozens of "hanging" aggregation commands (secs_running showed very large values, around 2000 seconds); see the sketch after this list
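For reference, a minimal sketch of how the two observations above can be checked with pymongo. This is illustrative only and not part of the attached diagnostics; the hostname and the 1000-second threshold are placeholders.

```python
# Sketch: inspect open cursors and long-running aggregations on one replica.
# Assumptions: pymongo is installed and the node is directly reachable;
# "replica-3.example" is a placeholder hostname.
from pymongo import MongoClient

client = MongoClient("mongodb://replica-3.example:27017", directConnection=True)

# Open-cursor count is reported by serverStatus under metrics.cursor.open.total.
status = client.admin.command("serverStatus")
print("open cursors:", status["metrics"]["cursor"]["open"]["total"])

# Long-running aggregation commands are visible via the $currentOp stage.
long_ops = client.admin.aggregate([
    {"$currentOp": {"allUsers": True, "idleConnections": False}},
    {"$match": {"secs_running": {"$gt": 1000},
                "command.aggregate": {"$exists": True}}},
])
for op in long_ops:
    print(op["opid"], op["secs_running"], op["command"].get("aggregate"))
```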
Could you help us identify the cause of the problem?
I'm attaching the diagnostic data of the aforementioned replicas (the files are named replica-1, replica-2, and replica-3, corresponding to the replica numbers mentioned above).