- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Environment: Ubuntu 18.04.6 LTS; XFS; kernel 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 GNU/Linux; Transparent Huge Pages disabled; AWS m5.2xlarge; 450 GB gp3 SSD
- Server Triage
- ALL
Hi!
At some point, on one of our shards, the memory consumption of the primary replica started to grow rapidly, along with high CPU usage. That replica then became unresponsive, and consequently another replica became the primary. Right after that, the same thing happened to the new primary.
The incident timeline:
- 10/24/23 7:40 - beginning (peak in CPU and memory consumption)
- 10/24/23 8:20-8:26 (can't say the exact time) - the primary (replica-1) becomes unresponsive, another replica (replica-2) becomes the new primary, and we see a peak in CPU and memory consumption again
- 10/24/23 8:38 - the new primary (replica-2) becomes unresponsive, another replica (replica-1) becomes the new primary
- 10/24/23 8:43 - the replica (replica-3) that never appeared to assume the primary role starts experiencing the same CPU and memory problems
- 10/24/23 9:20 - we manually restart replica-3 and the incident ends
Unfortunately, we couldn't get to the core of the problem, but here are some things we could observe:
- we noticed that the number of open cursors jumped to 500 on the replicas mentioned above (we use Change Streams, so it might be related)
- on replica-3 there were dozens of "hanging" aggregation commands (secs_running showed very large values, around 2000 seconds); see the sketch after this list
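For reference, a minimal sketch of how the two observations above can be checked with pymongo. This is illustrative only and not part of the attached diagnostics; the hostname and the 1000-second threshold are placeholders.

```python
# Sketch: inspect open cursors and long-running aggregations on one replica.
# Assumptions: pymongo is installed and the node is directly reachable;
# "replica-3.example" is a placeholder hostname.
from pymongo import MongoClient

client = MongoClient("mongodb://replica-3.example:27017", directConnection=True)

# Open-cursor count is reported by serverStatus under metrics.cursor.open.total.
status = client.admin.command("serverStatus")
print("open cursors:", status["metrics"]["cursor"]["open"]["total"])

# Long-running aggregation commands are visible via the $currentOp stage.
long_ops = client.admin.aggregate([
    {"$currentOp": {"allUsers": True, "idleConnections": False}},
    {"$match": {"secs_running": {"$gt": 1000},
                "command.aggregate": {"$exists": True}}},
])
for op in long_ops:
    print(op["opid"], op["secs_running"], op["command"].get("aggregate"))
```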
Could you help us identify the cause of the problem?
I'm attaching the diagnostic data of the aforementioned replicas (the files are named replica-1, replica-2, and replica-3, corresponding to the replica numbers mentioned above).