- Type: Bug
- Resolution: Community Answered
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Environment: Ubuntu 16.04; XFS; Kernel 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux; Transparent Huge Pages disabled; AWS m5.large (2 vCPU / 8 GB); SSD GP3 450 GB; mongodb-org-server 4.2.17
- ALL
For no apparent reason, the primary replica set member of one of our shards became unresponsive until we restarted it.
The incident lasted about 35 minutes. During that time the primary was at almost 100% utilization and its load average rose to roughly 60 times the normal value.
From the logs at the beginning of the incident we could tell only the following (a monitoring sketch for these metrics follows the list):
1) The number of open connections started increasing.
2) Some open cursors timed out.
3) The pooled connections to other members were dropped (supposedly due to shutdown, but we did not try to shut down the primary at that time):
```
I CONNPOOL [TaskExecutorPool-0] Dropping all pooled connections to some-secondary:27017 due to ShutdownInProgress: Pool for some-secondary:27017 has expired.
```
4) After that, no log entries appeared for about 25 minutes, until we restarted the primary.
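For reference, the connection and cursor counters we were watching can be polled from `serverStatus`. Below is a minimal sketch assuming pymongo and direct access to the affected primary; the host, port, and 10-second interval are illustrative, not taken from this report.
```
# Poll serverStatus and print connection/cursor counters (illustrative host/port).
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

while True:
    status = client.admin.command("serverStatus")
    conns = status["connections"]          # current / available connection counts
    cursors = status["metrics"]["cursor"]  # open and timed-out cursor counters
    print(
        f"connections: current={conns['current']} available={conns['available']} | "
        f"cursors: open={cursors['open']['total']} timedOut={cursors['timedOut']}"
    )
    time.sleep(10)  # poll every 10 seconds
```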
Our cluster configuration:
- sharded cluster with 10 shards
- four replica set members in each shard
- about 400 GB of data (storage size) per shard
Replica server configuration:
- Ubuntu 16.04
- XFS
- Kernel - 4.4.0-1128-aws #142-Ubuntu SMP Fri Apr 16 12:42:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Transparent Huge Pages disabled (see the check sketch after this list)
- AWS m5.large (2 vCPU / 8 GB)
- SSD GP3, 450 GB
- mongodb-org-server 4.2.17
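The THP setting can be confirmed on the host by reading the standard sysfs files; a minimal sketch (the paths are the usual Linux kernel defaults, not something specific to this report):
```
# Print the Transparent Huge Pages settings; the active value is shown in brackets,
# e.g. "always madvise [never]".
from pathlib import Path

for name in ("enabled", "defrag"):
    path = Path("/sys/kernel/mm/transparent_hugepage") / name
    print(f"THP {name}: {path.read_text().strip()}")
```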
`diagnostic.data` from the primary and from one of the secondaries is attached to this post.