Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: 3.4.4
Component/s: Replication
Hi,
We deployed a 3-node replica set (1 PRIMARY, 1 SECONDARY and 1 ARBITER) as a proof of concept.
While we were loading around 100K collections into the database, the SECONDARY could not keep up with the load, fell out of sync and shut down.
The load continued because there was still a PRIMARY, but the PRIMARY later crashed with the symptoms below.
1. Throughout the load, we saw errors like:
a. [conn270741] thread over memory limit, cleaning up, current: 498k
b. Socket say send() Broken pipe
c. Fri Aug 11 03:08:21.466 I COMMAND [conn165804] serverStatus was very slow:
2. The PRIMARY transitioned to SECONDARY multiple times (around 14 times in a day). Each time, an election took place and the node transitioned back to PRIMARY:
Fri Aug 11 03:03:32.034 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db1:27030 at 2017-08-11T10:03:33.978Z
Fri Aug 11 03:03:32.041 I REPL [ReplicationExecutor] Member xsj-db2:27030 is now in state ARBITER
Fri Aug 11 03:03:32.041 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db2:27030 at 2017-08-11T10:03:34.041Z
Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY
Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26
Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY
We checked and found that the ARBITER was up the whole time.
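For reference, the role transitions can be tallied from the mongod log with something like the sketch below. It assumes the log line format shown in the excerpt above; the function name `count_transitions` and the `sample` list are illustrative only and not part of any MongoDB tooling.

```python
import re
from collections import Counter

# Matches REPL log lines such as:
#   Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY
# (format assumed from the excerpt above)
TRANSITION_RE = re.compile(r"\bREPL\b.*\btransition to (\w+)")


def count_transitions(lines):
    """Count 'transition to <STATE>' events per target state."""
    counts = Counter()
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines copied from the log excerpt above.
sample = [
    "Fri Aug 11 03:03:32.041 I REPL [ReplicationExecutor] Member xsj-db2:27030 is now in state ARBITER",
    "Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY",
    "Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26",
    "Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY",
]

if __name__ == "__main__":
    print(count_transitions(sample))
```

To run it over a whole log, an open file object can be passed in place of `sample`, since the function only iterates over lines.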
3. After the 14th switch to SECONDARY, no election took place and the number of connections rose to 32k, even though the maximum number of connections until then had been only around 415. After reaching 32k connections, the database hung and the error below was logged continuously until the database process crashed:
Fri Aug 11 22:35:42.361 I - [thread1] pthread_create failed: Resource temporarily unavailable
Fri Aug 11 22:35:42.365 I - [thread1] failed to create service entry worker thread for 172.19.154.189:9621
Can you please suggest what action should be taken when this occurs?
Thanks,
Tanveer