Type: Question
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.4
Component/s: Replication
Labels: None

Hi,
We deployed a 3-node replica set (1 PRIMARY, 1 SECONDARY and 1 ARBITER) for POC purposes.
While loading around 100K collections into the database, the SECONDARY could not keep up with the load, went out of sync and shut down.
The load continued because there was still a PRIMARY, but the PRIMARY later crashed with the symptoms below.

1. Throughout the load, we see errors like
a. [conn270741] thread over memory limit, cleaning up, current: 498k
b. Socket say send() Broken pipe
c. Fri Aug 11 03:08:21.466 I COMMAND [conn165804] serverStatus was very slow:

{ after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 6589, after security: 6589, after sharding: 6589, after storageEngine: 6589, after tcmalloc: 6589, after wiredTiger: 6589, at end: 6589 }
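To see which serverStatus section accounts for the delay in a line like the one above, the timings can be parsed and differenced. The sketch below assumes the `after <section>: <ms>` values are cumulative elapsed milliseconds; the helper names `parse_section_timings` and `slowest_section` are hypothetical and not part of any MongoDB tool.

```python
import re

# Shortened copy of the "serverStatus was very slow" payload quoted above.
SAMPLE = (
    "{ after basic: 0, after opcounters: 0, after opcountersRepl: 0, "
    "after repl: 6589, after security: 6589, after wiredTiger: 6589, at end: 6589 }"
)

def parse_section_timings(text):
    """Return [(section, cumulative_ms), ...] in the order they appear."""
    return [(name, int(ms)) for name, ms in re.findall(r"after (\w+): (\d+)", text)]

def slowest_section(timings):
    """Return (section, delta_ms) for the largest jump between consecutive entries.

    Assumption: each value is the elapsed time since the command started,
    so the cost of one section is its value minus the previous value.
    """
    best_name, best_delta, previous = None, 0, 0
    for name, cumulative in timings:
        delta = cumulative - previous
        if delta > best_delta:
            best_name, best_delta = name, delta
        previous = cumulative
    return best_name, best_delta

if __name__ == "__main__":
    print(slowest_section(parse_section_timings(SAMPLE)))
```

Under that assumption, the whole 6589 ms in the quoted line is attributed to the `repl` section, since every later section reports the same cumulative value.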

2. The PRIMARY transitioned to SECONDARY multiple times (around 14 times in a day); each time an election took place and the node transitioned back to PRIMARY.

Fri Aug 11 03:03:32.034 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db1:27030 at 2017-08-11T10:03:33.978Z
Fri Aug 11 03:03:32.041 I REPL [ReplicationExecutor] Member xsj-db2:27030 is now in state ARBITER
Fri Aug 11 03:03:32.041 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db2:27030 at 2017-08-11T10:03:34.041Z
Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY

Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms

Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26
Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY

Throughout this time we checked and found that the ARBITER was up.
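A rough way to confirm how often the node changed role is to count the `transition to ...` lines in the mongod log. This is only a sketch: it assumes the log format matches the lines quoted above, and `count_transitions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches REPL lines such as "... I REPL [replExecDBWorker-0] transition to SECONDARY".
TRANSITION = re.compile(r"\bREPL\b.*\btransition to (PRIMARY|SECONDARY)\b")

def count_transitions(lines):
    """Count role transitions per target state in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Lines copied from the log excerpt in this ticket.
SAMPLE_LINES = [
    "Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY",
    "Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms",
    "Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26",
    "Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY",
]

if __name__ == "__main__":
    print(count_transitions(SAMPLE_LINES))
```

To run it against a real log, pass an open file object instead of `SAMPLE_LINES`; the path to the log file depends on the deployment.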

3. After the 14th switchover to SECONDARY, no election took place and the number of connections increased to 32k, even though the maximum number of connections had only been around 415. After reaching 32k connections the database hung, and the error below was logged continuously until the database process crashed:

Fri Aug 11 22:35:42.361 I - [thread1] pthread_create failed: Resource temporarily unavailable
Fri Aug 11 22:35:42.365 I - [thread1] failed to create service entry worker thread for 172.19.154.189:9621
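`pthread_create` failing with "Resource temporarily unavailable" (EAGAIN) generally means a thread or process limit, or another system resource, was exhausted. As a first check, the per-user limits can be printed with Python's standard `resource` module on Linux. This is a sketch only: it reports the limits of the shell it runs in, so it would need to be run as the same user that starts mongod, and `describe_limit` is a hypothetical helper name. The limits of an already running process can be read from `/proc/<pid>/limits` instead.

```python
import resource

def describe_limit(label, rlimit_const):
    """Format the soft and hard values of one resource limit."""
    soft, hard = resource.getrlimit(rlimit_const)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{label}: soft={fmt(soft)} hard={fmt(hard)}"

if __name__ == "__main__":
    # RLIMIT_NPROC caps processes/threads for the user; RLIMIT_NOFILE caps
    # open file descriptors, which includes sockets for client connections.
    print(describe_limit("max user processes/threads", resource.RLIMIT_NPROC))
    print(describe_limit("max open files", resource.RLIMIT_NOFILE))
```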

Can you please suggest what action should be taken when this occurs?

Thanks,
Tanveer

Assignee: Kelsey Schubert
Reporter: Tanveer Madan Marate
Participants: Kelsey Schubert, Tanveer Madan Marate
Votes: 0
Watchers: 7
Created: Aug 14 2017 09:50:01 PM UTC
Updated: Feb 09 2018 06:51:48 PM UTC
Resolved: Jan 18 2018 10:05:51 PM UTC
