  1. Core Server
  2. SERVER-30652

PRIMARY switching over to SECONDARY frequently

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.4.4
    • Component/s: Replication

      Hi,
      We deployed a 3-node replica set (1 PRIMARY, 1 SECONDARY and 1 ARBITER) for a POC.
      When we tried to load around 100K collections into the database, the SECONDARY could not keep up with the load, went out of sync and shut down.
      The load continued because there was still a PRIMARY, but the PRIMARY then crashed with the symptoms below.

      1. Throughout the load, we saw errors like:
      a. [conn270741] thread over memory limit, cleaning up, current: 498k
      b. Socket say send() Broken pipe
      c. Fri Aug 11 03:08:21.466 I COMMAND [conn165804] serverStatus was very slow:

      { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 6589, after security: 6589, after sharding: 6589, after storageEngine: 6589, after tcmalloc: 6589, after wiredTiger: 6589, at end: 6589 }
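      To read the timing document above, here is a minimal sketch that turns the "after <section>" values into per-section costs. It assumes the values are cumulative elapsed milliseconds (they are non-decreasing, which is consistent with that reading); the helper name section_deltas is hypothetical and not part of MongoDB.

```python
import re

# Timing document copied from the "serverStatus was very slow" log line above.
TIMING = (
    "after basic: 0, after asserts: 0, after backgroundFlushing: 0, "
    "after connections: 0, after dur: 0, after extra_info: 0, "
    "after globalLock: 0, after locks: 0, after network: 0, "
    "after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, "
    "after repl: 6589, after security: 6589, after sharding: 6589, "
    "after storageEngine: 6589, after tcmalloc: 6589, after wiredTiger: 6589, "
    "at end: 6589"
)


def section_deltas(timing):
    """Convert cumulative 'after <section>: <ms>' values into per-section costs.

    Assumption: each value is the total elapsed time once that section finished.
    """
    deltas = []
    previous = 0
    for name, value in re.findall(r"after (\w+): (\d+)", timing):
        elapsed = int(value)
        deltas.append((name, elapsed - previous))
        previous = elapsed
    return deltas


# Keep only the sections that actually consumed time.
slow = [(name, ms) for name, ms in section_deltas(TIMING) if ms > 0]
print(slow)
```

      Under that assumption, the whole 6589 ms is attributed to the repl section, which fits the replication symptoms described in this report.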

      2. The PRIMARY transitioned to SECONDARY multiple times (around 14 times in a day); each time an election took place and the node transitioned back to PRIMARY.

      Fri Aug 11 03:03:32.034 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db1:27030 at 2017-08-11T10:03:33.978Z
      Fri Aug 11 03:03:32.041 I REPL [ReplicationExecutor] Member xsj-db2:27030 is now in state ARBITER
      Fri Aug 11 03:03:32.041 D REPL [ReplicationExecutor] Scheduling heartbeat to xsj-db2:27030 at 2017-08-11T10:03:34.041Z
      Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY

      Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms

      Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26
      Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY

      We checked and found that the ARBITER was up the whole time.
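      To count these step-downs and measure how long the set had no PRIMARY each time, a small log scan along these lines may help. It is only a sketch: it assumes the log line layout shown above, takes the year 2017 from the heartbeat timestamps in the same excerpt, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Lines copied from the log excerpt above.
LOG = """\
Fri Aug 11 03:03:32.042 I REPL [replExecDBWorker-0] transition to SECONDARY
Fri Aug 11 03:03:43.143 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
Fri Aug 11 03:03:43.297 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 26
Fri Aug 11 03:03:43.298 I REPL [ReplicationExecutor] transition to PRIMARY
"""

# Matches "<timestamp> I REPL [<thread>] transition to <STATE>".
PATTERN = re.compile(
    r"^(\w{3} \w{3} +\d+ [\d:.]+) I REPL +\[[^\]]+\] transition to (\w+)$"
)


def transitions(log_text, year=2017):
    """Return (timestamp, new_state) for every 'transition to' line."""
    events = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            # The log lines carry no year, so it is supplied separately.
            when = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %a %b %d %H:%M:%S.%f"
            )
            events.append((when, match.group(2)))
    return events


def seconds_without_primary(events):
    """Seconds between each step-down to SECONDARY and the next PRIMARY."""
    gaps = []
    stepped_down = None
    for when, state in events:
        if state == "SECONDARY":
            stepped_down = when
        elif state == "PRIMARY" and stepped_down is not None:
            gaps.append((when - stepped_down).total_seconds())
            stepped_down = None
    return gaps


events = transitions(LOG)
gaps = seconds_without_primary(events)
print(len(gaps), gaps)
```

      Run over the full mongod log rather than this excerpt, the number of gaps gives the number of step-downs, and a step-down with no following PRIMARY line corresponds to the case in item 3 below where no election succeeded.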

      3. After the 14th switchover to SECONDARY, no election took place and the number of connections increased to 32k, whereas until then the maximum number of connections had been only around 415. After reaching 32k connections the database hung, and the error below was recorded continuously until the database process crashed.

      Fri Aug 11 22:35:42.361 I - [thread1] pthread_create failed: Resource temporarily unavailable
      Fri Aug 11 22:35:42.365 I - [thread1] failed to create service entry worker thread for 172.19.154.189:9621
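      The pthread_create failure suggests a thread or process limit may have been reached. The sketch below compares a connection count against the soft limits of the current process. It assumes one thread and one file descriptor per connection (consistent with the "service entry worker thread" message, but still an assumption), the limit values passed to exceeded() are illustrative only, and both function names are hypothetical.

```python
import resource


def current_limits():
    """Soft limits for this process; an unlimited value is reported as None."""
    names = {"nproc": resource.RLIMIT_NPROC, "nofile": resource.RLIMIT_NOFILE}
    limits = {}
    for label, res in names.items():
        soft, _hard = resource.getrlimit(res)
        limits[label] = None if soft == resource.RLIM_INFINITY else soft
    return limits


def exceeded(limits, connections):
    """Labels of the limits that `connections` would reach.

    Assumption: each connection costs one thread and one file descriptor.
    """
    return sorted(
        label
        for label, value in limits.items()
        if value is not None and connections >= value
    )


# Limits of the Python process itself, not of mongod.
print(current_limits())

# Illustrative limits only; substitute the values that apply to mongod.
print(exceeded({"nproc": 32000, "nofile": 64000}, 32000))
```

      These calls report the limits of the process running the script. To inspect the running mongod on Linux, the same values can be read from /proc/<pid>/limits for the mongod process.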

      Can you please suggest what action should be taken when this occurs?

      Thanks,
      Tanveer

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            tanveermadan@gmail.com Tanveer Madan Marate
            Votes:
            0
            Watchers:
            7
