  Core Server / SERVER-38720

3-node replica set: periodically load avg above 100 on primary, unable to answer queries

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      Hi,

      We're running a 3-node replica set with MongoDB version 4.0.1. Until recently we ran the same data set on a replica set with version 2.4, and we saw the same issue there.

      Once in a while the load on the primary node suddenly spikes and active reads pile up (see the attached screenshot from our Grafana dashboard). When this happens, the cluster is unable to answer queries at all. Our short-term workaround is to either run rs.stepDown() or restart the mongod on the primary completely.

      We would like input on how to proceed with debugging this. We have not yet spotted a query that looks likely to be the cause. The replica set ran fine for years before the issue first appeared a few months ago, and we're unsure what is causing it.

      Attached are MongoDB metrics and host metrics that show the problem.

      Thanks!

        1. mongo3.png (190 kB)
        2. mongo2.png (195 kB)
        3. mongo1.png (73 kB)

            Assignee: Danny Hatcher (Inactive), daniel.hatcher@mongodb.com
            Reporter: Frank Steinborn (steinborn)
            Votes: 0
            Watchers: 5
