  1. Core Server
  2. SERVER-32652

mongod killed by OOM while one secondary lagging

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: 3.4.10
    • Component/s: None
    • Environment:
      Ubuntu 16.04
    • ALL

      mongod 3.4.10 on Ubuntu 16.04 in a replica set with 3 nodes. The primary and a secondary consumed all RAM and were killed by the kernel OOM killer within a couple of minutes of each other.

      storage.wiredTiger.engineConfig.cacheSizeGB is left at its default, so I would expect mongod to use around 50% of RAM.

      At the time I think there was some heavy insert activity.

      Possibly related: the other secondary had started lagging about 15-20 minutes earlier. That node is in Azure and tends to lag under load because of SERVER-31215 / WT-3461.

      Primary detecting lag:

      Jan 10 23:19:50 primary monit[12013]: 'mongo_replcheck' '/usr/local/bin/mongo_replcheck.sh' failed with exit status (1) -- azureslave:27017 lag of 445 sec exceeds threshold 300
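The mongo_replcheck.sh script itself is not shown in this report. As a rough sketch of the kind of check it appears to perform, the function below computes per-secondary lag from a replSetGetStatus-style document (the `members`, `stateStr`, `name` and `optimeDate` fields) and flags members above a threshold. The function name, the sample document, the `primary:27017` and `secondary:27017` host names, and the year in the timestamps are made up for illustration; only `azureslave:27017`, the 445 second lag and the 300 second threshold come from the monit message above.

```python
from datetime import datetime, timedelta


def lagging_secondaries(status, threshold_sec=300):
    """Return {member name: lag in seconds} for secondaries over the threshold.

    `status` is assumed to look like the output of replSetGetStatus:
    a dict with a "members" list whose entries carry "name",
    "stateStr" and "optimeDate" (a datetime).
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lagging = {}
    for member in members:
        if member["stateStr"] != "SECONDARY":
            continue
        lag = (primary["optimeDate"] - member["optimeDate"]).total_seconds()
        if lag > threshold_sec:
            lagging[member["name"]] = lag
    return lagging


# Hypothetical status document; the year is arbitrary because the
# syslog timestamps in this report do not include one.
now = datetime(2018, 1, 10, 23, 19, 50)
sample_status = {
    "members": [
        {"name": "primary:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "secondary:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "azureslave:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=445)},
    ]
}

for name, lag in lagging_secondaries(sample_status).items():
    print(f"{name} lag of {lag:.0f} sec exceeds threshold 300")
```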
      

      Primary running out of memory:

      Jan 10 23:31:01 primary kernel: [2494318.535214] Out of memory: Kill process 1100 (mongod) score 954 or sacrifice child
      Jan 10 23:31:01 primary kernel: [2494318.548316] Killed process 1100 (mongod) total-vm:17422016kB, anon-rss:15654812kB, file-rss:0kB
      

      Secondary running out of memory:

      Jan 10 23:33:46 secondary kernel: [2496035.849027] Out of memory: Kill process 26160 (mongod) score 955 or sacrifice child
      Jan 10 23:33:46 secondary kernel: [2496035.862134] Killed process 26160 (mongod) total-vm:17415872kB, anon-rss:15675724kB, file-rss:0kB
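For comparing the two kills, the "Killed process" lines can be parsed into numbers. This is a minimal sketch that assumes the exact line layout seen in the two excerpts above; other kernel versions print additional fields, and the regular expression and helper name are illustrative only.

```python
import re

# Matches the "Killed process" line format seen in the excerpts above.
KILL_RE = re.compile(
    r"(?P<host>\S+) kernel: \[\s*[\d.]+\] "
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_kill_line(line):
    """Return a dict of fields from a 'Killed process' line, or None."""
    match = KILL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "host": fields["host"],
        "pid": int(fields["pid"]),
        "comm": fields["comm"],
        "total_vm_kb": int(fields["total_vm"]),
        "anon_rss_kb": int(fields["anon_rss"]),
        "file_rss_kb": int(fields["file_rss"]),
    }


LOG_LINES = [
    "Jan 10 23:31:01 primary kernel: [2494318.548316] Killed process 1100 (mongod) "
    "total-vm:17422016kB, anon-rss:15654812kB, file-rss:0kB",
    "Jan 10 23:33:46 secondary kernel: [2496035.862134] Killed process 26160 (mongod) "
    "total-vm:17415872kB, anon-rss:15675724kB, file-rss:0kB",
]

events = [parse_kill_line(line) for line in LOG_LINES]
for event in events:
    print(event["host"], event["comm"], event["pid"], event["anon_rss_kb"], "kB anon-rss")
```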
      

      Total memory on the primary and on the secondary is 16431148 KB each. The Azure node has 14350764 KB.
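To put those numbers side by side: the MongoDB 3.4 documentation describes the default WiredTiger internal cache as the larger of 50% of (RAM minus 1 GB) or 256 MB, which is slightly less than the "around 50%" estimate above. The sketch below applies that formula to the reported totals and compares the result with the anon-rss values from the OOM killer lines. It assumes the memory totals and the kernel's kB figures are in the same unit (1024 bytes); the function names are illustrative.

```python
KB_PER_GB = 1024 * 1024   # kB in 1 GB (binary units assumed)
MIN_CACHE_KB = 256 * 1024  # 256 MB floor from the 3.4 documentation


def default_wt_cache_kb(total_ram_kb):
    """Default WiredTiger cache per the 3.4 docs: max(50% of (RAM - 1 GB), 256 MB)."""
    return max((total_ram_kb - KB_PER_GB) // 2, MIN_CACHE_KB)


def rss_fraction(anon_rss_kb, total_ram_kb):
    """Fraction of total RAM held as anonymous RSS when the process was killed."""
    return anon_rss_kb / total_ram_kb


TOTAL_RAM_KB = 16431148  # primary and secondary, from this report
KILLED = {"primary": 15654812, "secondary": 15675724}  # anon-rss in kB

expected_cache = default_wt_cache_kb(TOTAL_RAM_KB)
print(f"expected default cache: {expected_cache} kB")
for host, anon_rss in KILLED.items():
    share = rss_fraction(anon_rss, TOTAL_RAM_KB)
    ratio = anon_rss / expected_cache
    print(f"{host}: anon-rss is {share:.1%} of RAM, {ratio:.2f}x the expected cache")
```

Under these assumptions the expected cache is a little under half of RAM, while both killed processes held roughly 95% of RAM, about twice the expected cache size.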

      I can provide the FTDC logs privately if you are interested.

            Assignee: Mark Agarunov (mark.agarunov)
            Reporter: Michael Smith (mzs)
            Votes: 0
            Watchers: 5
