Core Server / SERVER-4684

Severe server slowdown

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.0.2
    • Component/s: None
    • Environment: Linux

      One of our servers started experiencing a severe slowdown (operations running 10-100x slower) around 17:05 (times are for reference only). It kept working, but very slowly, which led to the rest of the replica set deciding it was dead around 17:29.
      I looked through the log, and aside from the "long query" warnings (queries that used to take 100-200ms started taking around 8000ms), there was this one:
      Sun Jan 15 17:10:23 [initandlisten] connection accepted from 10.2.1.200:36130 #169664
      ...
      Sun Jan 15 17:10:24 [conn169664] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
      0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7fa26256c9ca 0x7fa261b1b70d
      /usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
      /usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
      /usr/bin/mongod() [0x7982e9]
      /usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
      /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
      /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
      /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
      /lib/libpthread.so.0(+0x69ca) [0x7fa26256c9ca]
      /lib/libc.so.6(clone+0x6d) [0x7fa261b1b70d]
      ...

      Right around the time the problem started we saw these entries (there was NO mention of "killcursors" anywhere earlier in the log):
      Sun Jan 15 17:05:45 [conn169305] killcursors 64ms
      Sun Jan 15 17:05:45 [conn169296] killcursors 64ms
      Sun Jan 15 17:05:45 [conn169309] killcursors 60ms

      which became a lot worse later:
      Sun Jan 15 17:28:42 [conn171423] killcursors 2396ms
      Sun Jan 15 17:28:42 [conn171404] killcursors 2396ms
      Sun Jan 15 17:28:42 [conn171396] killcursors 2396ms
      Sun Jan 15 17:28:42 [conn171428] killcursors 2392ms
      Sun Jan 15 17:28:42 [conn171398] killcursors 2392ms
      Sun Jan 15 17:28:42 [conn171459] killcursors 2392ms
      Sun Jan 15 17:28:42 [conn171436] killcursors 2392ms
      Sun Jan 15 17:28:42 [conn171334] killcursors 2492ms

      Maybe it's our application mismanaging cursors?
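
      To make the suspected "cursor mismanagement" concrete, here is a minimal sketch of the kind of pattern I have in mind, written with the Python driver (pymongo) rather than the PHP driver we actually use, and with hypothetical database/collection names (the host is taken from the log, for illustration only). Every half-read cursor that gets discarded forces the driver to send a killcursors for it:

      from pymongo import MongoClient

      client = MongoClient("mongodb://10.2.1.200:27017")  # host from the log, illustration only
      coll = client["somedb"]["somecollection"]           # hypothetical names

      for _ in range(1000):
          cursor = coll.find({}).batch_size(10)  # small batches keep a cursor open on the server
          first = next(cursor, None)             # read only the first document...
          cursor.close()                         # ...then discard the rest; the driver sends killcursors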

      At first I assumed there was some kind of hard drive failure causing this, but simply restarting the server fixed the problem completely, so I don't think it's a hardware problem.
      The database is 99% read-only and the load is light to medium.

      My guess is that either some internal structures got damaged, or there is some problem with the PHP driver or with our pattern of using it.

      I checked dmesg, kern.log, etc.; there are no signs of anything going wrong with the hardware or the kernel.
      I also checked historical atop data: between 16:30 and 17:00 (before the problem) the server did 48120 KiB of reads and 5936 KiB of writes using around 7% CPU, while between 17:00 and 17:30 (the problematic interval) it did 175.2 MiB of reads and 142.2 MiB of writes using around 10% CPU (roughly 3-4x the reads and about 25x the writes). So it looks like it did a lot of unexpected I/O that our application didn't request (TCP I/O grew by at most about 30%). Swap space stayed mostly unused and the file cache size was the same.
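
      For reference, the ratios above come from this quick arithmetic (a small Python snippet; values copied straight from the atop output):

      reads_before_mib  = 48120 / 1024.0   # ~47.0 MiB read in 16:30-17:00
      writes_before_mib = 5936 / 1024.0    # ~5.8 MiB written
      reads_after_mib   = 175.2            # read in 17:00-17:30
      writes_after_mib  = 142.2            # written

      print(reads_after_mib / reads_before_mib)    # ~3.7x more reads
      print(writes_after_mib / writes_before_mib)  # ~24.5x more writes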

      This problem recurs without any visible cause; please advise.

            Assignee: Unassigned
            Reporter: Aristarkh Zagorodnikov (onyxmaster)