- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: 2.0.2
- Component/s: None
- Environment: Linux
One of our servers started experiencing severe slowdowns (working 10-100x slower than usual) around 17:05 (times are for reference only). It continued working, but so slowly that the rest of the replica set decided it was dead around 17:29.
I went through the log, and aside from the "long query" warnings (queries that used to take 100-200 ms started taking around 8000 ms), there was this one:
Sun Jan 15 17:10:23 [initandlisten] connection accepted from 10.2.1.200:36130 #169664
...
Sun Jan 15 17:10:24 [conn169664] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7fa26256c9ca 0x7fa261b1b70d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
/usr/bin/mongod() [0x7982e9]
/usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
/lib/libpthread.so.0(+0x69ca) [0x7fa26256c9ca]
/lib/libc.so.6(clone+0x6d) [0x7fa261b1b70d]
...
Right around the time the problem started we saw these entries (there was NO mention of "killcursors" anywhere earlier in the log):
Sun Jan 15 17:05:45 [conn169305] killcursors 64ms
Sun Jan 15 17:05:45 [conn169296] killcursors 64ms
Sun Jan 15 17:05:45 [conn169309] killcursors 60ms
which became a lot worse later:
Sun Jan 15 17:28:42 [conn171423] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171404] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171396] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171428] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171398] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171459] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171436] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171334] killcursors 2492ms
Maybe it's our application mismanaging cursors?
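One thing I can check next time this happens is whether open cursors are actually piling up on the server while the slowdown is in progress. A minimal sketch of what I mean, using pymongo purely for illustration (our application is PHP; the host/port and the exact field names in serverStatus are assumptions on my part):

# Sketch: poll the server's cursor statistics to see whether open cursors
# accumulate during the slowdown. Field names are assumed from the
# serverStatus docs and may differ between versions.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # placeholder address
cursors = client.admin.command("serverStatus").get("cursors", {})
print("open cursors:", cursors.get("totalOpen"))
print("timed out so far:", cursors.get("timedOut"))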
At first I assumed some kind of hard drive failure was causing this, but simply restarting the server fixed the problem completely, so I don't think it's a hardware problem.
The database is 99% read-only, and the load is light to medium.
My guess is that either some internal structures got damaged, or there is a problem with the PHP driver or with the way we use it.
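To make the "way we use it" point concrete: as far as I understand, a driver only needs to send killCursors for cursors that were opened but never read to the end, so if our code abandons lots of partially-read cursors, that alone could explain the killcursors traffic. A minimal sketch of the two patterns I mean, in pymongo for illustration only (our code is PHP; db/collection names are placeholders):

# Sketch of the two cursor-handling patterns (pymongo used purely for
# illustration; database and collection names are placeholders).
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # placeholder address
coll = client["appdb"]["items"]            # placeholder db/collection

# Pattern A: only part of the result is consumed and the cursor object is
# then abandoned; the server-side cursor stays open until the driver kills
# it later or it times out.
partial = coll.find({"flag": True})
first_doc = next(partial, None)

# Pattern B: iterate to the end, or close explicitly, so the server-side
# cursor is released right away.
cursor = coll.find({"flag": True})
try:
    for doc in cursor:
        pass  # process doc
finally:
    cursor.close()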
I checked dmesg, kern.log, etc. – no signs of anything going wrong with the hardware or the kernel.
Also, I checked the historical atop data: between 16:30 and 17:00 (before the problem) the server did 48120 KiB of reads and 5936 KiB of writes at around 7% CPU, while between 17:00 and 17:30 (the problematic interval) it did 175.2 MiB of reads and 142.2 MiB of writes at around 10% CPU – roughly 3 times more reads and 30 times more writes. So it looks like mongod did a lot of I/O that our application didn't request (TCP I/O grew by at most about 30%). Swap stayed mostly unused, and the file cache size was unchanged.
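If it recurs, I can also sample mongod's own counters while it is happening, to see whether that extra write I/O lines up with mongod's background flushing/journaling rather than with client traffic. Something along these lines (pymongo for illustration; field names assumed from the serverStatus docs):

# Sketch: sample serverStatus periodically so the unexpected disk I/O can be
# correlated with mongod's own flushing and with network traffic.
import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # placeholder address
while True:
    s = client.admin.command("serverStatus")
    flush = s.get("backgroundFlushing", {})
    net = s.get("network", {})
    print(time.strftime("%H:%M:%S"),
          "flushes:", flush.get("flushes"),
          "last flush ms:", flush.get("last_ms"),
          "net in:", net.get("bytesIn"),
          "net out:", net.get("bytesOut"))
    time.sleep(30)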
This problem recurs without any visible cause – please advise.
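Next time it happens I'll also turn on the profiler so there is more to go on than the slow-op lines in the log; roughly like this (again pymongo for illustration, and I'm assuming the profile command accepts slowms on 2.0.2):

# Sketch: enable level-1 profiling so operations slower than 100 ms get
# recorded in system.profile, then pull out the slowest ones afterwards.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # placeholder address
db = client["appdb"]                        # placeholder database name

db.command("profile", 1, slowms=100)

for op in db["system.profile"].find().sort("millis", -1).limit(10):
    print(op.get("ts"), op.get("millis"), op.get("op"), op.get("ns"))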