Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.0.0
Component/s: Querying
Environment: Debian wheezy (I'm on testing because I preferred a packaged mongo before 2.0). It's a secondary behind a MongoDB 1.8 primary in a replica set with an arbiter. Linux.
I'm running a fresh install of mongo 2.0.0 (git version 695c67df).
It mostly works well:
1->30% user CPU, 0% iowait, load average: 0.01
although it has some slow queries:
$ tail /var/log/mongodb/mongodb.log
Tue Sep 27 11:21:33 [initandlisten] connection accepted from 10.0.2.1:58299 #89
Tue Sep 27 11:21:39 [conn24] query my.mesh ntoreturn:25 nscanned:12897 scanAndOrder:1 nreturned:1 reslen:199 101ms
Tue Sep 27 11:22:19 [conn7] query my.mesh ntoreturn:30 nscanned:35679 scanAndOrder:1 nreturned:30 reslen:3992 454ms
Tue Sep 27 11:23:05 [conn37] query my.mesh ntoreturn:30 nscanned:15470 scanAndOrder:1 nreturned:30 reslen:3920 169ms
Tue Sep 27 11:23:23 [conn20] query my.mesh ntoreturn:30 nscanned:12876 scanAndOrder:1 nreturned:30 reslen:3478 287ms
Tue Sep 27 11:24:13 [initandlisten] connection accepted from 10.0.2.1:36205 #90
Tue Sep 27 11:24:32 [conn26] query my.mesh ntoreturn:30 nscanned:30234 scanAndOrder:1 nreturned:30 reslen:3410 452ms
Tue Sep 27 11:24:39 [conn90] end connection 10.0.2.1:36205
Tue Sep 27 11:25:07 [clientcursormon] mem (MB) res:895 virt:37958 mapped:18792
Tue Sep 27 11:25:51 [conn6] query my.mesh ntoreturn:30 nscanned:9120 scanAndOrder:1 nreturned:30 reslen:3495 165ms
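All of these slow queries show scanAndOrder:1 with nscanned far above nreturned, i.e. in-memory sorts over tens of thousands of documents. As a rough, hypothetical sketch (this parser is not part of MongoDB's tooling), the relevant numbers can be pulled out of such 2.0-era log lines like this:

```python
import re

# Matches mongod 2.0-style slow-query log lines like the ones above.
# Hypothetical helper for illustration only.
LINE_RE = re.compile(
    r"query (?P<ns>\S+) .*?"
    r"nscanned:(?P<nscanned>\d+) "
    r"(?:scanAndOrder:(?P<sao>\d) )?"
    r"nreturned:(?P<nreturned>\d+) .*?(?P<ms>\d+)ms"
)

def parse_slow_query(line):
    """Return a dict of query metrics, or None if the line isn't a query."""
    m = LINE_RE.search(line)
    if not m:
        return None
    return {
        "ns": m.group("ns"),
        "nscanned": int(m.group("nscanned")),
        "scan_and_order": m.group("sao") == "1",
        "nreturned": int(m.group("nreturned")),
        "millis": int(m.group("ms")),
    }

sample = ("Tue Sep 27 11:22:19 [conn7] query my.mesh ntoreturn:30 "
          "nscanned:35679 scanAndOrder:1 nreturned:30 reslen:3992 454ms")
info = parse_slow_query(sample)
print(info)
```

A nscanned/nreturned ratio above ~1000 combined with scanAndOrder:1, as here, usually means the sort has no supporting index.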
Some stats while everything is running fine:
$ du -sh /srv/mongodb
25G
> db.mesh.count()
8641843
> db.mesh.totalIndexSize()
2512363440
> db.tinyurl.count()
557262
> db.tinyurl.totalIndexSize()
64526016
$ free
             total       used       free     shared    buffers     cached
Mem:       4056032    3883104     172928          0      34480    3712764
-/+ buffers/cache:     135860    3920172
Swap:            0          0          0
$ mongostat
connected to: 127.0.0.1
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
*0 13 *0 *0 0 4|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 1k 3k 53 shape SEC 11:36:28
*0 29 *0 *0 0 5|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 2k 5k 53 shape SEC 11:36:29
*0 9 *0 *0 0 5|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 994b 3k 53 shape SEC 11:36:30
*0 16 *0 *0 0 9|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 1k 5k 53 shape SEC 11:36:31
*0 47 *0 *0 0 8|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 4k 16k 53 shape SEC 11:36:32
*0 9 *0 *0 0 4|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 936b 3k 53 shape SEC 11:36:33
*0 20 *0 *0 0 8|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 1k 5k 53 shape SEC 11:36:34
*2 7 *0 *0 0 5|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 859b 3k 53 shape SEC 11:36:35
*2 4 *0 *0 0 6|0 0 18.4g 37.3g 825m 0 0 0 0|0 0|0 702b 3k 53 shape SEC 11:36:36
*3 9 *0 *0 0 7|0 0 18.4g 37.3g 749m 1 0.5 0 0|0 0|0 1k 4k 53 shape SEC 11:36:37
$ iostat -x 2
Linux 3.0.0-1-amd64 (mongo) 09/27/2011 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
         68.53  0.00    0.21    1.11   0.00 30.15
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.21 82.28 10.69 3.33 388.30 366.34 107.62 2.05 146.01 9.84 582.77 2.63 3.69
avg-cpu: %user %nice %system %iowait %steal %idle
          0.77  0.00    0.13    0.13   0.00 98.97
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 3.50 0.00 64.00 0.00 36.57 0.03 8.00 8.00 0.00 2.29 0.80
avg-cpu: %user %nice %system %iowait %steal %idle
          0.25  0.00    0.12    0.00   0.00 99.63
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
          0.00  0.00    0.00    0.00   0.00 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
It can run fine for a number of minutes (~10-20, it varies), then suddenly eats up to 100% user CPU and lets the load average slowly grow to 600. Here are some stats while it's at 100% CPU with a load average above 50:
mongostat
connected to: 127.0.0.1
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
*0 6 *0 *0 0 3|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 50|0 825b 2k 101 shape SEC 11:54:50
*0 16 *0 *0 0 7|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 52|0 1k 4k 101 shape SEC 11:54:51
*0 5 *0 *0 0 6|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 51|0 952b 3k 101 shape SEC 11:54:52
*0 9 *0 *0 0 8|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 50|0 1k 4k 101 shape SEC 11:54:53
*0 15 *0 *0 0 4|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 50|0 1k 5k 101 shape SEC 11:54:54
*0 9 *0 *0 0 9|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 52|0 1k 4k 101 shape SEC 11:54:55
*0 12 *0 *0 0 5|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 51|0 1k 6k 100 shape SEC 11:54:56
*0 15 *0 *0 0 6|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 51|0 1k 4k 100 shape SEC 11:54:57
*0 16 *0 *0 0 9|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 51|0 1k 31k 100 shape SEC 11:54:58
*0 31 *0 *0 0 6|0 0 18.4g 37.5g 1.87g 0 0 0 0|0 49|0 2k 6k 101 shape SEC 11:54:59
iostat -x 2
Linux 3.0.0-1-amd64 (mongo) 09/27/2011 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
         68.81  0.00    0.21    1.08   0.00 29.90
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.20 80.21 10.43 3.28 378.63 357.27 107.37 2.00 145.60 9.84 577.20 2.63 3.60
avg-cpu: %user %nice %system %iowait %steal %idle
        100.00  0.00    0.00    0.00   0.00  0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
        100.00  0.00    0.00    0.00   0.00  0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
         99.88  0.00    0.12    0.00   0.00  0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 2.50 0.00 1.50 0.00 16.00 21.33 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
        100.00  0.00    0.00    0.00   0.00  0.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
tail /var/log/mongodb/mongodb.log
Tue Sep 27 11:56:20 [conn394] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 110ms
Tue Sep 27 11:56:21 [conn356] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 160ms
Tue Sep 27 11:56:21 [conn359] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:163 219ms
Tue Sep 27 11:56:21 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 116ms
Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58460 #452
Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58461 #453
Tue Sep 27 11:56:21 [conn452] end connection 10.0.2.1:58460
Tue Sep 27 11:56:21 [conn417] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 192ms
Tue Sep 27 11:56:22 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 166ms
Tue Sep 27 11:56:22 [conn426] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 648ms
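In the mongostat output above, the ar|aw column jumps from 0|0 during the healthy period to around 50|0 during the stall, i.e. ~50 concurrently active readers while disk I/O stays at zero. As a hypothetical sketch (this helper is not part of MongoDB's tooling; the column position follows the header shown above), one could extract that column from a mongostat row to track it over time:

```python
# Hypothetical helper: pull the ar|aw (active readers|writers) column out of
# a mongostat row. Field index 14 matches the header printed in this report:
# insert query update delete getmore command flushes mapped vsize res
# faults locked% idxmiss% qr|qw ar|aw netIn netOut conn set repl time
def active_readers_writers(row):
    fields = row.split()
    ar, aw = (int(x) for x in fields[14].split("|"))
    return ar, aw

row = ("*0 6 *0 *0 0 3|0 0 18.4g 37.5g 1.87g 0 0 0 "
       "0|0 50|0 825b 2k 101 shape SEC 11:54:50")
print(active_readers_writers(row))
```

Fifty active readers with 0% disk utilization is consistent with the queries spinning on CPU rather than waiting on I/O.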
I don't have hyperthreading or NUMA.
I graphed it with Munin and I can't see anything interesting when comparing periods where mongo is fine with periods where it's slow:
- Iostat is stable: < 100 blocks read/s and < 20 blocks written/s
- Btree stats do not vary
- Current connections go from ~50 to ~800, which is expected since every query becomes very slow
- Mongo memory usage does not vary
- MongoDB ops decrease (which is expected, as every op takes much longer)
- Write lock percentage is stable at around 0.02%
- vmstat's running-process count goes from 0 to 800 while iowait remains at 0
- When it's slow, the CPU graph looks like a barcode: sometimes nothing, sometimes 100% user
cat /proc/cpuinfo shows this 4 times:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU X3363 @ 2.83GHz
stepping : 10
cpu MHz : 2833.168
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority
bogomips : 5666.33
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Is related to:
SERVER-2212 Server automatically abort queries/commands after user-specified time limit (Closed)