Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-3960

High CPU usage no IO Wait

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.0.0
    • Component/s: Querying
    • None
    • Environment:
      Debian wheezy (I'm in testing cause I prefered to get packaged mongo, before the 2.0). It's a secondary behind a Mongo 1.8 in a replica set with an arbiter.
    • Linux

      I'm running a fresh install of mongo 2.0.0 (git version 695c67df)
      It works almost well :
      1->30% usr cpu, 0% wait, load average : 0.01
      Albeit it have some slow queries :

      $ tail /var/log/mongodb/mongodb.log 
      Tue Sep 27 11:21:33 [initandlisten] connection accepted from 10.0.2.1:58299 #89
      Tue Sep 27 11:21:39 [conn24] query my.mesh ntoreturn:25 nscanned:12897 scanAndOrder:1 nreturned:1 reslen:199 101ms
      Tue Sep 27 11:22:19 [conn7] query my.mesh ntoreturn:30 nscanned:35679 scanAndOrder:1 nreturned:30 reslen:3992 454ms
      Tue Sep 27 11:23:05 [conn37] query my.mesh ntoreturn:30 nscanned:15470 scanAndOrder:1 nreturned:30 reslen:3920 169ms
      Tue Sep 27 11:23:23 [conn20] query my.mesh ntoreturn:30 nscanned:12876 scanAndOrder:1 nreturned:30 reslen:3478 287ms
      Tue Sep 27 11:24:13 [initandlisten] connection accepted from 10.0.2.1:36205 #90
      Tue Sep 27 11:24:32 [conn26] query my.mesh ntoreturn:30 nscanned:30234 scanAndOrder:1 nreturned:30 reslen:3410 452ms
      Tue Sep 27 11:24:39 [conn90] end connection 10.0.2.1:36205
      Tue Sep 27 11:25:07 [clientcursormon] mem (MB) res:895 virt:37958 mapped:18792
      Tue Sep 27 11:25:51 [conn6] query my.mesh ntoreturn:30 nscanned:9120 scanAndOrder:1 nreturned:30 reslen:3495 165ms
      

      Some stats where all is running fine :

      $ du -sh /srv/mongodb
      25G
      > db.mesh.count()
      8641843
      > db.mesh.totalIndexSize()
      2512363440
      > db.tinyurl.count()
      557262
      > db.tinyurl.totalIndexSize()
      64526016
      
      $ free
                   total       used       free     shared    buffers     cached
      Mem:       4056032    3883104     172928          0      34480    3712764
      -/+ buffers/cache:     135860    3920172
      Swap:            0          0          0
      
      $ mongostat
      connected to: 127.0.0.1
      insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn   set repl       time 
          *0     13     *0     *0       0     4|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     3k    53 shape  SEC   11:36:28 
          *0     29     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     2k     5k    53 shape  SEC   11:36:29 
          *0      9     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   994b     3k    53 shape  SEC   11:36:30 
          *0     16     *0     *0       0     9|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     5k    53 shape  SEC   11:36:31 
          *0     47     *0     *0       0     8|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     4k    16k    53 shape  SEC   11:36:32 
          *0      9     *0     *0       0     4|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   936b     3k    53 shape  SEC   11:36:33 
          *0     20     *0     *0       0     8|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0     1k     5k    53 shape  SEC   11:36:34 
          *2      7     *0     *0       0     5|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   859b     3k    53 shape  SEC   11:36:35 
          *2      4     *0     *0       0     6|0       0  18.4g  37.3g   825m      0        0          0       0|0     0|0   702b     3k    53 shape  SEC   11:36:36 
          *3      9     *0     *0       0     7|0       0  18.4g  37.3g   749m      1      0.5          0       0|0     0|0     1k     4k    53 shape  SEC   11:36:37 
      
      $ iostat -x 2
      Linux 3.0.0-1-amd64 (mongo)    09/27/2011      _x86_64_        (4 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                68.53    0.00    0.21    1.11    0.00   30.15
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.21    82.28   10.69    3.33   388.30   366.34   107.62     2.05  146.01    9.84  582.77   2.63   3.69
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.77    0.00    0.13    0.13    0.00   98.97
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    3.50    0.00    64.00     0.00    36.57     0.03    8.00    8.00    0.00   2.29   0.80
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.25    0.00    0.12    0.00    0.00   99.63
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 0.00    0.00    0.00    0.00    0.00  100.00
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      
      
      

      It can run fine for some couples of minutes (~10, 20, depends), and suddenly eat up to 100% user CPU let the load average grow slowly to 600, here some stats where he's at 100% CPU and > 50 Load average

      mongostat
      connected to: 127.0.0.1
      insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn   set repl       time 
          *0      6     *0     *0       0     3|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0   825b     2k   101 shape  SEC   11:54:50 
          *0     16     *0     *0       0     7|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    52|0     1k     4k   101 shape  SEC   11:54:51 
          *0      5     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0   952b     3k   101 shape  SEC   11:54:52 
          *0      9     *0     *0       0     8|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0     1k     4k   101 shape  SEC   11:54:53 
          *0     15     *0     *0       0     4|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    50|0     1k     5k   101 shape  SEC   11:54:54 
          *0      9     *0     *0       0     9|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    52|0     1k     4k   101 shape  SEC   11:54:55 
          *0     12     *0     *0       0     5|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k     6k   100 shape  SEC   11:54:56 
          *0     15     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k     4k   100 shape  SEC   11:54:57 
          *0     16     *0     *0       0     9|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    51|0     1k    31k   100 shape  SEC   11:54:58 
          *0     31     *0     *0       0     6|0       0  18.4g  37.5g  1.87g      0        0          0       0|0    49|0     2k     6k   101 shape  SEC   11:54:59 
      
      
      iostat -x 2
      Linux 3.0.0-1-amd64 (mongo)    09/27/2011      _x86_64_        (4 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                68.81    0.00    0.21    1.08    0.00   29.90
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.20    80.21   10.43    3.28   378.63   357.27   107.37     2.00  145.60    9.84  577.20   2.63   3.60
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               100.00    0.00    0.00    0.00    0.00    0.00
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               100.00    0.00    0.00    0.00    0.00    0.00
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                99.88    0.00    0.12    0.00    0.00    0.00
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     2.50    0.00    1.50     0.00    16.00    21.33     0.00    0.00    0.00    0.00   0.00   0.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               100.00    0.00    0.00    0.00    0.00    0.00
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
      
      tail /var/log/mongodb/mongodb.log 
      Tue Sep 27 11:56:20 [conn394] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 110ms
      Tue Sep 27 11:56:21 [conn356] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 160ms
      Tue Sep 27 11:56:21 [conn359] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:163 219ms
      Tue Sep 27 11:56:21 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 116ms
      Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58460 #452
      Tue Sep 27 11:56:21 [initandlisten] connection accepted from 10.0.2.1:58461 #453
      Tue Sep 27 11:56:21 [conn452] end connection 10.0.2.1:58460
      Tue Sep 27 11:56:21 [conn417] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 192ms
      Tue Sep 27 11:56:22 [conn366] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 166ms
      Tue Sep 27 11:56:22 [conn426] query my.tinyurl ntoreturn:1 nscanned:1 nreturned:1 reslen:91 648ms
      
      

      I don't have hyperthreading nor NUMA.
      I graphed it on munin and I can't see something interesting comparing period where mongo is find and period where it's slow :

      • Iostat is stable < 100 blocks read / s and < 20 blocks write / s
      • Btree stats do not vary
      • Curront connections go from ~50 to ~800, that is normal as every query went very slow
      • Mongo memory usage do not vary
      • MongoDB ops decrease (that is normal, as every ops take much time)
      • Write lock percentage is stable around 0.02%
      • vmstat running go from 0 to 800 and iowait remains at 0
      • When slow, cpu is like a barcode ! sometime not graphing sometime 100% user

      cat /proc/cpuinfo gave 4 times :

      processor       : 0
      vendor_id       : GenuineIntel
      cpu family      : 6
      model           : 23
      model name      : Intel(R) Xeon(R) CPU           X3363  @ 2.83GHz
      stepping        : 10
      cpu MHz         : 2833.168
      cache size      : 6144 KB
      physical id     : 0
      siblings        : 4
      core id         : 0
      cpu cores       : 4
      apicid          : 0
      initial apicid  : 0
      fpu             : yes
      fpu_exception   : yes
      cpuid level     : 13
      wp              : yes
      flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm dts tpr_shadow vnmi flexpriority
      bogomips        : 5666.33
      clflush size    : 64
      cache_alignment : 64
      address sizes   : 36 bits physical, 48 bits virtual
      power management:
      

            Assignee:
            Unassigned Unassigned
            Reporter:
            root Julien
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: