Loading...

Type: Bug
Resolution: Done
Priority: Critical - P2
Fix Version/s: 3.0.0-rc8
Affects Version/s: 3.0.0-rc6
Component/s: WiredTiger
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Steps To Reproduce:

Hide

1. start mongod with oplog and run rs.initiate() in the shell.
2. Run the attached IncFewSmallDoc.js test.

Show
1. start mongod with oplog and run rs.initiate() in the shell. 2. Run the attached IncFewSmallDoc.js test.
Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Ran into this issue while running a mongo-perf test against wiredTiger. In a collection with 100 documents and a high number of concurrent threads (seem this with thread count > 80) updating them, throughput would drop significantly after 10 to 20 seconds. After this occured, commands like "show dbs" in the shell would not return.

Additional information observed so far.

Saw this in a single node replset. Without oplog, did not see this issue.
Only saw this in RC6. Did not see this in RC5.
Have been able to consistently reproduce this on a couple different machines.

mongostat output when this is happening. Notice how the throughput drop is corresponding to the very suspicious change in the "%dirty" and "%used" columns. Almost seems like some kind of overflow/underflow has occured.

insert query update delete getmore command % dirty % used flushes  vsize   res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0     *0     *0       0     1|0     0.0    0.0       0 626.0M 46.0M   0|0   0|0   79b    16k    2 cap1442  PRI 04:07:26
   100    *0  15019     *0       0   114|0     0.0    0.0       0 723.0M 90.0M   0|0  0|75    2m    58k   83 cap1442  PRI 04:07:27
    *0    *0  30118     *0       0   303|0     0.0    0.0       0 740.0M 107.0M   0|0  0|76    4m    60k   83 cap1442  PRI 04:07:28
    *0    *0  33133     *0       0   328|0     0.0    0.0       0 752.0M 119.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:29
    *0    *0  33137     *0       0   334|0     0.0    0.0       0 761.0M 128.0M   0|0  0|71    5m    64k   83 cap1442  PRI 04:07:30
    *0    *0  28873     *0       0   286|0     0.0    0.0       0 767.0M 133.0M   0|0  0|70    4m    57k   83 cap1442  PRI 04:07:31
    *0    *0  31345     *0       0   321|0     0.0    0.0       0 776.0M 143.0M   0|0  0|77    4m    62k   83 cap1442  PRI 04:07:32
    *0    *0  31867     *0       0   318|0     0.0    0.0       0 785.0M 151.0M   0|0  0|64    4m    62k   83 cap1442  PRI 04:07:33
    *0    *0  33964     *0       0   345|0     0.0    0.0       0 795.0M 162.0M   0|0  0|76    5m    66k   83 cap1442  PRI 04:07:34
    *0    *0  32205     *0       0   323|0     0.1    0.1       0 799.0M 166.0M   0|0  0|80    4m    62k   83 cap1442  PRI 04:07:35
insert query update delete getmore command % dirty % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0  32151     *0       0   325|0     0.1    0.1       0 806.0M 172.0M   0|0  0|70    4m    63k   83 cap1442  PRI 04:07:36
    *0    *0  30149     *0       0   301|0     0.0    0.0       0 810.0M 176.0M   0|0  0|62    4m    59k   83 cap1442  PRI 04:07:37
    *0    *0  33550     *0       0   338|0     0.0    0.0       0 810.0M 178.0M   0|0  0|69    5m    65k   83 cap1442  PRI 04:07:38
    *0    *0  33399     *0       0   332|0     0.0    0.0       0 811.0M 178.0M   0|0  0|53    5m    64k   83 cap1442  PRI 04:07:39
    *0    *0  30400     *0       0   307|0     0.1    0.1       0 811.0M 178.0M   0|0  0|71    4m    60k   83 cap1442  PRI 04:07:40
    *0    *0  31555     *0       0   316|0     0.0    0.0       0 813.0M 181.0M   0|0  0|69    4m    62k   83 cap1442  PRI 04:07:41
    *0    *0  30165     *0       0   309|0     0.0    0.0       0 820.0M 186.0M   0|0  0|66    4m    60k   83 cap1442  PRI 04:07:42
    *0    *0  33189     *0       0   326|0     0.1    0.1       0 829.0M 196.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:43
    *0    *0  32086     *0       0   324|0     0.1    0.1       0 841.0M 209.0M   0|0  0|65    4m    63k   83 cap1442  PRI 04:07:44
    *0    *0   2503     *0       0    22|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  347k    19k   83 cap1442  PRI 04:07:45
insert query update delete getmore command       % dirty        % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0    948     *0       0    11|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  132k    18k   83 cap1442  PRI 04:07:46
    *0    *0    878     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  122k    18k   83 cap1442  PRI 04:07:47
    *0    *0    987     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  137k    18k   83 cap1442  PRI 04:07:48
    *0    *0    626     *0       0     7|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   87k    17k   83 cap1442  PRI 04:07:49
    *0    *0    356     *0       0     6|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   50k    17k   83 cap1442  PRI 04:07:50
    *0    *0     61     *0       0     1|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:51
    *0    *0    306     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   43k    17k   83 cap1442  PRI 04:07:52
    *0    *0    995     *0       0    10|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  138k    17k   83 cap1442  PRI 04:07:53
    *0    *0    415     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   58k    17k   83 cap1442  PRI 04:07:54
    *0    *0     56     *0       0     3|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:55

"top" output shows a correlation between the number of connections and the amount of CPU used by mongod. At the beginning of the tests when we pumped out high throughput, CPU utilization was very high. After the system got caught in this funny state with very low throughput, CPU start dropping quickly. Eventually, CPU utilization by mongod seems to settle down around 200% when the number of connections stayed at 4. Somehow we seem to have 3 connections that are stuck (in successful tests, the connection count drops to 1).

Tasks: 555 total,   1 running, 554 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.3 us,  0.0 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  61406272 total,  1460576 used, 59945696 free,     6172 buffers
KiB Swap:        0 total,        0 used,        0 free.   529896 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                             
 59083 ec2-user  20   0  836692 224424   6776 S 200.5  0.4  77:13.04 mongod                                                                                              
 59696 ec2-user  20   0  725296  15948   8280 S   0.7  0.0   0:00.24 mongo                                                                                               
 60570 ec2-user  20   0  123948   2056   1160 R   0.3  0.0   0:00.05 top

It is very repeatable and I will put the repo steps and scripts in.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

IncFewSmallDoc.js
1 kB
Jan 27 2015 10:08:18 AM UTC
cap-1442.log
2.22 MB
Jan 27 2015 10:14:24 AM UTC

related to

SERVER-17100 mongod hung running mongo-perf

Closed

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates