- Type: Bug
- Resolution: Done
- Priority: Critical - P2
- Affects Version/s: 3.0.0-rc6
- Component/s: WiredTiger
- None
- Fully Compatible
- ALL
Ran into this issue while running a mongo-perf test against WiredTiger. In a collection with 100 documents and a high number of concurrent threads updating them (seen with thread counts > 80), throughput would drop significantly after 10 to 20 seconds. After this occurred, commands like "show dbs" in the shell would not return.
Additional observations so far:
- Saw this in a single-node replica set; without the oplog, did not see this issue.
- Only saw this in RC6; did not see it in RC5.
- Have been able to reproduce this consistently on a couple of different machines.
mongostat output while this is happening. Notice how the throughput drop corresponds to the very suspicious change in the "% dirty" and "% used" columns; it almost looks like some kind of overflow/underflow has occurred.
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 *0 *0 0 1|0 0.0 0.0 0 626.0M 46.0M 0|0 0|0 79b 16k 2 cap1442 PRI 04:07:26
100 *0 15019 *0 0 114|0 0.0 0.0 0 723.0M 90.0M 0|0 0|75 2m 58k 83 cap1442 PRI 04:07:27
*0 *0 30118 *0 0 303|0 0.0 0.0 0 740.0M 107.0M 0|0 0|76 4m 60k 83 cap1442 PRI 04:07:28
*0 *0 33133 *0 0 328|0 0.0 0.0 0 752.0M 119.0M 0|0 0|74 5m 63k 83 cap1442 PRI 04:07:29
*0 *0 33137 *0 0 334|0 0.0 0.0 0 761.0M 128.0M 0|0 0|71 5m 64k 83 cap1442 PRI 04:07:30
*0 *0 28873 *0 0 286|0 0.0 0.0 0 767.0M 133.0M 0|0 0|70 4m 57k 83 cap1442 PRI 04:07:31
*0 *0 31345 *0 0 321|0 0.0 0.0 0 776.0M 143.0M 0|0 0|77 4m 62k 83 cap1442 PRI 04:07:32
*0 *0 31867 *0 0 318|0 0.0 0.0 0 785.0M 151.0M 0|0 0|64 4m 62k 83 cap1442 PRI 04:07:33
*0 *0 33964 *0 0 345|0 0.0 0.0 0 795.0M 162.0M 0|0 0|76 5m 66k 83 cap1442 PRI 04:07:34
*0 *0 32205 *0 0 323|0 0.1 0.1 0 799.0M 166.0M 0|0 0|80 4m 62k 83 cap1442 PRI 04:07:35
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 32151 *0 0 325|0 0.1 0.1 0 806.0M 172.0M 0|0 0|70 4m 63k 83 cap1442 PRI 04:07:36
*0 *0 30149 *0 0 301|0 0.0 0.0 0 810.0M 176.0M 0|0 0|62 4m 59k 83 cap1442 PRI 04:07:37
*0 *0 33550 *0 0 338|0 0.0 0.0 0 810.0M 178.0M 0|0 0|69 5m 65k 83 cap1442 PRI 04:07:38
*0 *0 33399 *0 0 332|0 0.0 0.0 0 811.0M 178.0M 0|0 0|53 5m 64k 83 cap1442 PRI 04:07:39
*0 *0 30400 *0 0 307|0 0.1 0.1 0 811.0M 178.0M 0|0 0|71 4m 60k 83 cap1442 PRI 04:07:40
*0 *0 31555 *0 0 316|0 0.0 0.0 0 813.0M 181.0M 0|0 0|69 4m 62k 83 cap1442 PRI 04:07:41
*0 *0 30165 *0 0 309|0 0.0 0.0 0 820.0M 186.0M 0|0 0|66 4m 60k 83 cap1442 PRI 04:07:42
*0 *0 33189 *0 0 326|0 0.1 0.1 0 829.0M 196.0M 0|0 0|74 5m 63k 83 cap1442 PRI 04:07:43
*0 *0 32086 *0 0 324|0 0.1 0.1 0 841.0M 209.0M 0|0 0|65 4m 63k 83 cap1442 PRI 04:07:44
*0 *0 2503 *0 0 22|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 347k 19k 83 cap1442 PRI 04:07:45
insert query update delete getmore command % dirty % used flushes vsize res qr|qw ar|aw netIn netOut conn set repl time
*0 *0 948 *0 0 11|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 132k 18k 83 cap1442 PRI 04:07:46
*0 *0 878 *0 0 12|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 122k 18k 83 cap1442 PRI 04:07:47
*0 *0 987 *0 0 12|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 137k 18k 83 cap1442 PRI 04:07:48
*0 *0 626 *0 0 7|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 87k 17k 83 cap1442 PRI 04:07:49
*0 *0 356 *0 0 6|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 50k 17k 83 cap1442 PRI 04:07:50
*0 *0 61 *0 0 1|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 8k 16k 83 cap1442 PRI 04:07:51
*0 *0 306 *0 0 5|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 43k 17k 83 cap1442 PRI 04:07:52
*0 *0 995 *0 0 10|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 138k 17k 83 cap1442 PRI 04:07:53
*0 *0 415 *0 0 5|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 58k 17k 83 cap1442 PRI 04:07:54
*0 *0 56 *0 0 3|0 29620464110.3 29620464110.3 0 866.0M 210.0M 0|0 0|80 8k 16k 83 cap1442 PRI 04:07:55
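The huge 29620464110.3 figures look like what you would get if an unsigned byte counter were decremented below zero before being turned into a percentage. The Python snippet below is purely illustrative (the counter name, cache size, and arithmetic are assumptions, not taken from the WiredTiger source); it only shows that a 64-bit unsigned wraparound divided by a realistic cache size lands in the same absurd range as the values reported above.

# Illustrative only: an unsigned 64-bit byte counter decremented below zero
# wraps around to ~1.8e19, so converting it to a percentage of any realistic
# cache size yields a value in the tens of billions, like the output above.
CACHE_SIZE_BYTES = 62 * 1024**3          # hypothetical cache size for this host
U64 = 2**64                              # emulate unsigned 64-bit wraparound

dirty_bytes = 50                         # counter is nearly drained...
dirty_bytes = (dirty_bytes - 100) % U64  # ...then decremented past zero and wraps

pct_dirty = dirty_bytes / CACHE_SIZE_BYTES * 100
print(f"% dirty = {pct_dirty:.1f}")      # ~2.8e10: same order of magnitude as
                                         # the 29620464110.3 reported above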
"top" output shows a correlation between the number of connections and the amount of CPU used by mongod. At the beginning of the tests when we pumped out high throughput, CPU utilization was very high. After the system got caught in this funny state with very low throughput, CPU start dropping quickly. Eventually, CPU utilization by mongod seems to settle down around 200% when the number of connections stayed at 4. Somehow we seem to have 3 connections that are stuck (in successful tests, the connection count drops to 1).
Tasks: 555 total, 1 running, 554 sleeping, 0 stopped, 0 zombie
%Cpu(s): 6.3 us, 0.0 sy, 0.0 ni, 93.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 61406272 total, 1460576 used, 59945696 free, 6172 buffers
KiB Swap: 0 total, 0 used, 0 free. 529896 cached Mem

  PID USER     PR NI   VIRT    RES  SHR S  %CPU %MEM    TIME+ COMMAND
59083 ec2-user 20  0  836692 224424 6776 S 200.5  0.4 77:13.04 mongod
59696 ec2-user 20  0  725296  15948 8280 S   0.7  0.0  0:00.24 mongo
60570 ec2-user 20  0  123948   2056 1160 R   0.3  0.0  0:00.05 top
This is very repeatable; I will attach the repro steps and scripts.
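Until the actual scripts are attached, here is a minimal sketch of the workload, assuming a pymongo client against a mongod already running on localhost:27017 as a single-node replica set with WiredTiger; the database/collection names ("perftest.updates"), thread count, and duration are illustrative choices, not the attached scripts.

# Minimal repro sketch: 100 documents, ~96 threads issuing $inc updates.
# Run mongostat in another terminal to watch update throughput and "% dirty".
import threading
import time

from pymongo import MongoClient

NUM_DOCS = 100        # small collection, as in the description above
NUM_THREADS = 96      # issue reproduced with thread counts > 80
RUN_SECONDS = 60      # throughput dropped after roughly 10-20 seconds

def worker(stop, totals, idx):
    coll = MongoClient("localhost", 27017)["perftest"]["updates"]
    n = 0
    while not stop.is_set():
        # Each thread cycles through the documents, incrementing a counter field.
        coll.update_one({"_id": (idx + n) % NUM_DOCS}, {"$inc": {"x": 1}})
        n += 1
    totals[idx] = n

def main():
    coll = MongoClient("localhost", 27017)["perftest"]["updates"]
    coll.drop()
    coll.insert_many([{"_id": i, "x": 0} for i in range(NUM_DOCS)])

    stop = threading.Event()
    totals = [0] * NUM_THREADS
    threads = [threading.Thread(target=worker, args=(stop, totals, i))
               for i in range(NUM_THREADS)]
    for t in threads:
        t.start()
    time.sleep(RUN_SECONDS)
    stop.set()
    for t in threads:
        t.join()
    print("total updates issued:", sum(totals))

if __name__ == "__main__":
    main()

In a failing run, the per-second update counts in mongostat fall off a cliff after 10 to 20 seconds and "show dbs" from a mongo shell stops returning, as described above.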
- related to: SERVER-17100 mongod hung running mongo-perf (Closed)