Type: Bug
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: None
Component/s: None
Operating System: ALL
We upgraded a secondary of a 3-node cluster to 3.2.9.
By default, when we upgrade we use iptables to allow replication but block clients.
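For reference, the rules are roughly along these lines; the member addresses and port below are illustrative placeholders, not our exact configuration:
# Allow replication traffic from the other replica set members (addresses are placeholders)
iptables -A INPUT -p tcp --dport 27017 -s 10.0.0.11 -j ACCEPT
iptables -A INPUT -p tcp --dport 27017 -s 10.0.0.12 -j ACCEPT
# Reject everything else reaching mongod, i.e. application clients
iptables -A INPUT -p tcp --dport 27017 -j REJECT
# Removing the REJECT rule re-admits clients after the upgrade
iptables -D INPUT -p tcp --dport 27017 -j REJECT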
Upon allowing clients, cache usage went up to ~96% and failed to drop. Only 1 of 16 cores appeared to be in use.
Blocking clients, restarting, and allowing only replication let the oplog catch up, but over time the cache still fills and performance hits rock bottom.
mongostat:
insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn   set repl                 time
    *6    *0    *14     *1       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.23k   127k   16 floow  SEC 2016-09-10T20:18:27Z
    *1    *0    *21     *2       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.07k   127k   16 floow  SEC 2016-09-10T20:18:28Z
    *0    *0     *0     *0       0     9|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16  917b  93.2k   16 floow  SEC 2016-09-10T20:18:29Z
    *9    *0    *29     *1       0    12|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.01k   126k   16 floow  SEC 2016-09-10T20:18:30Z
    *2    *0     *4     *1       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.17k   126k   16 floow  SEC 2016-09-10T20:18:31Z
   *24    *0   *161    *10       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|15 1.13k   127k   16 floow  SEC 2016-09-10T20:18:32Z
iostat:
09/10/2016 08:18:51 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.51    0.00    0.06    0.13    0.00   93.30

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvda       0.00    0.00  22.00   0.00   92.00    0.00     8.36     0.04   2.00    2.00    0.00   0.18   0.40
xvdh       0.00   99.00  21.00  17.00  260.00 1292.00    81.68     0.06   1.47    0.57    2.59   0.63   2.40
xvdz       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
replication status (command took 10-15min to return):
db-node2(mongod-3.2.9)[SECONDARY:floow] test> rs.printReplicationInfo()
configured oplog size:   614400MB
log length start to end: 958414secs (266.23hrs)
oplog first event time:  Tue Aug 30 2016 14:06:44 GMT+0000 (UTC)
oplog last event time:   Sat Sep 10 2016 16:20:18 GMT+0000 (UTC)
now:                     Sat Sep 10 2016 20:25:01 GMT+0000 (UTC)
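The cache figures reported by mongostat can be cross-checked directly against serverStatus; a minimal sketch, run from a shell against the affected secondary (host and port are illustrative):
mongo --host db-node2 --port 27017 admin --eval '
  var c = db.serverStatus().wiredTiger.cache;
  print("maximum bytes configured:         " + c["maximum bytes configured"]);
  print("bytes currently in the cache:     " + c["bytes currently in the cache"]);
  print("tracked dirty bytes in the cache: " + c["tracked dirty bytes in the cache"]);
'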
Upon restart (which often takes ages), replication catches up, but then the cache fills and the scenario repeats.
Note: the other nodes are still running 3.0.
I also experimented with changing WiredTiger (WT) parameters, with no joy.
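For context, the kind of change tried was reconfiguring WiredTiger eviction at runtime; a sketch with illustrative values, not necessarily the exact ones we used:
# Adjust eviction thread counts and the dirty-cache trigger on the running mongod
mongo --host db-node2 admin --eval '
  db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "eviction=(threads_min=4,threads_max=8),eviction_dirty_trigger=10"
  })
'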
We will downgrade, but we are leaving this node at 3.2.9 (with low priority) for now to allow for diagnostics and logs if required.
With 3.0 we still have cache-filling issues, but they only occur once or twice a month. With our workload, MMAPv1 was pretty much maintenance-free (very stable, minimal issues except the disk usage), 3.0 WT causes some pain but is manageable, and 3.2 WT is unusable.
duplicates:
- SERVER-25974: Application threads stall for extended period when cache fills (Closed)