Following upgrades from 2.6.9 to 3.0.9 (still using MMAPv1), we noticed significantly higher disk IO against the volume hosting MongoDB's data files.
This has become particularly apparent on replica sets with large numbers of databases (multiple thousands).
From investigation, this appears to be caused by a change in MongoDB's behaviour when reading ns files.
To give a precise example, we have a replica set that is currently in the process of being upgraded. It has 3 x 2.6.9 nodes and 1 x 3.0.9 node (hidden, non-voting).
The replica set has 5570 databases and uses the 16MB default ns size. If MongoDB loaded all of these ns files into memory, it would require 87GB of memory.
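For reference, the 87GB figure is simply the database count multiplied by the default ns file size (a rough check that assumes every database has its full 16MB .ns file allocated):
# echo "5570 * 16 / 1024" | bc -l
87.03125000000000000000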
The existing 2.6.9 nodes run comfortably as EC2 r3.large instances (14GB RAM), and running vmtouch shows that only a tiny percentage of the ns file pages are loaded into the filesystem cache:
# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
Files: 5570
Directories: 0
Resident Pages: 188549/22814720 736M/87G 0.826%
Elapsed: 0.97846 seconds
However, running the 3.0.9 node as an r3.large makes it unusable, as the filesystem cache is constantly flooded with the ns files (and the server takes 1hr 26 mins to start):
# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
Files: 5570
Directories: 0
Resident Pages: 2905047/22814720 11G/87G 12.7%
Elapsed: 0.67599 seconds
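If it helps with reproduction, the same residency figure can be sampled continuously while the 3.0.9 node starts up, which makes the cache flooding easy to watch in real time (a minimal sketch using the same vmtouch binary and data path as above):
# watch -n 60 './vmtouch /var/lib/mongodb/*.ns | grep "Resident Pages"'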
The server then constantly performs a significant amount of read IO, presumably in an attempt to keep the entire contents of the ns files resident in memory:
# iostat -x 1 xvdg
Linux 3.13.0-77-generic (SERVER) 04/19/2016 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
3.43 0.06 2.26 46.98 0.62 46.65
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdg 0.28 1.57 2185.88 21.08 33805.04 521.00 31.11 2.68 1.21 0.80 43.97 0.43 94.96
avg-cpu: %user %nice %system %iowait %steal %idle
18.75 0.00 3.12 40.62 0.00 37.50
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdg 0.00 1.00 2430.00 73.00 37996.00 480.00 30.74 1.72 0.69 0.68 0.99 0.35 88.40
avg-cpu: %user %nice %system %iowait %steal %idle
6.28 0.00 3.14 45.03 0.00 45.55
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdg 0.00 0.00 2285.00 0.00 35184.00 0.00 30.80 1.65 0.72 0.72 0.00 0.40 92.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.57 0.00 3.66 45.55 0.52 48.69
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
xvdg 0.00 81.00 2525.00 136.00 40132.00 16740.00 42.74 9.04 3.40 0.64 54.56 0.36 95.60
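As a rough back-of-envelope check (assuming, as above, that the bulk of those reads are ns pages), a sustained ~35MB/s of reads against 87GB of ns files means it takes over 40 minutes just to pull every ns page in once, which is in the same ballpark as the startup times we are seeing:
# echo "87 * 1024 / 35 / 60" | bc -l
42.42285714285714285714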
Changing the instance type to an r3.4xlarge (122GB RAM) alleviates the problem, as there is now enough memory for all of the ns files to remain resident (and the server starts in 35 minutes, with the IO subsystem being the limiting factor):
# ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
Files: 5572
Directories: 0
Resident Pages: 22822912/22822912 87G/87G 100%
Elapsed: 0.94295 seconds
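For completeness, the comparison that matters here can be reproduced with stock tools: the total on-disk size of the ns files versus the RAM available to the node (a sketch, assuming the same data path as above; on this setup the two commands report roughly 87G and 122GB respectively):
# du -shc /var/lib/mongodb/*.ns | tail -1
# free -g | grep ^Mem: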
This isn't a feasible option for us, though: an r3.4xlarge instance costs $1,102 for a 31-day month compared to $137 for an r3.large, and across a 3-node replica set that is clearly a lot of money.
Is duplicated by: SERVER-24824 - Mongo 3.0.12 with MMAPv1 can't serve more than 1k qps (Closed)