Core Server / SERVER-23798

Increased ns file IO in 3.0

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.0.9
    • Component/s: MMAPv1, Storage Execution
    • Operating System: ALL
    Steps to Reproduce:

      Create a MongoDB 2.6 instance using MMAPv1 with enough databases that the cumulative size of their ns files is greater than available physical memory on the server.

      Monitor the filesystem cache usage and disk IO on the server.

      Upgrade this server to MongoDB 3.0 (still using MMAPv1) and monitor the same metrics.
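
      For step 1, a minimal way to script the database creation (a sketch, assuming a local mongod running MMAPv1 with the default 16MB nsSize, the PyMongo driver, and an illustrative count of 5000 databases; the "loadtest_db_*" naming is hypothetical):

      from pymongo import MongoClient

      client = MongoClient("localhost", 27017)
      for i in range(5000):
          # MMAPv1 creates each database's .ns file lazily, on first write.
          client["loadtest_db_%04d" % i]["c"].insert_one({"x": 1})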

    Description:

      Following upgrades from 2.6.9 to 3.0.9 (still using MMAPv1), we noticed significantly higher disk IO against the volume hosting MongoDB's data files.

      This has become particularly apparent on replica sets with large numbers of databases (multiple thousands).

      From investigation, this appears to be caused by a change in MongoDB's behaviour when reading ns files.

      To give a precise example, we have a replica set that is currently in the process of being upgraded. It has 3 x 2.6.9 nodes and 1 x 3.0.9 node (hidden, non-voting).

      The replica set has 5570 databases and uses the 16MB default ns size. If MongoDB loaded all of these ns files into memory, it would require 87GB of memory.
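
      (As a quick check of that figure, as trivial arithmetic: 5570 files at 16MB each.)

      print(5570 * 16 / 1024.0)  # 87.03 -> ~87GB of ns file data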

      The existing 2.6.9 nodes run comfortably on EC2 r3.large instances (14GB RAM), and running vmtouch shows that only a tiny percentage of the ns files' pages are resident in the filesystem cache:

      # ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
      
                 Files: 5570
           Directories: 0
        Resident Pages: 188549/22814720  736M/87G  0.826%
               Elapsed: 0.97846 seconds
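
      (For reference, the residency numbers vmtouch reports can be reproduced with the mincore(2) syscall; below is a rough Python sketch, assuming Linux, glibc and CPython, using the same /var/lib/mongodb/*.ns paths as above.)

      import ctypes
      import glob
      import mmap
      import os

      libc = ctypes.CDLL("libc.so.6", use_errno=True)
      PAGE = mmap.PAGESIZE

      def resident_pages(path):
          # Map the file privately (mmap itself faults nothing in), then ask
          # the kernel which pages are currently resident in the page cache.
          size = os.path.getsize(path)
          if size == 0:
              return 0, 0
          with open(path, "rb") as f:
              m = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE)
          npages = (size + PAGE - 1) // PAGE
          vec = (ctypes.c_ubyte * npages)()
          addr = ctypes.addressof(ctypes.c_char.from_buffer(m))
          if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec) != 0:
              errno = ctypes.get_errno()
              raise OSError(errno, os.strerror(errno))
          resident = sum(b & 1 for b in vec)
          m.close()
          return resident, npages

      res = tot = 0
      for ns in sorted(glob.glob("/var/lib/mongodb/*.ns")):
          r, n = resident_pages(ns)
          res, tot = res + r, tot + n
      if tot:
          print("Resident Pages: %d/%d  %.3f%%" % (res, tot, 100.0 * res / tot))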
      

      However, running the 3.0.9 node as an r3.large makes it unusable, as the filesystem cache is constantly flooded with the ns files (and the server takes 1hr 26 mins to start):

      # ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
      
                 Files: 5570
           Directories: 0
        Resident Pages: 2905047/22814720  11G/87G  12.7%
               Elapsed: 0.67599 seconds
      

      The server then performs a constant, significant amount of read IO, presumably as it tries to keep the entire contents of the ns files resident in memory:

      # iostat -x 1 xvdg
      Linux 3.13.0-77-generic (SERVER) 	04/19/2016 	_x86_64_	(2 CPU)
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 3.43    0.06    2.26   46.98    0.62   46.65
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      xvdg              0.28     1.57 2185.88   21.08 33805.04   521.00    31.11     2.68    1.21    0.80   43.97   0.43  94.96
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                18.75    0.00    3.12   40.62    0.00   37.50
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      xvdg              0.00     1.00 2430.00   73.00 37996.00   480.00    30.74     1.72    0.69    0.68    0.99   0.35  88.40
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 6.28    0.00    3.14   45.03    0.00   45.55
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      xvdg              0.00     0.00 2285.00    0.00 35184.00     0.00    30.80     1.65    0.72    0.72    0.00   0.40  92.00
      
      avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                 1.57    0.00    3.66   45.55    0.52   48.69
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      xvdg              0.00    81.00 2525.00  136.00 40132.00 16740.00    42.74     9.04    3.40    0.64   54.56   0.36  95.60
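
      (The per-second figures above can also be sampled programmatically for long-running monitoring; a sketch assuming the psutil package and the same xvdg device:)

      import time
      import psutil

      DEVICE = "xvdg"
      prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
      while True:
          time.sleep(1)
          cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
          # Deltas over the 1s window, roughly iostat's r/s, rkB/s and wkB/s.
          print("r/s=%d rkB/s=%.0f wkB/s=%.0f" % (
              cur.read_count - prev.read_count,
              (cur.read_bytes - prev.read_bytes) / 1024.0,
              (cur.write_bytes - prev.write_bytes) / 1024.0))
          prev = cur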
      

      Changing the instance type to an r3.4xlarge (122GB) alleviates the problem, as there is now enough memory for all of the ns files to stay resident (and the server starts in 35 minutes, with the IO subsystem being the limiting factor):

      # ./vmtouch -v /var/lib/mongodb/*.ns | tail -5
      
                 Files: 5572
           Directories: 0
        Resident Pages: 22822912/22822912  87G/87G  100%
               Elapsed: 0.94295 seconds
      

      This isn't a feasible option for us, though: an r3.4xlarge instance costs $1,102 for a 31-day month, compared to $137 for an r3.large, and that difference adds up quickly across a 3-node replica set, as the arithmetic below shows.
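
      (Spelling out that arithmetic, using the per-instance prices quoted above:)

      r3_large, r3_4xlarge, nodes = 137.0, 1102.0, 3   # USD per 31-day month
      print(nodes * r3_large)                  # 411.0  -> ~$411/month on r3.large
      print(nodes * r3_4xlarge)                # 3306.0 -> ~$3,306/month on r3.4xlarge
      print(nodes * (r3_4xlarge - r3_large))   # 2895.0 -> ~$2,895/month extra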

      Assignee: backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
      Reporter: gregmurphy (Greg Murphy)
      Votes: 0
      Watchers: 8
