Type: Bug
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: None
Component/s: None
Operating System: ALL
We upgraded a secondary of a 3-node cluster to 3.2.9.
By default, when we upgrade we use iptables to allow replication but block clients.
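For reference, the rules are roughly along these lines; the member addresses and port below are illustrative placeholders, not our exact configuration:
# Allow replication traffic from the other replica set members (addresses are placeholders)
iptables -A INPUT -p tcp --dport 27017 -s 10.0.0.11 -j ACCEPT
iptables -A INPUT -p tcp --dport 27017 -s 10.0.0.12 -j ACCEPT
# Reject everything else reaching mongod, i.e. application clients
iptables -A INPUT -p tcp --dport 27017 -j REJECT
# Removing the REJECT rule re-admits clients after the upgrade
iptables -D INPUT -p tcp --dport 27017 -j REJECT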
Upon allowing clients, cache usage went up to ~96% and failed to drop. Only 1 of 16 cores appeared to be in use.
Blocking clients, restarting, and allowing only replication let the oplog catch up, but over time the cache still fills and performance hits rock bottom.
mongostat:
insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn   set repl                 time
    *6    *0    *14     *1       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.23k   127k   16 floow  SEC 2016-09-10T20:18:27Z
    *1    *0    *21     *2       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.07k   127k   16 floow  SEC 2016-09-10T20:18:28Z
    *0    *0     *0     *0       0     9|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16  917b  93.2k   16 floow  SEC 2016-09-10T20:18:29Z
    *9    *0    *29     *1       0    12|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.01k   126k   16 floow  SEC 2016-09-10T20:18:30Z
    *2    *0     *4     *1       0    13|0     1.3   96.0       0 21.9G 21.1G   0|0  0|16 1.17k   126k   16 floow  SEC 2016-09-10T20:18:31Z
   *24    *0   *161    *10       0    14|0     1.3   96.0       0 21.9G 21.1G   0|0  0|15 1.13k   127k   16 floow  SEC 2016-09-10T20:18:32Z
iostat:
09/10/2016 08:18:51 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.51    0.00    0.06    0.13    0.00   93.30

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
xvda       0.00    0.00  22.00   0.00   92.00    0.00     8.36     0.04   2.00    2.00    0.00   0.18   0.40
xvdh       0.00   99.00  21.00  17.00  260.00 1292.00    81.68     0.06   1.47    0.57    2.59   0.63   2.40
xvdz       0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
replication status (command took 10-15min to return):
db-node2(mongod-3.2.9)[SECONDARY:floow] test> rs.printReplicationInfo()
configured oplog size:   614400MB
log length start to end: 958414secs (266.23hrs)
oplog first event time:  Tue Aug 30 2016 14:06:44 GMT+0000 (UTC)
oplog last event time:   Sat Sep 10 2016 16:20:18 GMT+0000 (UTC)
now:                     Sat Sep 10 2016 20:25:01 GMT+0000 (UTC)
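The cache figures reported by mongostat can be cross-checked directly against serverStatus; a minimal sketch, run from a shell against the affected secondary (host and port are illustrative):
mongo --host db-node2 --port 27017 admin --eval '
  var c = db.serverStatus().wiredTiger.cache;
  print("maximum bytes configured:         " + c["maximum bytes configured"]);
  print("bytes currently in the cache:     " + c["bytes currently in the cache"]);
  print("tracked dirty bytes in the cache: " + c["tracked dirty bytes in the cache"]);
'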
Upon restart (which often takes ages), replication catches up, but then the cache fills and the scenario repeats.
Note: the other nodes are still running 3.0.
I also experimented with changing WiredTiger (WT) parameters, with no joy.
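For context, the kind of change tried was reconfiguring WiredTiger eviction at runtime; a sketch with illustrative values, not necessarily the exact ones we used:
# Adjust eviction thread counts and the dirty-cache trigger on the running mongod
mongo --host db-node2 admin --eval '
  db.adminCommand({
    setParameter: 1,
    wiredTigerEngineRuntimeConfig: "eviction=(threads_min=4,threads_max=8),eviction_dirty_trigger=10"
  })
'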
We will downgrade, but we are leaving this node at 3.2.9 (with low priority) for now to allow for diagnostics and logs if required.
With 3.0 we still have cache-filling issues, but they only occur once or twice a month. With our workload, MMAPv1 was pretty much maintenance-free (very stable, minimal issues except the disk usage), 3.0 WT causes some pain but is manageable, and 3.2 WT is unusable.
duplicates:
- SERVER-25974: Application threads stall for extended period when cache fills (Closed)