Type: Bug
Resolution: Won't Fix
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.0.12
Component/s: WiredTiger
Labels:
- caching
- wiredtiger
Environment:

Ubuntu 16.04.6 LTS
Linux scorpius 4.15.0-58-generic #64~16.04.1-Ubuntu SMP Wed Aug 7 14:10:35 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

db version v4.0.12
git version: 5776e3cbf9e7afe86e6b29e22520ffb6766e95d4
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1604
    distarch: x86_64
    target_arch: x86_64

2xIntel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz

              total        used        free      shared  buff/cache   available
Mem:           125G        372M         61G        2.5M         64G        124G
Swap:           29G          0B         29G


Assigned Teams:

Replication
Operating System:
ALL
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

After restarting a stale slaveDelay node, the learned higher OpTime in the same term can be forgotten. If the commit point is in a higher term on all other nodes, the slaveDelay node can only advance its commit point on getMore responses due to ~~SERVER-39831~~. On slaveDelay nodes, the buffer is likely already full, so the applier has to apply 16MB worth of oplog entries to make room for bgsync to insert the last fetched batch and issue another getMore. Applying 16MB of oplog entries may be enough to trigger memory pressure, causing evictions.

The issue will resolve when the slaveDelay node starts to apply oplog entries from the latest term. Memory pressure and evictions on slaveDelay nodes are undesired but not harmful.

The same issue can happen without restart. Let's say an election happens in term 8 at time T0, but the node delays by 5 days and is still applying entries from term 7. At T0 + 2 days, another election occurs in term 9. Now the commit point is in term 9. At T0 + 5 days, when the delayed node starts to apply entries in term 8, it cannot advance its commit point beyond its last applied. Eventually, when the node starts to apply entries from term 9, everything's fine again.
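The commit point rule described above can be sketched as a small model. This is a simplified illustration under stated assumptions, not the server's actual code: `OpTime`, `next_commit_point`, and the `via_getmore` flag are hypothetical names, and the timestamps are made-up integers. It assumes a learned commit point is adopted as-is only when it is in the same term as the node's last applied entry, and that a getMore response otherwise lets the node advance no further than its last applied entry.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Hypothetical stand-in for a replication OpTime; compares as (term, ts)."""
    term: int
    ts: int


def next_commit_point(current, last_applied, learned, via_getmore):
    """Return the commit point after learning `learned` from another node.

    Assumed rule (a simplification of the behaviour described in the ticket):
    - same term as last applied: the learned commit point can be adopted;
    - different term, learned via getMore: cap it at last applied;
    - different term, learned any other way: ignore it.
    The commit point never moves backwards.
    """
    if learned.term == last_applied.term:
        candidate = learned
    elif via_getmore:
        candidate = min(learned, last_applied)
    else:
        return current
    return max(current, candidate)


# Delayed node still applying term 7 while the set has committed in term 9.
current = OpTime(7, 90)
last_applied = OpTime(7, 100)
cluster_commit = OpTime(9, 500)

stuck = next_commit_point(current, last_applied, cluster_commit, via_getmore=False)
capped = next_commit_point(current, last_applied, cluster_commit, via_getmore=True)

# Once the node applies entries from term 9, the commit point moves freely again.
caught_up = next_commit_point(capped, OpTime(9, 510), OpTime(9, 520), via_getmore=False)
```

In this model the commit point only creeps forward with the applier until the node reaches the latest term, which matches the window in which the ticket reports memory pressure.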

=======================================
Original title and description:
WT eviction threads consume a lot of CPU even when there is no apparent cache pressure

After upgrading from 3.6 to 4.0.12 we encountered excessively high CPU consumption on our slave-delayed hidden replica set member. Restarting the member doesn't help: the CPU consumption goes down, but then goes back up after some time.
We recorded some logs, perf traces and statistics snapshots, see attached files. Also included are FTDC files for the relevant interval and some graphs from our monitoring system.

"Before" means before the CPU spike; "after" means after it (the spike occurred at about 15:47:31 +/- 5s).

When CPU consumption is high, `perf report` shows about 96% of the time being spent in `__wt_evict` (see `mongod-after.perf.txt` and `mongod-after.perf.data`). This coincides with the `cache overflow score` metric jumping from 0 to 100 (see `caches-before.log` and `caches-after.log`), despite `bytes currently in the cache` (5703522791) being much smaller than `maximum bytes configured` (8589934592).
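To put the two cache figures in proportion, here is a minimal sketch that computes the fill ratio from a dictionary shaped like the `wiredTiger.cache` section of `serverStatus`. The helper name `cache_fill_percent` is hypothetical. The 80% comparison assumes WiredTiger's default `eviction_target`, which this deployment may have overridden.

```python
def cache_fill_percent(cache_stats):
    """Percentage of the configured WiredTiger cache currently in use."""
    used = cache_stats["bytes currently in the cache"]
    limit = cache_stats["maximum bytes configured"]
    return 100.0 * used / limit


# Values quoted in the report above.
reported = {
    "bytes currently in the cache": 5703522791,
    "maximum bytes configured": 8589934592,
}

# Assumption: default eviction_target of 80% of the cache size.
DEFAULT_EVICTION_TARGET_PERCENT = 80

fill = cache_fill_percent(reported)
below_target = fill < DEFAULT_EVICTION_TARGET_PERCENT
print(f"cache fill: {fill:.1f}% (below default eviction target: {below_target})")
```

At roughly two thirds full, the cache is below that default target, which is why the eviction activity looked unexpected to the reporter.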

This is a hidden delayed secondary, so there should be next to no load except replicating writes which are pretty low-volume. Before upgrading to 4.0 we did not have any issues regarding this service.


scorpius-cache.png
Aug 26 2019 02:28:44 PM UTC
24 kB
Aristarkh Zagorodnikov
scorpius-cpu.png
Aug 26 2019 02:28:44 PM UTC
24 kB
Aristarkh Zagorodnikov
scorpius-cpu2.png
Aug 26 2019 02:28:44 PM UTC
35 kB
Aristarkh Zagorodnikov
caches.log
Aug 26 2019 02:29:07 PM UTC
1.41 MB
Aristarkh Zagorodnikov
metrics.2019-08-16T12-02-49Z-00000
Aug 26 2019 02:29:07 PM UTC
644 kB
Aristarkh Zagorodnikov
mongod.log
Aug 26 2019 02:29:08 PM UTC
296 kB
Aristarkh Zagorodnikov
mongod-before.perf.data
Aug 26 2019 02:29:10 PM UTC
1.74 MB
Aristarkh Zagorodnikov
mongod-after.perf.data
Aug 26 2019 02:29:24 PM UTC
27.98 MB
Aristarkh Zagorodnikov
mongodb-driveFS-files-1.conf
Aug 26 2019 02:30:43 PM UTC
0.6 kB
Aristarkh Zagorodnikov
mongod-after.perf.txt
Aug 26 2019 02:30:57 PM UTC
30 kB
Aristarkh Zagorodnikov
mongod-after.perf.txt
Aug 26 2019 02:31:00 PM UTC
30 kB
Aristarkh Zagorodnikov
mongo-metrics.zip
Sep 13 2019 09:18:22 AM UTC
75.15 MB
Aristarkh Zagorodnikov
scorpius-cpu-new.png
Sep 13 2019 09:20:09 AM UTC
52 kB
Aristarkh Zagorodnikov
metrics-shard-primary.zip
Sep 13 2019 03:41:30 PM UTC
90.98 MB
Aristarkh Zagorodnikov
metrics-csrs-primary.zip
Sep 13 2019 03:41:59 PM UTC
83.09 MB
Aristarkh Zagorodnikov
mongod-new-logs.zip
Sep 13 2019 04:01:39 PM UTC
4.95 MB
Aristarkh Zagorodnikov
mongod-new-logs-2.zip
Sep 17 2019 09:21:32 AM UTC
555 kB
Aristarkh Zagorodnikov
metrics-csrs-primary-2.zip
Sep 17 2019 09:22:22 AM UTC
59.64 MB
Aristarkh Zagorodnikov
Screen Shot 2019-09-24 at 1.23.47 PM.png
Sep 24 2019 05:23:55 PM UTC
188 kB
Danny Hatcher
overview.png
Sep 25 2019 06:20:10 PM UTC
305 kB
Bruce Lucas
zoom.png
Sep 25 2019 06:20:14 PM UTC
258 kB
Bruce Lucas
Screen Shot 2019-10-09 at 12.47.34 PM.png
Oct 09 2019 04:48:19 PM UTC
282 kB
Danny Hatcher

is related to

SERVER-43632 Possible memory leak in 4.0

Closed

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Aristarkh Zagorodnikov
Participants: [DO NOT USE] Backlog - Replication Team, Aristarkh Zagorodnikov, Bruce Lucas, Danny Hatcher, Tess Avitabile
Votes: 1
Watchers: 15

Created: Aug 26 2019 02:41:29 PM UTC
Updated: Dec 06 2022 02:49:41 AM UTC
Resolved: Jan 03 2020 09:01:03 PM UTC
