Type: Bug
Resolution: Duplicate
Priority: Major - P3
Labels: None
Affects Version/s: 3.6.5
Component/s: WiredTiger
Operating System: ALL
Environment:
- Three-server replica sets on AWS (t2.medium) running 3.6.5.
- All writes use MAJORITY write concern.
- Journaling is enabled (default).
Expected behaviour:
The supported recovery methods return the instance to a healthy state.
Observed behaviour:
After an unclean shutdown, a secondary never recovers on its own; it never makes it past the final step in the sample log (see log1.txt).
Clearly the instance had run out of disk space at this point (100GB provisioned for a database that is normally 1.6GB). The contents of the /var/mongodata folder are shown in log2.txt.
So the culprit appears to be entirely the WiredTigerLAS.wt file.
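For reproduction purposes, a quick way to confirm which file dominates the dbpath is to rank files by size; this is a minimal sketch, assuming the dbpath from this report (/var/mongodata):

```shell
# Rank the largest files under the dbpath to confirm that
# WiredTigerLAS.wt is what filled the volume.
DBPATH="${DBPATH:-/var/mongodata}"   # dbpath from this report; adjust as needed
du -ah "$DBPATH"/* 2>/dev/null | sort -rh | head -n 5
```

On the affected node, WiredTigerLAS.wt shows up first, consistent with log2.txt.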
For additional information, the df output at this point:

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.9G   60K  7.9G   1% /dev
tmpfs           7.9G     0  7.9G   0% /dev/shm
/dev/xvda1       20G  2.8G   17G  14% /
/dev/xvdi       100G  100G  140K 100% /var/mongodata
The only option to recover this instance is a full resync after deleting the contents of /var/mongodata (see log3.txt).
The initial sync currently takes less than 60 seconds, but this will obviously not be acceptable once the data set grows.
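The recovery procedure above can be sketched as follows; the systemd unit name "mongod" is an assumption based on a standard package install, and the dbpath is the one from this report:

```shell
# Hedged sketch of the only recovery path that worked here: wipe the
# dbpath and let the node initial-sync from the primary.
DBPATH="${DBPATH:-/var/mongodata}"           # dbpath from this report
systemctl stop mongod 2>/dev/null || true    # stop the stuck secondary (unit name assumed)
rm -rf -- "${DBPATH:?}"/* 2>/dev/null || true  # delete all data files, incl. WiredTigerLAS.wt
systemctl start mongod 2>/dev/null || true   # restart; initial sync repopulates the node
```

The ${DBPATH:?} guard aborts rather than expanding to "/*" if the variable is ever unset.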
Duplicates:
- SERVER-34938 Secondary slowdown or hang due to content pinned in cache by single oplog batch (Closed)
- SERVER-34941 Add testing to cover cases where timestamps cause cache pressure (Closed)

Is related to:
- SERVER-36495 Cache pressure issues during recovery oplog application (Closed)