Type: Bug
Resolution: Done
Priority: Critical - P2
Affects Version/s: 3.0.6
Component/s: Replication
Operating System: ALL
We have a sharded MongoDB cluster with 3 shards. Each shard is a replica set containing two data-bearing members and an arbiter. Most collections are sharded across this cluster.
For the past two weeks the secondary in shard 0 has repeatedly gone stale. Each time this has happened we have attempted to determine the cause of the replication failure, without any luck.
Using the article https://docs.mongodb.org/manual/tutorial/troubleshoot-replica-sets/ we have ruled out the causes of replication failure listed below (evidence included in this bug report).
All evidence covers the timespan of the last secondary failure, which occurred at approx. 2015/11/17 20:23 GMT.
All times are UTC.
Network Latency
See attached RS0-Primary-Metrics.png & RS0-Secondary-Metrics.png
Network throughput between the instances was lower for the duration of this period than it is during initial sync.
Disk Throughput
See attached RS0-Primary-MongoDiskMetrics.png & RS0-Secondary-MongoDiskMetrics.png
Disk throughput is limited to approx. 750 total ops/s. Neither the primary's disk nor the secondary's is reaching this limit.
Concurrency
See attached mongod_rs0_primary.log
There does not appear to be a higher than normal occurrence of slow queries or long-running operations over the period where the secondary becomes stale.
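For reference, one way to count slow operations in the attached log is sketched below. This is an illustrative, throwaway utility rather than a MongoDB tool (the 1 s threshold is arbitrary); it relies only on mongod appending the duration in milliseconds to log lines for operations slower than slowOpThresholdMs (100 ms by default).

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class SlowOpCount {
    public static void main(String[] args) throws IOException {
        // Slow operations are logged with their duration at the end of the line, e.g. "... 1204ms".
        Pattern duration = Pattern.compile("(\\d+)ms$");
        try (Stream<String> lines = Files.lines(Paths.get("mongod_rs0_primary.log"))) {
            long slowOps = lines
                    .map(duration::matcher)
                    .filter(Matcher::find)
                    // Arbitrary 1 s cut-off; mongod itself already filters at slowOpThresholdMs.
                    .filter(m -> Long.parseLong(m.group(1)) >= 1000)
                    .count();
            System.out.println("Operations taking >= 1s: " + slowOps);
        }
    }
}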
Appropriate Write Concern
Our data ingestion rate is relatively consistent and slowly varying over any 24-hour period.
The write concern used is always WriteConcern.ACKNOWLEDGED.
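To illustrate, below is a minimal sketch of a write issued with WriteConcern.ACKNOWLEDGED as described above; the connection string, database and collection names are placeholders, not our real ones, and a current synchronous Java driver is assumed.

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Date;

public class IngestWrite {
    public static void main(String[] args) {
        // All writes go through mongos; host, database and collection names here are placeholders.
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoCollection<Document> events = client.getDatabase("ingest")
                    .getCollection("events")
                    // ACKNOWLEDGED (w:1): the primary confirms the write; replication is not waited on.
                    .withWriteConcern(WriteConcern.ACKNOWLEDGED);
            events.insertOne(new Document("receivedAt", new Date()).append("payload", "example"));
        }
    }
}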
OpLog Size
See attached RS0-Primary-OplogLength.png & RS0-Primary-OplogLength.xlsx
The oplog length is never less than 30 minutes (with 30 GB of oplog space allocated).
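For reference, the oplog window can be measured directly by comparing the newest and oldest entries in local.oplog.rs on the primary. Below is a minimal sketch of such a check, assuming a current synchronous Java driver; 10.0.1.179 is the shard 0 primary listed under Configuration Setup below.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.BsonTimestamp;
import org.bson.Document;

public class OplogWindow {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://10.0.1.179:27017")) {
            MongoCollection<Document> oplog =
                    client.getDatabase("local").getCollection("oplog.rs");
            // Oldest and newest oplog entries, in natural (insertion) order.
            Document first = oplog.find().sort(new Document("$natural", 1)).limit(1).first();
            Document last = oplog.find().sort(new Document("$natural", -1)).limit(1).first();
            long windowSeconds = last.get("ts", BsonTimestamp.class).getTime()
                    - first.get("ts", BsonTimestamp.class).getTime();
            System.out.println("Oplog window: " + (windowSeconds / 60) + " minutes");
        }
    }
}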
Configuration Setup
All instances in the cluster are hosted on AWS using the m3.2xlarge instance size.
IP Addresses for the different cluster members are as follows:
Shard 0 (Contains unsharded collections)
Primary - 10.0.1.179
Secondary - 10.0.2.145
Arbiter - 10.0.1.109
Shard 1
Primary - 10.0.1.180
Secondary - 10.0.2.146
Arbiter - 10.0.1.108
Shard 2
Primary - 10.0.1.212
Secondary - 10.0.2.199
Arbiter - 10.0.1.246
The secondaries use the out-of-the-box configuration, i.e. no reads are allowed from the secondaries.
To date we have been unable to locate the cause of this replication issue. As such we would like to report the above as a bug in the hope that you might be able to either explain the issue or resolve the problem in an upcoming release.
For the time being, when replication fails we simply perform an initial sync on the failed secondary.
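For reference, below is a minimal sketch of how a stale secondary can be spotted via replSetGetStatus before resyncing, assuming a current synchronous Java driver and pointing at the shard 0 primary; a member that has fallen off the oplog typically reports a RECOVERING state until it is resynced.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

import java.util.List;

public class Rs0HealthCheck {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://10.0.1.179:27017")) {
            Document status = client.getDatabase("admin")
                    .runCommand(new Document("replSetGetStatus", 1));
            // Print each member's name and state; a stale secondary shows up as RECOVERING.
            List<Document> members = status.getList("members", Document.class);
            for (Document member : members) {
                System.out.println(member.getString("name") + " -> " + member.getString("stateStr"));
            }
        }
    }
}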