Core Server / SERVER-21524

Secondary Replication Failure

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: None
    • Affects Version/s: 3.0.6
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL
      NA (unfortunately)

      We have a sharded MongoDB cluster with 3 shards. Each shard contains two active members and an arbiter. Most collections are sharded across this cluster.

      For the past two weeks the secondary in shard 0 has repeatedly ended up stale. Each time this has happened we have attempted to determine the cause of the replication failure, without any luck.
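
      For reference, the lag can also be confirmed independently of the attached graphs by polling replSetGetStatus. Below is a minimal sketch using the MongoDB Java driver against the shard 0 primary listed under "Configuration Setup"; the port (27017) is an assumption.

      {code:java}
      import com.mongodb.MongoClient;
      import com.mongodb.ServerAddress;
      import org.bson.Document;

      import java.util.Date;
      import java.util.List;

      public class ReplicationLagCheck {
          public static void main(String[] args) {
              // Shard 0 primary (see "Configuration Setup" below); port 27017 is assumed.
              MongoClient client = new MongoClient(new ServerAddress("10.0.1.179", 27017));
              try {
                  Document status = client.getDatabase("admin")
                          .runCommand(new Document("replSetGetStatus", 1));

                  Date primaryOptime = null;
                  Date secondaryOptime = null;

                  @SuppressWarnings("unchecked")
                  List<Document> members = (List<Document>) status.get("members");
                  for (Document member : members) {
                      String state = member.getString("stateStr");
                      if ("PRIMARY".equals(state)) {
                          primaryOptime = member.getDate("optimeDate");
                      } else if ("SECONDARY".equals(state)) {
                          secondaryOptime = member.getDate("optimeDate");
                      }
                  }

                  if (primaryOptime != null && secondaryOptime != null) {
                      long lagSeconds = (primaryOptime.getTime() - secondaryOptime.getTime()) / 1000;
                      System.out.println("Approximate replication lag: " + lagSeconds + "s");
                  } else {
                      System.out.println("Secondary is not reporting an optime (e.g. RECOVERING/stale)");
                  }
              } finally {
                  client.close();
              }
          }
      }
      {code}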

      Using the article https://docs.mongodb.org/manual/tutorial/troubleshoot-replica-sets/ we have ruled out the causes of replication failure listed below (evidence included in this bug report).

      All evidence covers the timespan of the last secondary failure which occurred at approx. 2015/11/17 20:23 GMT.

      All times are UTC.

      Network Latency

      See attached RS0-Primary-Metrics.png & RS0-Secondary-Metrics.png

      Network throughput between the instances was lower for the duration of this period than it is during initial sync.

      Disk Throughput

      See attached RS0-Primary-MongoDiskMetrics.png & RS0-Secondary-MongoDiskMetrics.png

      Disk throughput is limited to approx. 750 total ops/s. Neither the primary's disk nor the secondary's disk is reaching this limit.

      Concurrency

      See attached mongod_rs0_primary.log

      There does not appear to be a higher than normal occurrence of slow queries or long-running operations over the period where the secondary becomes stale.
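
      As an additional check (not one we ran during the incident), slow operations can also be captured with the database profiler on the primary rather than from the log. A minimal sketch with the Java driver; the database name "appdb" is a placeholder:

      {code:java}
      import com.mongodb.MongoClient;
      import com.mongodb.ServerAddress;
      import com.mongodb.client.MongoDatabase;
      import org.bson.Document;

      public class SlowOpCheck {
          public static void main(String[] args) {
              // Connect directly to the shard 0 primary; the profiler is per-mongod.
              MongoClient client = new MongoClient(new ServerAddress("10.0.1.179", 27017));
              try {
                  MongoDatabase db = client.getDatabase("appdb"); // placeholder database name

                  // Record operations slower than 100 ms in <db>.system.profile.
                  db.runCommand(new Document("profile", 1).append("slowms", 100));

                  // ... let the workload run for a while, then read back the slowest captured operations.
                  for (Document op : db.getCollection("system.profile")
                          .find(new Document("millis", new Document("$gte", 100)))
                          .sort(new Document("millis", -1))
                          .limit(10)) {
                      System.out.println(op.getString("op") + " on " + op.getString("ns")
                              + " took " + op.getInteger("millis") + " ms");
                  }

                  // Turn the profiler back off when done.
                  db.runCommand(new Document("profile", 0));
              } finally {
                  client.close();
              }
          }
      }
      {code}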

      Appropriate Write Concern

      Our data ingestion rate is relatively consistent and slowly varying over any 24-hour period.

      The write concern used is always WriteConcern.ACKNOWLEDGED.
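
      For completeness, a minimal sketch of how writes are issued with this write concern (Java driver; the mongos host and the database/collection names are placeholders). Note that ACKNOWLEDGED waits only for the primary to apply the write, not for replication to the secondary:

      {code:java}
      import com.mongodb.MongoClient;
      import com.mongodb.ServerAddress;
      import com.mongodb.WriteConcern;
      import com.mongodb.client.MongoCollection;
      import org.bson.Document;

      import java.util.Date;

      public class IngestExample {
          public static void main(String[] args) {
              // "mongos.example.net" is a placeholder for one of our mongos routers.
              MongoClient client = new MongoClient(new ServerAddress("mongos.example.net", 27017));
              try {
                  // Placeholder namespace; the relevant part is the write concern.
                  MongoCollection<Document> events = client.getDatabase("appdb")
                          .getCollection("events")
                          .withWriteConcern(WriteConcern.ACKNOWLEDGED);

                  // ACKNOWLEDGED: the primary acknowledges the write; it does not wait
                  // for the secondary (that would be WriteConcern.W2 or MAJORITY).
                  events.insertOne(new Document("source", "ingest").append("ts", new Date()));
              } finally {
                  client.close();
              }
          }
      }
      {code}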

      Oplog Size

      See attached RS0-Primary-OplogLength.png & RS0-Primary-OplogLength.xlsx

      The oplog length is never less than 30 minutes (equivalent to 30 GB of allocated oplog space).
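
      For reference, the oplog window can be confirmed directly from local.oplog.rs (the same figure rs.printReplicationInfo() reports). A minimal sketch with the Java driver, run against the shard 0 primary:

      {code:java}
      import com.mongodb.MongoClient;
      import com.mongodb.ServerAddress;
      import com.mongodb.client.MongoCollection;
      import org.bson.BsonTimestamp;
      import org.bson.Document;

      public class OplogWindow {
          public static void main(String[] args) {
              // Query the oplog on the shard 0 primary directly.
              MongoClient client = new MongoClient(new ServerAddress("10.0.1.179", 27017));
              try {
                  MongoCollection<Document> oplog =
                          client.getDatabase("local").getCollection("oplog.rs");

                  // Oldest and newest entries, by insertion order.
                  Document oldest = oplog.find().sort(new Document("$natural", 1)).first();
                  Document newest = oplog.find().sort(new Document("$natural", -1)).first();

                  BsonTimestamp first = (BsonTimestamp) oldest.get("ts");
                  BsonTimestamp last = (BsonTimestamp) newest.get("ts");

                  // getTime() is seconds since the epoch.
                  long windowSeconds = (long) last.getTime() - first.getTime();
                  System.out.println("Oplog window: " + (windowSeconds / 60) + " minutes");
              } finally {
                  client.close();
              }
          }
      }
      {code}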

      Configuration Setup

      All instances in the cluster are hosted on AWS using the m3.2xlarge instance size.

      IP Addresses for the different cluster members are as follows:

      Shard 0 (Contains unsharded collections)

      Primary - 10.0.1.179
      Secondary - 10.0.2.145
      Arbiter - 10.0.1.109

      Shard 1

      Primary - 10.0.1.180
      Secondary - 10.0.2.146
      Arbiter - 10.0.1.108

      Shard 2

      Primary - 10.0.1.212
      Secondary - 10.0.2.199
      Arbiter - 10.0.1.246

      The secondaries use the out-of-the-box configuration, i.e. no reads are allowed from the secondaries.
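
      A minimal illustration of the matching client-side default (reads go to the primary unless a different read preference is explicitly configured; the host is a placeholder):

      {code:java}
      import com.mongodb.MongoClient;
      import com.mongodb.MongoClientOptions;
      import com.mongodb.ReadPreference;
      import com.mongodb.ServerAddress;

      public class ReadPrefExample {
          public static void main(String[] args) {
              // ReadPreference.primary() is the driver default; shown explicitly here.
              MongoClientOptions options = MongoClientOptions.builder()
                      .readPreference(ReadPreference.primary())
                      .build();
              MongoClient client = new MongoClient(new ServerAddress("mongos.example.net", 27017), options);
              System.out.println("Read preference: " + client.getReadPreference());
              client.close();
          }
      }
      {code}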

      To date we have been unable to locate the cause of this replication issue, so we are reporting the above as a bug in the hope that you can either explain the behaviour or resolve the problem in an upcoming release.

      For the time being, when replication fails we have simply been performing an initial sync on the failed secondary.

        1. after.png
          154 kB
          Bruce Lucas
        2. before.png
          142 kB
          Bruce Lucas
        3. RS0-Primary-Metrics.png
          156 kB
          Marc Fletcher
        4. RS0-Primary-MongoDiskMetrics.png
          205 kB
          Marc Fletcher
        5. RS0-Primary-OplogLength.png
          99 kB
          Marc Fletcher
        6. RS0-Primary-OplogLength.xlsx
          1.97 MB
          Marc Fletcher
        7. RS0-Secondary-Metrics.png
          132 kB
          Marc Fletcher
        8. RS0-Secondary-MongoDiskMetrics.png
          135 kB
          Marc Fletcher
        9. Secondary-ss-log.png
          175 kB
          Kelsey Schubert
        10. transition.png
          130 kB
          Bruce Lucas

            Assignee:
            Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter:
            Marc Fletcher (MarcF)
            Votes:
            0
            Watchers:
            8
