Core Server / SERVER-17074

Sharded Replicaset - replicas fall behind (3.0.0-rc6)

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.0.0-rc6
    • Component/s: Replication, WiredTiger
    • Environment:
      Centos 6
    • Fully Compatible
    • Linux

      Start a sharded replica set with, say, 2 shards (we have 8), each with 1 primary and 1 secondary.
      Pump in 4k updates/sec (each update is a push/pop on a 4 KB document).
      Watch the secondary show 0 updates/sec in mongostat while the replication delay (via MMS) keeps increasing. Occasionally a large number of updates go through on the secondary and then stop again, but the net replication delay always increases with time.


      We're seeing our secondaries fail to keep up with the primary in a peculiar way.

      Previously we were on 2.6 and replication worked fine; nothing has changed since then except the upgrade to 3.0.0-rc6.

      I see (via mongostat) each primary getting approx. 4k updates/sec, across 8 shards; the secondaries show 0 updates/sec. I stop the secondary's daemon, wipe its data directory, and restart. The resync starts and completes properly, catching up and going into 'SEC' state in mongostat. This lasts only several seconds before updates/sec on the SEC drops to 0. The primary is still at 4k updates/sec.
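      For anyone trying to reproduce this, the replication delay can also be estimated without MMS by comparing member optimes. The following is only a rough sketch: it assumes a status document shaped like the output of the `replSetGetStatus` command (for example as returned by pymongo's `client.admin.command("replSetGetStatus")`), and the host names in the example are hypothetical.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Return {member name: lag in seconds} for every SECONDARY.

    `status` is assumed to look like a replSetGetStatus result: a dict with a
    "members" list whose entries carry "name", "stateStr" and "optimeDate".
    """
    members = status.get("members", [])
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    if not primaries:
        raise ValueError("no PRIMARY in status document")
    primary_optime = primaries[0]["optimeDate"]

    lags = {}
    for member in members:
        if member.get("stateStr") == "SECONDARY":
            delta = primary_optime - member["optimeDate"]
            lags[member["name"]] = delta.total_seconds()
    return lags


# Hypothetical status document: the secondary is 90 seconds behind.
_now = datetime(2015, 1, 26, 14, 2, 48)
EXAMPLE_STATUS = {
    "members": [
        {"name": "shard1-a:27017", "stateStr": "PRIMARY", "optimeDate": _now},
        {
            "name": "shard1-b:27017",
            "stateStr": "SECONDARY",
            "optimeDate": _now - timedelta(seconds=90),
        },
    ]
}
```

      Polling this every few seconds on each shard should show the lag climbing steadily if the behavior described here is present.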

      Logs on the secondaries show many messages like these:

      2015-01-26T14:02:48.942-0600 I QUERY    [conn193] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11777ms
      2015-01-26T14:02:48.942-0600 I QUERY    [conn109] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11717ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn133] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11702ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn206] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11691ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn156] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11681ms
      2015-01-26T14:03:01.363-0600 I NETWORK  [conn218] end connection 10.235.67.65:18027 (113 connections now open)
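      To get a quick count of how many of these slow operations appear in a secondary's log, something like the following could be used. It is a minimal sketch that assumes log lines shaped like the ones quoted above (timestamp, severity, component, bracketed context, message, optional trailing duration in ms); the function names and the 1000 ms threshold are made up for illustration.

```python
import re
from collections import Counter

# <timestamp> <severity> <component> [<context>] <message> [<N>ms]
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<sev>[A-Z])\s+(?P<component>[A-Z]+)\s+"
    r"\[(?P<ctx>[^\]]+)\]\s+(?P<msg>.*?)(?:\s+(?P<ms>\d+)ms)?$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Lines without a trailing duration (e.g. NETWORK messages) get ms=None.
    record["ms"] = int(record["ms"]) if record["ms"] is not None else None
    # Treat the first word of the message as the operation name.
    record["op"] = record["msg"].split()[0] if record["msg"] else ""
    return record


def summarize_slow_ops(lines, threshold_ms=1000):
    """Return {op: (count, worst duration in ms)} for ops at or over the threshold."""
    counts = Counter()
    worst = {}
    for line in lines:
        record = parse_line(line)
        if record is None or record["ms"] is None:
            continue
        if record["ms"] < threshold_ms:
            continue
        counts[record["op"]] += 1
        worst[record["op"]] = max(worst.get(record["op"], 0), record["ms"])
    return {op: (counts[op], worst[op]) for op in counts}
```

      Feeding it the open log file (for example `summarize_slow_ops(open(path))`, with `path` pointing at the secondary's log) gives a per-operation count and the worst duration seen.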
      

      I've updated several times, through rc4, rc5, and rc6, and am now even running the nightly; all show the same behavior.

      Note this is a very write-intensive application. Data is stored on SSDs and journals on spinning disk, but I've tried moving the journals to SSD and it hasn't helped.

            Assignee: Unassigned
            Reporter: Kevin J. Rice (justanyone)
            Votes: 0
            Watchers: 7
