  Core Server / SERVER-13117

Secondary repeatedly fails to sync

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker - P1
    • Fix Version/s: None
    • Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 2.4.8, 2.4.9
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      I've gotten this error many tens of times on 2.4.x

      Symptoms:

      • Secondary clones dbs
      • Secondary builds indexes
      • Immediately upon completion, secondary resyncs again
      • My countenance darkens and my heart sinks into despair
      • Sometimes (maybe 10% of the time), it actually finishes initial sync

      Setup

      • Replica set with ~1.5TB data
      • Biggest collection ~500GB data, 80M docs with 4 indexes 3GB-5GB each
      • 1 Primary, 3 replicas (1 trying to resync), 1 arbiter
      • Typically 8k-15k active connections
      • AWS, servers spread across availability zones
      • Initial sync takes 1-2 days
      • Oplog is ~5 days long (example below; a rough window check is sketched after it)
      • Seems worse on larger replica sets with bigger data (e.g. one that includes a 1.5TB db with large compound indexes)
      # oplog size
      rs1:PRIMARY> db.version()
      2.4.8
      rs1:PRIMARY> db.printReplicationInfo()
      configured oplog size:   79149.0830078125MB
      log length start to end: 459190secs (127.55hrs)
      oplog first event time:  Wed Mar 05 2014 09:25:53 GMT+0000 (UTC)
      oplog last event time:   Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
      now:                     Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
      
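      For reference, a rough way to check that the oplog window actually covers the 1-2 day initial sync time, using the standard db.getReplicationInfo() shell helper on the primary. The 48-hour figure is just the worst case quoted above, not a measured value.

      // run on the primary in the mongo shell
      var info = db.getReplicationInfo();      // { logSizeMB, timeDiffHours, tFirst, tLast, ... }
      var assumedSyncHours = 48;               // assumed worst-case initial sync duration
      print("oplog window (hrs): " + info.timeDiffHours);
      print("covers a " + assumedSyncHours + "h initial sync: " + (info.timeDiffHours > assumedSyncHours));
      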

      The slightly anonymized log below contains the moment where the initial sync fails. The primary is m5. The secondary trying to sync is m4. There is some traffic on the servers, which causes slow responses. Connection spam is filtered out.

      Mon Mar 10 16:51:11.048 [rsSync]                Index: (1/3) External Sort Progress: 6932500/8429282    82%
      Mon Mar 10 16:51:12.389 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
      Mon Mar 10 16:51:18.390 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
      Mon Mar 10 16:51:21.013 [rsSync]                Index: (1/3) External Sort Progress: 7114000/8429282    84%
      Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet info m5.example.com:27017 is down (or slow to respond):
      Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet member m5.example.com:27017 is now in state DOWN
      Mon Mar 10 16:51:26.405 [rsHealthPoll] replSet member m5.example.com:27017 is up
      Mon Mar 10 16:51:26.406 [rsHealthPoll] replSet member m5.example.com:27017 is now in state PRIMARY
      Mon Mar 10 16:51:31.011 [rsSync]                Index: (1/3) External Sort Progress: 7288600/8429282    86%
      Mon Mar 10 16:51:41.011 [rsSync]                Index: (1/3) External Sort Progress: 7488900/8429282    88%
      Mon Mar 10 16:51:47.191 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
      Mon Mar 10 16:51:51.026 [rsSync]                Index: (1/3) External Sort Progress: 7619400/8429282    90%
      Mon Mar 10 16:51:53.192 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
      Mon Mar 10 16:51:54.980 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
      Mon Mar 10 16:52:01.057 [rsSync]                Index: (1/3) External Sort Progress: 7717000/8429282    91%
      Mon Mar 10 16:52:11.568 [rsSync]                Index: (1/3) External Sort Progress: 7825800/8429282    92%
      Mon Mar 10 16:52:21.009 [rsSync]                Index: (1/3) External Sort Progress: 7940600/8429282    94%
      Mon Mar 10 16:52:31.007 [rsSync]                Index: (1/3) External Sort Progress: 8049300/8429282    95%
      Mon Mar 10 16:52:41.003 [rsSync]                Index: (1/3) External Sort Progress: 8151500/8429282    96%
      Mon Mar 10 16:52:51.008 [rsSync]                Index: (1/3) External Sort Progress: 8288500/8429282    98%
      Mon Mar 10 16:52:59.235 [rsSync]         external sort used : 9 files  in 1038 secs
      Mon Mar 10 16:53:09.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 1120400/8429282  13%
      Mon Mar 10 16:53:19.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 2620200/8429282  31%
      Mon Mar 10 16:53:29.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 4067100/8429282  48%
      Mon Mar 10 16:53:39.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 5539300/8429282  65%
      Mon Mar 10 16:53:49.003 [rsSync]                Index: (2/3) BTree Bottom Up Progress: 6962200/8429282  82%
      Mon Mar 10 16:53:58.343 [rsSync]         done building bottom layer, going to commit
      Mon Mar 10 16:54:00.684 [rsSync] build index done.  scanned 8429282 total records. 1100.07 secs
      Mon Mar 10 16:54:00.684 [rsSync] build index cxxx_pxxx.day { stat: 1.0 }
      Mon Mar 10 16:54:00.721 [rsSync] build index done.  scanned 0 total records. 0.037 secs
      Mon Mar 10 16:54:00.721 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0 }
      Mon Mar 10 16:54:00.723 [rsSync] build index done.  scanned 0 total records. 0.001 secs
      Mon Mar 10 16:54:00.723 [rsSync] build index cxxx_pxxx.day { bucket: 1.0 }
      Mon Mar 10 16:54:00.724 [rsSync] build index done.  scanned 0 total records. 0 secs
      Mon Mar 10 16:54:00.724 [rsSync] build index cxxx_pxxx.day { bucket: -1.0 }
      Mon Mar 10 16:54:00.747 [rsSync] build index done.  scanned 0 total records. 0.023 secs
      Mon Mar 10 16:54:00.747 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: 1.0 }
      Mon Mar 10 16:54:00.749 [rsSync] build index done.  scanned 0 total records. 0.001 secs
      Mon Mar 10 16:54:00.749 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: -1.0 }
      Mon Mar 10 16:54:00.750 [rsSync] build index done.  scanned 0 total records. 0 secs
      Mon Mar 10 16:54:00.750 [rsSync] replSet initial sync cloning indexes for : clxxxx
      Mon Mar 10 16:54:00.860 [rsSync] Socket say send() errno:110 Connection timed out m5.example.com:27017
      Mon Mar 10 16:54:00.923 [rsSync] replSet initial sync exception: 9001 socket exception [SEND_ERROR] server [m5.example.com:27017]  9 attempts remaining
      Mon Mar 10 16:54:17.663 [rsHealthPoll] DBClientCursor::init call() failed
      Mon Mar 10 16:54:17.663 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
      Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
      Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
      Mon Mar 10 16:54:26.664 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
      Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is up
      Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
      Mon Mar 10 16:54:30.941 [rsSync] replSet initial sync pending
      Mon Mar 10 16:54:30.941 [rsSync] replSet syncing to: m5.example.com:27017
      Mon Mar 10 16:54:31.042 [rsSync] replSet initial sync drop all databases
      Mon Mar 10 16:54:31.061 [rsSync] dropAllDatabasesExceptLocal 19
      Mon Mar 10 16:54:31.123 [rsSync] removeJournalFiles
      Mon Mar 10 16:54:48.391 [rsHealthPoll] DBClientCursor::init call() failed
      Mon Mar 10 16:54:48.391 [rsHealthPoll] replset info m9.example.com:27017 heartbeat failed, retrying
      Mon Mar 10 16:54:48.923 [rsHealthPoll] DBClientCursor::init call() failed
      Mon Mar 10 16:54:48.923 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
      Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet info m9.example.com:27017 is down (or slow to respond):
      Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet member m9.example.com:27017 is now in state DOWN
      Mon Mar 10 16:54:49.923 [rsHealthPoll] DBClientCursor::init call() failed
      Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
      Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
      Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is up
      Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
      Mon Mar 10 16:54:53.217 [rsSync] removeJournalFiles
      

      What more can I share that would help debug this problem? Does it seem like initial sync is sensitive to high load?
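      If it would help, I could also keep polling member state on the syncing secondary so the moment of the resync lines up with the log timestamps. A rough sketch using rs.status() (the 10-second interval is arbitrary):

      // poll replica set member state so state changes can be correlated
      // with the server log around the failure (rough sketch)
      while (true) {
          var s = rs.status();
          s.members.forEach(function (m) {
              print(new Date().toISOString() + "  " + m.name + "  " + m.stateStr +
                    (m.pingMs !== undefined ? "  ping=" + m.pingMs + "ms" : ""));
          });
          sleep(10 * 1000);                    // sleep() takes milliseconds in the mongo shell
      }
      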

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Ryan Witt (ryan@incrowdads.com)
            Votes: 0
            Watchers: 3
