- Type: Bug
- Resolution: Done
- Priority: Blocker - P1
- Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 2.4.8, 2.4.9
- Component/s: Replication
- Operating System: ALL
I've hit this initial sync failure many tens of times on 2.4.x.
Symptoms:
- Secondary clones dbs
- Secondary builds indexes
- Immediately upon completion, secondary resyncs again
- My countenance darkens and my heart sinks into despair
- Sometimes (maybe 10% of the time), it actually finishes initial sync
Setup:
- Replica set with ~1.5TB data
- Biggest collection: ~500GB of data, 80M docs, with 4 indexes of 3GB-5GB each
- 1 Primary, 3 replicas (1 trying to resync), 1 arbiter
- Typically 8k-15k active connections
- AWS, servers spread across availability zones
- Initial sync takes 1-2 days
- Oplog is ~5 days long (example below)
- Seems worse on larger replica sets with bigger data (e.g. one that includes a 1.5TB db with large compound indexes)
# oplog size
rs1:PRIMARY> db.version()
2.4.8
rs1:PRIMARY> db.printReplicationInfo()
configured oplog size:   79149.0830078125MB
log length start to end: 459190secs (127.55hrs)
oplog first event time:  Wed Mar 05 2014 09:25:53 GMT+0000 (UTC)
oplog last event time:   Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
now:                     Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
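For completeness, here is a rough sketch (Python with pymongo; the localhost connection is a placeholder, not one of the real hosts above) of how the same oplog window can be cross-checked outside the shell:

from datetime import datetime
from pymongo import MongoClient

# Placeholder connection; point this at the primary.
client = MongoClient("localhost", 27017)
oplog = client.local["oplog.rs"]

# First and last oplog entries, in natural (insertion) order.
first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()

# 'ts' is a BSON Timestamp; .time is seconds since the epoch.
window_secs = last["ts"].time - first["ts"].time
print("oplog first event time:", datetime.utcfromtimestamp(first["ts"].time))
print("oplog last event time: ", datetime.utcfromtimestamp(last["ts"].time))
print("log length start to end: %d secs (%.2f hrs)" % (window_secs, window_secs / 3600.0))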
The slightly anonymized log below contains the moment where the initial sync fails. The primary is m5. The secondary trying to sync is m4. There is some traffic on the servers, which causes slow responses. Connection spam is filtered out.
Mon Mar 10 16:51:11.048 [rsSync] Index: (1/3) External Sort Progress: 6932500/8429282 82%
Mon Mar 10 16:51:12.389 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:18.390 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:21.013 [rsSync] Index: (1/3) External Sort Progress: 7114000/8429282 84%
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet info m5.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet member m5.example.com:27017 is now in state DOWN
Mon Mar 10 16:51:26.405 [rsHealthPoll] replSet member m5.example.com:27017 is up
Mon Mar 10 16:51:26.406 [rsHealthPoll] replSet member m5.example.com:27017 is now in state PRIMARY
Mon Mar 10 16:51:31.011 [rsSync] Index: (1/3) External Sort Progress: 7288600/8429282 86%
Mon Mar 10 16:51:41.011 [rsSync] Index: (1/3) External Sort Progress: 7488900/8429282 88%
Mon Mar 10 16:51:47.191 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:51.026 [rsSync] Index: (1/3) External Sort Progress: 7619400/8429282 90%
Mon Mar 10 16:51:53.192 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:54.980 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:52:01.057 [rsSync] Index: (1/3) External Sort Progress: 7717000/8429282 91%
Mon Mar 10 16:52:11.568 [rsSync] Index: (1/3) External Sort Progress: 7825800/8429282 92%
Mon Mar 10 16:52:21.009 [rsSync] Index: (1/3) External Sort Progress: 7940600/8429282 94%
Mon Mar 10 16:52:31.007 [rsSync] Index: (1/3) External Sort Progress: 8049300/8429282 95%
Mon Mar 10 16:52:41.003 [rsSync] Index: (1/3) External Sort Progress: 8151500/8429282 96%
Mon Mar 10 16:52:51.008 [rsSync] Index: (1/3) External Sort Progress: 8288500/8429282 98%
Mon Mar 10 16:52:59.235 [rsSync] external sort used : 9 files in 1038 secs
Mon Mar 10 16:53:09.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 1120400/8429282 13%
Mon Mar 10 16:53:19.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 2620200/8429282 31%
Mon Mar 10 16:53:29.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 4067100/8429282 48%
Mon Mar 10 16:53:39.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 5539300/8429282 65%
Mon Mar 10 16:53:49.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 6962200/8429282 82%
Mon Mar 10 16:53:58.343 [rsSync] done building bottom layer, going to commit
Mon Mar 10 16:54:00.684 [rsSync] build index done. scanned 8429282 total records. 1100.07 secs
Mon Mar 10 16:54:00.684 [rsSync] build index cxxx_pxxx.day { stat: 1.0 }
Mon Mar 10 16:54:00.721 [rsSync] build index done. scanned 0 total records. 0.037 secs
Mon Mar 10 16:54:00.721 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0 }
Mon Mar 10 16:54:00.723 [rsSync] build index done. scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.723 [rsSync] build index cxxx_pxxx.day { bucket: 1.0 }
Mon Mar 10 16:54:00.724 [rsSync] build index done. scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.724 [rsSync] build index cxxx_pxxx.day { bucket: -1.0 }
Mon Mar 10 16:54:00.747 [rsSync] build index done. scanned 0 total records. 0.023 secs
Mon Mar 10 16:54:00.747 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: 1.0 }
Mon Mar 10 16:54:00.749 [rsSync] build index done. scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.749 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: -1.0 }
Mon Mar 10 16:54:00.750 [rsSync] build index done. scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.750 [rsSync] replSet initial sync cloning indexes for : clxxxx
Mon Mar 10 16:54:00.860 [rsSync] Socket say send() errno:110 Connection timed out m5.example.com:27017
Mon Mar 10 16:54:00.923 [rsSync] replSet initial sync exception: 9001 socket exception [SEND_ERROR] server [m5.example.com:27017] 9 attempts remaining
Mon Mar 10 16:54:17.663 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:17.663 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:26.664 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:30.941 [rsSync] replSet initial sync pending
Mon Mar 10 16:54:30.941 [rsSync] replSet syncing to: m5.example.com:27017
Mon Mar 10 16:54:31.042 [rsSync] replSet initial sync drop all databases
Mon Mar 10 16:54:31.061 [rsSync] dropAllDatabasesExceptLocal 19
Mon Mar 10 16:54:31.123 [rsSync] removeJournalFiles
Mon Mar 10 16:54:48.391 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.391 [rsHealthPoll] replset info m9.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:48.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.923 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet info m9.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet member m9.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:49.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:53.217 [rsSync] removeJournalFiles
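If it helps, a rough scan like the one below (Python; the log path is an assumption, and the matched strings are taken from the excerpt above) can count how often the initial sync bailed out and restarted on the secondary:

import sys
from collections import Counter

# Path is an assumption; pass the real secondary log as the first argument.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/mongodb/mongod.log"

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if "replSet initial sync exception" in line:
            counts["initial sync exceptions"] += 1
        elif "replSet initial sync drop all databases" in line:
            counts["initial sync restarts (drop all databases)"] += 1
        elif "Socket say send() errno:110" in line:
            counts["send timeouts to the sync source"] += 1

for label, n in counts.items():
    print("%s: %d" % (label, n))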
What more can I share that would help debug this problem? Does it seem like initial sync is sensitive to high load?
- Related to: SERVER-21068 Ability to resume initial sync attempts (Closed)