- Type: Bug
- Resolution: Done
- Priority: Blocker - P1
- Affects Version/s: 2.4.5, 2.4.6, 2.4.7, 2.4.8, 2.4.9
- Component/s: Replication
- Operating System: ALL
I've hit this initial sync failure many tens of times on 2.4.x.
Symptoms:
- Secondary clones dbs
- Secondary builds indexes
- Immediately upon completion, secondary resyncs again
- My countenance darkens and my heart sinks into despair
- Sometimes (maybe 10% of the time), it actually finishes initial sync
Setup:
- Replica set with ~1.5TB data
- Biggest collection: ~500GB of data, 80M docs, with 4 indexes of 3GB-5GB each
- 1 Primary, 3 replicas (1 trying to resync), 1 arbiter
- Typically 8k-15k active connections
- AWS, servers spread across availability zones
- Initial sync takes 1-2 days
- Oplog is ~5 days long (example below)
- Seems worse on larger replica sets with bigger data (e.g. one that includes a 1.5TB db with large compound indexes)
# oplog size
rs1:PRIMARY> db.version()
2.4.8
rs1:PRIMARY> db.printReplicationInfo()
configured oplog size:   79149.0830078125MB
log length start to end: 459190secs (127.55hrs)
oplog first event time:  Wed Mar 05 2014 09:25:53 GMT+0000 (UTC)
oplog last event time:   Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
now:                     Mon Mar 10 2014 16:59:03 GMT+0000 (UTC)
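For completeness, here is a rough sketch (Python with pymongo; the localhost connection is a placeholder, not one of the real hosts above) of how the same oplog window can be cross-checked outside the shell:

from datetime import datetime
from pymongo import MongoClient

# Placeholder connection; point this at the primary.
client = MongoClient("localhost", 27017)
oplog = client.local["oplog.rs"]

# First and last oplog entries, in natural (insertion) order.
first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()

# 'ts' is a BSON Timestamp; .time is seconds since the epoch.
window_secs = last["ts"].time - first["ts"].time
print("oplog first event time:", datetime.utcfromtimestamp(first["ts"].time))
print("oplog last event time: ", datetime.utcfromtimestamp(last["ts"].time))
print("log length start to end: %d secs (%.2f hrs)" % (window_secs, window_secs / 3600.0))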
The slightly anonymized log below contains the moment where the initial sync fails. The primary is m5. The secondary trying to sync is m4. There is some traffic on the servers, which causes slow responses. Connection spam is filtered out.
Mon Mar 10 16:51:11.048 [rsSync] Index: (1/3) External Sort Progress: 6932500/8429282 82%
Mon Mar 10 16:51:12.389 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:18.390 [rsHealthPoll] couldn't connect to m5.example.com:27017: couldn't connect to server m5.example.com:27017
Mon Mar 10 16:51:21.013 [rsSync] Index: (1/3) External Sort Progress: 7114000/8429282 84%
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet info m5.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:51:24.391 [rsHealthPoll] replSet member m5.example.com:27017 is now in state DOWN
Mon Mar 10 16:51:26.405 [rsHealthPoll] replSet member m5.example.com:27017 is up
Mon Mar 10 16:51:26.406 [rsHealthPoll] replSet member m5.example.com:27017 is now in state PRIMARY
Mon Mar 10 16:51:31.011 [rsSync] Index: (1/3) External Sort Progress: 7288600/8429282 86%
Mon Mar 10 16:51:41.011 [rsSync] Index: (1/3) External Sort Progress: 7488900/8429282 88%
Mon Mar 10 16:51:47.191 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:51.026 [rsSync] Index: (1/3) External Sort Progress: 7619400/8429282 90%
Mon Mar 10 16:51:53.192 [rsHealthPoll] couldn't connect to m8.example.com:27017: couldn't connect to server m8.example.com:27017
Mon Mar 10 16:51:54.980 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:52:01.057 [rsSync] Index: (1/3) External Sort Progress: 7717000/8429282 91%
Mon Mar 10 16:52:11.568 [rsSync] Index: (1/3) External Sort Progress: 7825800/8429282 92%
Mon Mar 10 16:52:21.009 [rsSync] Index: (1/3) External Sort Progress: 7940600/8429282 94%
Mon Mar 10 16:52:31.007 [rsSync] Index: (1/3) External Sort Progress: 8049300/8429282 95%
Mon Mar 10 16:52:41.003 [rsSync] Index: (1/3) External Sort Progress: 8151500/8429282 96%
Mon Mar 10 16:52:51.008 [rsSync] Index: (1/3) External Sort Progress: 8288500/8429282 98%
Mon Mar 10 16:52:59.235 [rsSync] external sort used : 9 files in 1038 secs
Mon Mar 10 16:53:09.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 1120400/8429282 13%
Mon Mar 10 16:53:19.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 2620200/8429282 31%
Mon Mar 10 16:53:29.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 4067100/8429282 48%
Mon Mar 10 16:53:39.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 5539300/8429282 65%
Mon Mar 10 16:53:49.003 [rsSync] Index: (2/3) BTree Bottom Up Progress: 6962200/8429282 82%
Mon Mar 10 16:53:58.343 [rsSync] done building bottom layer, going to commit
Mon Mar 10 16:54:00.684 [rsSync] build index done. scanned 8429282 total records. 1100.07 secs
Mon Mar 10 16:54:00.684 [rsSync] build index cxxx_pxxx.day { stat: 1.0 }
Mon Mar 10 16:54:00.721 [rsSync] build index done. scanned 0 total records. 0.037 secs
Mon Mar 10 16:54:00.721 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0 }
Mon Mar 10 16:54:00.723 [rsSync] build index done. scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.723 [rsSync] build index cxxx_pxxx.day { bucket: 1.0 }
Mon Mar 10 16:54:00.724 [rsSync] build index done. scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.724 [rsSync] build index cxxx_pxxx.day { bucket: -1.0 }
Mon Mar 10 16:54:00.747 [rsSync] build index done. scanned 0 total records. 0.023 secs
Mon Mar 10 16:54:00.747 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: 1.0 }
Mon Mar 10 16:54:00.749 [rsSync] build index done. scanned 0 total records. 0.001 secs
Mon Mar 10 16:54:00.749 [rsSync] build index cxxx_pxxx.day { caxxx: 1.0, stat: 1.0, bucket: -1.0 }
Mon Mar 10 16:54:00.750 [rsSync] build index done. scanned 0 total records. 0 secs
Mon Mar 10 16:54:00.750 [rsSync] replSet initial sync cloning indexes for : clxxxx
Mon Mar 10 16:54:00.860 [rsSync] Socket say send() errno:110 Connection timed out m5.example.com:27017
Mon Mar 10 16:54:00.923 [rsSync] replSet initial sync exception: 9001 socket exception [SEND_ERROR] server [m5.example.com:27017] 9 attempts remaining
Mon Mar 10 16:54:17.663 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:17.663 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:18.663 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:26.664 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:26.884 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:30.941 [rsSync] replSet initial sync pending
Mon Mar 10 16:54:30.941 [rsSync] replSet syncing to: m5.example.com:27017
Mon Mar 10 16:54:31.042 [rsSync] replSet initial sync drop all databases
Mon Mar 10 16:54:31.061 [rsSync] dropAllDatabasesExceptLocal 19
Mon Mar 10 16:54:31.123 [rsSync] removeJournalFiles
Mon Mar 10 16:54:48.391 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.391 [rsHealthPoll] replset info m9.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:48.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:48.923 [rsHealthPoll] replset info m8.example.com:27017 heartbeat failed, retrying
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet info m9.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.391 [rsHealthPoll] replSet member m9.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:49.923 [rsHealthPoll] DBClientCursor::init call() failed
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet info m8.example.com:27017 is down (or slow to respond):
Mon Mar 10 16:54:49.923 [rsHealthPoll] replSet member m8.example.com:27017 is now in state DOWN
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is up
Mon Mar 10 16:54:52.052 [rsHealthPoll] replSet member m8.example.com:27017 is now in state SECONDARY
Mon Mar 10 16:54:53.217 [rsSync] removeJournalFiles
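If it helps, a rough scan like the one below (Python; the log path is an assumption, and the matched strings are taken from the excerpt above) can count how often the initial sync bailed out and restarted on the secondary:

import sys
from collections import Counter

# Path is an assumption; pass the real secondary log as the first argument.
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/mongodb/mongod.log"

counts = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if "replSet initial sync exception" in line:
            counts["initial sync exceptions"] += 1
        elif "replSet initial sync drop all databases" in line:
            counts["initial sync restarts (drop all databases)"] += 1
        elif "Socket say send() errno:110" in line:
            counts["send timeouts to the sync source"] += 1

for label, n in counts.items():
    print("%s: %d" % (label, n))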
What more can I share that would help debug this problem? Does it seem like initial sync is sensitive to high load?
- Related to: SERVER-21068 Ability to resume initial sync attempts (Closed)