Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
ALL
We are running an initial sync of a large database (~70 GB) from replica set member A to replica set member B. Both members run a mongod instance on Windows. The Windows TCP connection interval is set to 30 minutes.
According to this FAQ page:
https://docs.mongodb.com/v3.2/faq/diagnostics/
we understand that MongoDB sets its own TCP timeout interval to 10 minutes on Windows.
With these settings we cannot complete the initial sync, because:
1. Building the index takes more than 10 minutes once the large database has been copied from A to B.
2. It looks like MongoDB cannot ACK TCP requests while it is building an index.
3. Consequently, once the index build finishes, instance B receives a TCP connection timeout error and has to restart the whole initial sync.
4. As a result, we are stuck on this large database.
Please advise.
Log containing this error:
2017-08-23T07:59:00.164+0800 I STORAGE [rsSync] 14456650 objects cloned so far from collection DB.COL
2017-08-23T07:59:04.007+0800 I STORAGE [rsSync] clone DB.COL 14457727
2017-08-23T07:59:53.315+0800 I INDEX [rsSync] build index on: stratus.position properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "DB.COL" }
2017-08-23T07:59:53.316+0800 I INDEX [rsSync] building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2017-08-23T07:59:56.019+0800 I - [rsSync] Index Build: 84400/14467992 0%
2017-08-23T07:59:59.000+0800 I - [rsSync] Index Build: 195300/14467992 1%
2017-08-23T08:00:02.002+0800 I - [rsSync] Index Build: 271400/14467992 1%
......
2017-08-23T08:10:21.002+0800 I - [rsSync] Index Build: 14266400/14467992 98%
2017-08-23T08:10:24.000+0800 I - [rsSync] Index Build: 14325000/14467992 99%
2017-08-23T08:10:27.002+0800 I - [rsSync] Index Build: 14408000/14467992 99%
2017-08-23T08:10:38.095+0800 I INDEX [rsSync] build index done. scanned 14467992 total records. 644 secs
2017-08-23T08:10:38.104+0800 I REPL [rsSync] initial sync data copy, starting syncup
2017-08-23T08:10:38.106+0800 I REPL [rsSync] oplog sync 1 of 3
2017-08-23T08:10:38.109+0800 I NETWORK [rsSync] Socket send() errno:10054 An existing connection was forcibly closed by the remote host. IP:port
2017-08-23T08:10:38.114+0800 I REPL [rsSync] connection lost to hostname:port; is your tcp keepalive interval set appropriately?
2017-08-23T08:10:38.136+0800 E REPL [rsSync] 9001 socket exception [FAILED_STATE] server [hostname:port(IP) failed]
2017-08-23T08:10:38.136+0800 E REPL [rsSync] initial sync attempt failed, 8 attempts remaining
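To make the timing in the log above concrete, here is a minimal Python sketch that measures how long the sync connection appears to have been quiet before the failed send. It rests on two assumptions that the report does not confirm: that the connection to member A carries no traffic between the last clone line and the failed `send()`, and that the 10-minute limit (our reading of the FAQ) is the value actually in effect. The names `ASSUMED_TIMEOUT_SECS`, `parse_ts`, and `seconds_between` are ours, for illustration only.

```python
from datetime import datetime

# Assumption: 10 minutes is the reporter's reading of the FAQ,
# not a verified server constant.
ASSUMED_TIMEOUT_SECS = 600


def parse_ts(line):
    """Parse the leading timestamp of a log line such as
    '2017-08-23T07:59:04.007+0800 I STORAGE ...'."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def seconds_between(start_line, end_line):
    """Elapsed seconds between the timestamps of two log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()


# Lines copied from the log in this report.
last_clone = "2017-08-23T07:59:04.007+0800 I STORAGE [rsSync] clone DB.COL 14457727"
send_error = (
    "2017-08-23T08:10:38.109+0800 I NETWORK [rsSync] Socket send() errno:10054 "
    "An existing connection was forcibly closed by the remote host. IP:port"
)

idle_secs = seconds_between(last_clone, send_error)
exceeds_timeout = idle_secs > ASSUMED_TIMEOUT_SECS

print(f"quiet period: {idle_secs:.3f} s, exceeds assumed timeout: {exceeds_timeout}")
```

For the lines in this report the quiet period comes to roughly 694 seconds, which is longer than the assumed 600-second limit and consistent with the 644-second index build reported in the log.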