Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 3.0.3
Component/s: Replication
Environment: Linux 3.13.0-53-generic #89~precise1-Ubuntu SMP Wed May 20 17:42:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
32-core / 128GB RAM, all SSD, monster servers.
Operating System: Linux
(copied to CRM)
Similar to SERVER-17140 but much more insidious: this issue prevented a downed replica from completing an initial sync twice in a row, and we had to resort to tcp_tw_reuse to get the replica back online.
Scenario: a replica set with, among other things, a large collection of long-lived documents (10 million objects / 190GB of data) and a small collection of highly transient documents (1000 objects / 5MB of data).
During initial sync it takes about an hour for the bulk of the data to be replicated and the indices to be built for the big collection; during this time a large number of the documents in the small collection are updated and deleted. Once the data copy completed, the log exploded into a fire-hose of:
I REPL [repl writer worker 12] replication failed to apply update: { ts: Timestamp 1432762108000|244, h: -8874966953786363575, v: 2, op: "u", ns: "lotus_tem
I REPL [repl writer worker 4] replication failed to apply update: { ts: Timestamp 1432762108000|232, h: -2423412449985218813, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 10] replication failed to apply update: { ts: Timestamp 1432762108000|259, h: 6507151402806398582, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 13] replication failed to apply update: { ts: Timestamp 1432762108000|251, h: 8927018345590464930, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 8] replication failed to apply update: { ts: Timestamp 1432762108000|242, h: 7518410875297535456, v: 2, op: "u", ns: "lotus_temp.
I REPL [repl writer worker 12] replication info adding missing object
I REPL [repl writer worker 12] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 12] replication o2: { _id: ObjectId('5566368006c56f080dfc5033') }
I REPL [repl writer worker 12] replication o firstfield: $set
I REPL [repl writer worker 4] replication info adding missing object
I REPL [repl writer worker 4] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 4] replication o2: { _id: ObjectId('556636ec06c56f330efc4fc1') }
I REPL [repl writer worker 4] replication o firstfield: $set
I REPL [repl writer worker 10] replication info adding missing object
I REPL [repl writer worker 10] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 10] replication o2: { _id: ObjectId('556636ca07c56f642fa8f0a5') }
I REPL [repl writer worker 10] replication o firstfield: $set
I REPL [repl writer worker 13] replication info adding missing object
I REPL [repl writer worker 13] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 13] replication o2: { _id: ObjectId('5566366c06c56f340dfc4fdc') }
I REPL [repl writer worker 13] replication o firstfield: $set
I REPL [repl writer worker 8] replication info adding missing object
I REPL [repl writer worker 8] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 8] replication o2: { _id: ObjectId('556634a508c56ffa267b23cd') }
I REPL [repl writer worker 8] replication o firstfield: $set
This goes on for about 140,000 log lines in about a minute, during which time the syncing node appears to be making a huge number of short-lived connections; we can see this happening in the log of the server it's syncing from:
2015-05-27T19:18:43.379-0400 I NETWORK [conn100064] end connection x.x.x.x:35686 (1114 connections now open)
2015-05-27T19:18:43.379-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35687 #100065 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [conn100065] end connection x.x.x.x:35687 (1114 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35688 #100066 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [conn100066] end connection x.x.x.x:35688 (1114 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35689 #100067 (1115 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK [conn100067] end connection x.x.x.x:35689 (1114 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35690 #100068 (1115 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK [conn100068] end connection x.x.x.x:35690 (1114 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35691 #100069 (1115 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK [conn100069] end connection x.x.x.x:35691 (1114 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35692 #100070 (1115 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK [conn100070] end connection x.x.x.x:35692 (1114 connections now open)
Note the freakishly short lifetimes! This of course isn’t healthy for the primary but it gets worse than that; on the new replica, while this firehose was blasting the primary, I ran a quick check of the TIME_WAIT reservations:
# netstat -an | grep TIME_WAIT | wc -l
25246
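As a rough sanity check (assuming the stock Linux 60-second TIME_WAIT interval and the default 32768-61000 ephemeral port range; we did not verify the exact values on these boxes), that backlog is right at the ceiling of what one client can sustain against a single destination. A quick bit of Python arithmetic:

# Back-of-the-envelope port-exhaustion check (assumed defaults, see above).
EPHEMERAL_PORTS = 61000 - 32768 + 1   # ~28k usable source ports per destination
TIME_WAIT_SECS = 60                   # closed sockets stay reserved this long
observed_time_wait = 25246            # from the netstat count above

# Rate implied by the backlog: every TIME_WAIT socket was opened
# within the last TIME_WAIT_SECS seconds.
print(observed_time_wait / TIME_WAIT_SECS)   # ~420 new connections/sec
# Maximum sustainable rate before connect() fails with errno 99:
print(EPHEMERAL_PORTS / TIME_WAIT_SECS)      # ~470 connections/sec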
This means we’re exhausting ephemeral network ports at a furious rate, and of course we eventually end up choking:
I REPL [repl writer worker 2] replication info adding missing object
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 1 of 3
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 2 of 3
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 3 of 3
I - [repl writer worker 2] Assertion: 15916:Can no longer connect to initial sync source: war-mongo2:27017
I CONTROL [repl writer worker 2] 0xf6d639 0xf0d371 0xef25cf 0xef267c 0xcaced2 0xcac176 0xcb2b5c 0xf0418b 0xfbab94 0x7f0eb110ce9a 0x7f0eafbe338d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6D639"},{"b":"400000","o":"B0D371"},{"b":"400000","o":"AF25CF"},{"b":"400000","o":"AF267C"},{"b":"400000","o":"8ACED2"},{"b":"400000","o":"8AC176"},{"b":"4
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6d639]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0d371]
 mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef25cf]
 mongod(+0xAF267C) [0xef267c]
 mongod(_ZN5mongo4repl4Sync13getMissingDocEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjE+0x622) [0xcaced2]
 mongod(_ZN5mongo4repl4Sync11shouldRetryEPNS_16OperationContextERKNS_7BSONObjE+0x196) [0xcac176]
 mongod(_ZN5mongo4repl21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x8C) [0xcb2b5c]
 mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf0418b]
 mongod(+0xBBAB94) [0xfbab94]
 libpthread.so.0(+0x7E9A) [0x7f0eb110ce9a]
 libc.so.6(clone+0x6D) [0x7f0eafbe338d]
----- END BACKTRACE -----
2015-05-27T19:18:56.663-0400 E REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 15916 Can no longer connect to initial sync source: war-mongo2:27017 on: { ts
2015-05-27T19:18:56.663-0400 I - [repl writer worker 2] Fatal Assertion 16361
2015-05-27T19:18:56.663-0400 I - [repl writer worker 2] ***aborting after fassert() failure
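The backtrace points at multiInitialSyncApply -> Sync::shouldRetry -> Sync::getMissingDoc, so our guess (and it is only a guess; the sketch below uses hypothetical names and is not mongod's actual code) is that every missing-document fetch opens and closes its own connection to the sync source, roughly like this:

# Sketch of the suspected fetch-missing-document pattern; hypothetical
# names throughout, NOT the mongod implementation.
from pymongo import MongoClient

def apply_update_locally(op):
    # Stand-in for applying the oplog update locally; returns False when
    # the target document does not exist ("failed to apply update" above).
    return False

def fetch_missing_doc(sync_source_uri, op):
    # A brand-new client per missing document means a brand-new TCP
    # connection, which lands in TIME_WAIT the moment it is closed.
    client = MongoClient(sync_source_uri)
    try:
        db_name, coll_name = op["ns"].split(".", 1)
        # May return None: "missing object not found on source.
        # presumably deleted later in oplog"
        return client[db_name][coll_name].find_one(op["o2"])
    finally:
        client.close()

def apply_batch(sync_source_uri, oplog_batch):
    for op in oplog_batch:
        if op.get("op") == "u" and not apply_update_locally(op):
            # ~140,000 of these in about a minute means tens of thousands
            # of short-lived connections, i.e. port exhaustion (errno 99).
            fetch_missing_doc(sync_source_uri, op)

If that is roughly what is happening, reusing one long-lived connection (or a small pool) for the missing-document fetches would avoid the TIME_WAIT pile-up; tcp_tw_reuse on our side only papers over it.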
We had this happen twice before we started messing with kernel parameters, which ultimately got us back up and running.
Since we also rebuilt larger "mostly write, no delete" replica sets the same day and never saw this, it seems likely that our highly transient collection is what is causing this sort of connection abuse.
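In case it helps to reproduce this, here is a minimal sketch of the transient workload described above (hypothetical host, database, and collection names; sizes scaled down; run it against the primary while a new member is doing its initial sync of a much larger collection):

# Minimal reproduction sketch; all names here are hypothetical.
import random
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host:27017/")  # assumed URI
coll = client["testdb"]["transient"]                   # hypothetical namespace

# Small, highly transient collection: ~1000 live documents at any time,
# constantly updated and deleted while the big collection is still syncing.
ids = [coll.insert_one({"state": 0}).inserted_id for _ in range(1000)]

while True:
    _id = random.choice(ids)
    if random.random() < 0.5:
        # The secondary will fail to apply this update if the document
        # never made it into its initial clone ...
        coll.update_one({"_id": _id}, {"$set": {"state": random.random()}})
    else:
        # ... and the delete guarantees "missing object not found on
        # source. presumably deleted later in oplog" on the secondary.
        coll.delete_one({"_id": _id})
        ids.remove(_id)
        ids.append(coll.insert_one({"state": 0}).inserted_id)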
is duplicated by:
- SERVER-19176: Mongo instance dies while creating a large replica (Closed)
related to:
- SERVER-42078: mongodb is unable to replay the oplog and restarts the initial-sync (Closed)