Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
Environment: Ubuntu 16.04
I have a 3-node replica set running version 3.4.10 on Ubuntu 16.04.
I ran a schema update that touched all 7 million documents of a collection with a $set and a $rename. Because one of the secondaries is in Azure, about 30ms away, I used majority write concern to slow down the update and make sure at least one of the secondaries would stay in sync.
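For reference, the shape of such an update can be sketched as a MongoDB `update` command document. This is only an illustration of the operators and write concern described above, not the actual query: the collection name `items` and the field names `schemaVersion`, `oldName` and `newName` are hypothetical, as is the helper `build_migration_command`.

```python
def build_migration_command(collection, set_fields, rename_fields, wtimeout_ms=None):
    """Build an `update` command that applies $set and $rename to every
    document in `collection`, acknowledged with majority write concern.

    All names passed in are placeholders chosen by the caller.
    """
    write_concern = {"w": "majority"}
    if wtimeout_ms is not None:
        # Optional upper bound (in ms) on how long to wait for the majority.
        write_concern["wtimeout"] = wtimeout_ms

    # The command name must be the first key; Python dicts keep insertion order.
    return {
        "update": collection,
        "updates": [
            {
                "q": {},  # empty filter: match every document
                "u": {"$set": dict(set_fields), "$rename": dict(rename_fields)},
                "multi": True,  # update all matches, not just the first
            }
        ],
        "writeConcern": write_concern,
    }


# Hypothetical field names, for illustration only.
cmd = build_migration_command(
    "items",
    set_fields={"schemaVersion": 2},
    rename_fields={"oldName": "newName"},
)
print(cmd)
```

With a driver such as PyMongo, a document like this could be passed to `db.command(cmd)`. One caveat worth keeping in mind: as far as I understand, the write concern is waited on once the command has finished applying its writes, so for a single multi-document update it may not pace the individual document writes.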
The query started at 14:19:29. At that point the Azure secondary was probably 3-5 minutes behind because of earlier schema migrations. But by 14:27:00, the main secondary was unable to get results for oplog queries:
Jan 13 14:27:00 secondary mongod.27017[28273]: [replication-163] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 18482564 -- target:primary:27017 db:local expDate:2018-01-13T14:27:00.216+0000 cmd:{ getMore: 16483339842, collection: "oplog.rs", maxTimeMS: 5000, term: 25, lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 } }. Last fetched optime (with hash): { ts: Timestamp 1515853555000|1852, t: 25 }[1175973526525408650]. Restarts remaining: 3
That's also the time the replica set stopped accepting connections from clients.
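The optimes in the log line above can be turned into wall-clock times to see how far apart they are. The sketch below assumes that the 3.4 log format prints a timestamp as `<milliseconds since epoch>|<increment>`; the helper `parse_log_timestamp` is hypothetical and written only for this illustration.

```python
import re
from datetime import datetime, timezone

# Matches the "Timestamp <millis>|<increment>" form seen in the log line.
OPTIME_RE = re.compile(r"Timestamp (\d+)\|(\d+)")


def parse_log_timestamp(text):
    """Return (UTC datetime, increment) for the first timestamp in `text`.

    Assumes the first number is milliseconds since the Unix epoch.
    """
    match = OPTIME_RE.search(text)
    if match is None:
        raise ValueError("no Timestamp found in: %r" % text)
    millis, increment = int(match.group(1)), int(match.group(2))
    return datetime.fromtimestamp(millis // 1000, tz=timezone.utc), increment


# Values copied from the log line above.
committed, _ = parse_log_timestamp("lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 }")
fetched, _ = parse_log_timestamp("Last fetched optime: { ts: Timestamp 1515853555000|1852, t: 25 }")

gap_seconds = (fetched - committed).total_seconds()
print(committed.isoformat())  # 2018-01-13T14:25:39+00:00
print(fetched.isoformat())    # 2018-01-13T14:25:55+00:00
print(gap_seconds)            # 16.0
```

Under that assumption, the last known majority-committed optime is 16 seconds behind the last fetched optime, and the last fetched optime is itself about 65 seconds older than the 14:27:00 timeout.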
To get things running again I had to kill all three mongod processes (followed by kill -9, because shutdown tends to hang in this state).
After letting the nodes sync up, I was able to reproduce this again with the same query.
I can provide logs and the query privately if that would be useful.
Just guessing based on what I learned in SERVER-32398, maybe the primary froze up because it ran out of cache while waiting for the secondary to apply changes. But the update was running with majority write concern, so I would have thought the secondary couldn't have fallen far enough behind for that to occur.