Type: Question
Resolution: Incomplete
Priority: Major - P3
Affects Version/s: 2.6.6
Component/s: Replication
We have a problem with our replica set. It runs on three virtual servers, and if any of the mongod processes goes down, the set normally continues working with the remaining members. However, if any of the servers disappears completely, i.e. stops responding to network traffic at all (because it is down, all of its outgoing traffic is blocked by a firewall, or it is powered off suddenly), every query to the replica set takes an extra 15 seconds. Judging from the network traffic, this is due to TCP retransmits.
This extra 15 seconds on every query makes our load balancer think all nodes are down, and it shuts off traffic to the whole setup.
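To illustrate the difference between the two failure modes described above, here is a minimal Python sketch of a TCP probe. It is not taken from our setup: the `probe` and `report` helpers and the host names in `MEMBERS` are hypothetical placeholders. The assumption is that a host that is up but has no mongod listening refuses the connection almost immediately, whereas a host that has vanished leaves the connect attempt waiting until the timeout expires.

```python
import socket
import time

# Hypothetical replica set members; replace with real host/port pairs.
MEMBERS = [
    ("mongo1.example.internal", 27017),
    ("mongo2.example.internal", 27017),
    ("mongo3.example.internal", 27017),
]


def probe(host, port, timeout=2.0):
    """Attempt a TCP connect and classify the outcome.

    Returns (status, elapsed_seconds), where status is one of
    "open", "refused", "timeout", or "error: <details>".
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "open"
    except socket.timeout:
        # No answer at all within the timeout: the host is likely unreachable.
        status = "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        status = "refused"
    except OSError as exc:
        # Anything else, e.g. a DNS failure or "no route to host".
        status = "error: %s" % exc
    return status, time.monotonic() - start


def report(members, timeout=2.0):
    """Print one line per member with the probe result and elapsed time."""
    for host, port in members:
        status, elapsed = probe(host, port, timeout)
        print("%s:%d %s (%.2fs)" % (host, port, status, elapsed))
```

Calling `report(MEMBERS)` from a client machine while one server is powered off would be expected to show a `timeout` for that member and fast results for the others, which would be consistent with the delay coming from unanswered connection attempts.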
Since connecting to the other replica set members with the mongo shell works fine, we originally reported this as a bug in the node.js driver (https://jira.mongodb.org/browse/NODE-350), but we later tried the PHP driver and were able to reproduce similar (although not identical) behaviour.
We also reproduced the problem in our secondary setup in another data center, so it shouldn't be data-center specific. Both may be running the same virtualization platform, though; we haven't looked into that yet.
Any ideas on how to move forward with this?