-
Type: Improvement
-
Resolution: Done
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Internal Client, Sharding
-
None
-
Service Arch
When a client attempts to connect through a mongos, according to the documentation, mongos looks at all of the mongod instances with eligible ping times and picks one at random. Once the connection is made, it will remain in place until the client closes it or it fails, at which time it will transparently renegotiate with another mongod.
This behavior can lead to semi-pathological outcomes under the following circumstances:
1. Clients are long-running AND
2. Overall replica set load changes over time AND/OR
3. Mongod instances are restarted AND/OR
4. Ping times change over time
The above scenarios can lead to extremely unbalanced loads, with some replicas supporting many times the number of connections as others and operating with orders of magnitude higher CPU. This is because clients connected to the highly-loaded mongod instances have no opportunity to notice that there are other more-lightly-loaded instances available. As a result, the overall performance of the replica set, and the application, suffers.
Such outcomes could be significantly mitigated without a real "load-balancing" feature, simply by allowing clients to transparently renegotiate their mongod connections after a certain amount of time, just as they do in the case of failures. The existing randomization should be sufficient to keep the load close to balanced; the mongod instances just need an opportunity for clients to connect to them.