Drivers / DRIVERS-1981

Reconsider interaction between srvMaxHosts and SRV polling


      Summary

      Opening this to record a conversation that occurred over Slack regarding srvMaxHosts (DRIVERS-1519). When srvMaxHosts is in use, drivers currently only drop connections if existing hosts become unavailable or are no longer present in the most recently polled SRV records. This leads to "sticky" behavior whereby an application is likely to keep the hosts it originally selected even as new hosts are added to a cluster over time (e.g. an Atlas cluster scales up). jeff.yemin's proposal was to reduce the stickiness and allow existing mongos connections to be exchanged more frequently:

      As an example: let's say there are 10 mongos servers and srvMaxHosts is 5. The customer determines that they need to expand the number of mongos servers in order to reduce load on each. So they add 10 more and the SRV record is updated to reflect the change. What happens? Since all existing MongoClient instances are already connected to 5 of the existing 10, then when the SRV record is next polled, nothing will change. They will all stay connected to the same 5 hosts, and the 10 new ones will get no load until applications are restarted.
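      For illustration, here is a rough Python sketch of the current "sticky" selection rule; the names (rescan, current_hosts, etc.) are hypothetical and this is not actual driver code. It reproduces the scenario above: growing the SRV record from 10 to 20 hosts leaves a client that is already at srvMaxHosts unchanged.

```python
import random

def rescan(current_hosts, srv_hosts, srv_max_hosts):
    """Sticky selection (current behavior): keep every host we already
    use that is still present in the SRV results, and only top up with
    new hosts if removals dropped us below srvMaxHosts."""
    retained = [h for h in current_hosts if h in srv_hosts]
    if len(retained) < srv_max_hosts:
        candidates = [h for h in srv_hosts if h not in retained]
        random.shuffle(candidates)
        retained.extend(candidates[:srv_max_hosts - len(retained)])
    return retained

# Scenario from the example: 10 mongoses, srvMaxHosts=5, then 10 more are added.
srv = {f"mongos{i:02d}.example.com" for i in range(10)}
current = random.sample(sorted(srv), 5)

srv |= {f"mongos{i:02d}.example.com" for i in range(10, 20)}
print(rescan(current, srv, 5) == current)  # True: the client keeps its original 5 hosts
```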

      To avoid unnecessary churn on each polling interval, Jeff also suggested storing a snapshot of the most recent SRV results so that reshuffling need only happen if the SRV results have changed (i.e. mongos hosts are added or removed from the cluster).
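      A minimal sketch of that suggestion (again hypothetical, not driver code): the client remembers the last SRV snapshot and re-selects up to srvMaxHosts hosts at random only when the polled result set differs from that snapshot.

```python
import random

class SrvMonitor:
    """Sketch of the proposed behavior: reshuffle only when the SRV
    result set actually changes between polls."""

    def __init__(self, srv_max_hosts):
        self.srv_max_hosts = srv_max_hosts
        self.last_snapshot = frozenset()
        self.selected = []

    def on_poll(self, srv_hosts):
        snapshot = frozenset(srv_hosts)
        if snapshot == self.last_snapshot:
            return self.selected  # SRV records unchanged: keep current hosts
        self.last_snapshot = snapshot
        # SRV records changed: re-select hosts at random, spreading load
        # across any newly added mongoses.
        self.selected = random.sample(sorted(snapshot),
                                      min(self.srv_max_hosts, len(snapshot)))
        return self.selected
```

      The tradeoff, raised in the response below, is that any change to the SRV result set (including temporarily removing a few hosts for maintenance) would trigger a reshuffle across most clients.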

      To present both sides of the argument, james.kovacs's response follows:

      To play devil's advocate, let's take the example of 100 mongos nodes in the SRV record. App servers are configured with srvMaxHosts set to 5. The customer wants to take 10% of their mongos instances out for maintenance. This would cause a lot of churn as it is unlikely that any app server would choose the same 5 mongos instances out of the remaining 90. And this problem would repeat when they re-introduced those 10 and took the next 10 out for maintenance.

      Atlas tries to perform version upgrades (and hardware upgrades) in place. Even if a new VM needs to be provisioned, there is code in the planner to keep the same host names so the DNS records don't have to be updated. Since Atlas has a mongos instance deployed on the same VM as every mongod, if a customer added a shard to a large cluster, this would likely cause their applications to recreate all of their connection pools, since they would be redirected to new mongos instances if we performed random selection on every SRV change. I think that sticky mongos instances are a more predictable solution in terms of scaling a cluster.

      Motivation

      Who is the affected end user?

      Customers using srvMaxHosts in long-running applications.

      How does this affect the end user?

      As the number of mongoses in a cluster changes over time, app servers using srvMaxHosts might stick to their originally selected mongos hosts. Across the entire application, this could lead to an imbalance in connections to mongos hosts.

      How likely is it that this problem or use case will occur?

      This will occur for applications connected to clusters that expand or contract in size (i.e. mongoses are added or removed and SRV records are updated).

      If the problem does occur, what are the consequences and how severe are they?

      There may be a performance concern, as some mongoses will retain more connections (and therefore more utilization) than others.

      Is this issue urgent?

      No. This ticket is being opened to record a Slack discussion in the event that we need to revisit this feature down the line.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      No.

            Assignee: Unassigned
            Reporter: Jeremy Mikola (jmikola@mongodb.com)
            Votes: 1
            Watchers: 6
