Type: Bug
Resolution: Unresolved
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- car-product-sync
Environment:
MongoDB 5 data and config nodes run on GCP VMs. Mongos runs as a Kubernetes deployment, with a Kubernetes headless service acting as the load balancer that serves SRV records. The driver in use is the Go driver, version 1.13.2.

Assigned Teams:

Catalog and Routing
Operating System:
ALL
Steps To Reproduce:
This is hard to reproduce reliably because it is essentially a race condition, but the following steps seem to trigger the error:

1. The app starts a query that needs more than one batch of results to complete, e.g. the default batch size is unchanged and the query returns more than 101 results.
2. A mongos instance sends the first batch to the app.
3. The load balancer takes that same mongos instance out of rotation and updates the SRV record so that the instance is no longer listed.
4. The driver refreshes the SRV record in the background and removes the mongos instance from its list of available instances.
5. The app finishes processing the first batch and the driver transparently requests the second batch. This time it has to query a different mongos instance, because the original one is no longer available. The other mongos instance does not have the original cursor and returns the "cursor not found" error.
Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

When the connection string uses the mongodb+srv:// scheme against a sharded cluster, and the load balancer in front of the mongos instances takes an instance out of rotation and updates the SRV record, apps can receive a "cursor not found" error if they are in the middle of a multi-batch cursor read against that instance. These errors require special handling in the app code. This seems like a bug, because neither the app nor the mongos operator can do anything to prevent these errors from occurring.

As an app developer I'd like for this situation to be handled transparently by either the server or the driver. I'm unfamiliar with Mongo internals, so the following suggestions/questions may be naive:

- Could mongos replicate live cursors to other mongos instances such that drivers could access any mongos instance?
- Could cursor information be sent to the driver, so the driver could itself detect this situation and send the cursor info to another mongos instance and have it resume cursor iteration?
- Could the driver preserve old mongos instances in its local SRV record cache for some configurable duration? This duration could then be set to the maximum expected query runtime within the application, and the mongos deployment updated to match (so it doesn't shut down earlier than that).
- Could the driver transparently manage batching without resorting to stateful server-side cursors? i.e. could the driver ensure that there's only one batch per query, even if several one-batch queries need to be made to satisfy the query the app is making?
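To make the last question concrete, here is a minimal sketch of what cursor-free batching could look like if done by hand: range (keyset) pagination, where every page is an independent single-batch query that any mongos could serve. `pageFunc`, `readAllPaged`, and `inMemoryStore` are hypothetical; in a real app the page function would be a query that filters on a unique, sortable key greater than the last one seen, sorts by that key, and applies a limit. The sketch assumes such a key exists and that a consistent snapshot across pages is not required.

```go
package main

import (
	"fmt"
	"sort"
)

// pageFunc is a hypothetical stateless query: it returns up to limit keys
// strictly greater than after, in ascending order. No server-side cursor
// has to survive between calls.
type pageFunc func(after, limit int) []int

// readAllPaged collects every key greater than start by issuing
// independent one-batch queries until a short page signals the end.
func readAllPaged(fetch pageFunc, start, limit int) []int {
	if limit < 1 {
		return nil
	}
	var out []int
	after := start
	for {
		page := fetch(after, limit)
		out = append(out, page...)
		if len(page) < limit {
			return out
		}
		after = page[len(page)-1] // resume after the last key seen
	}
}

// inMemoryStore builds a pageFunc over a fixed set of integer keys,
// standing in for a collection sorted by a unique key.
func inMemoryStore(keys []int) pageFunc {
	sorted := append([]int(nil), keys...)
	sort.Ints(sorted)
	return func(after, limit int) []int {
		i := sort.SearchInts(sorted, after+1) // index of first key > after
		j := i + limit
		if j > len(sorted) {
			j = len(sorted)
		}
		return sorted[i:j]
	}
}

func main() {
	keys := make([]int, 250)
	for i := range keys {
		keys[i] = i
	}
	all := readAllPaged(inMemoryStore(keys), -1, 101)
	fmt.Println(len(all))
}
```

The trade-off is that the result set is no longer one logical cursor: documents inserted or removed between pages can change what later pages return.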

Assignee: Unassigned

Reporter: D V

Participants: Chris Kelly, D V

Votes: 0

Watchers: 11

Created: Mar 15 2024 04:57:49 PM UTC

Updated: Nov 28 2024 03:53:22 PM UTC

GA Target Date: None

Public Preview Target Date: None

Private Preview Target Date: None

Experiment Target Date: None
