- Type: Bug
- Resolution: Unresolved
- Priority: Minor - P4
- Fix Version/s: None
- Affects Version/s: None
- Component/s: None
- Environment: MongoDB 5 data and config nodes run on GCP VMs. Mongos runs as a Kubernetes Deployment, with a Kubernetes headless Service acting as the load balancer that serves the SRV records. The driver we're using is the Go driver, v1.13.2.
- Assigned Teams: Catalog and Routing
- Operating System: ALL
When using the mongodb+srv:// scheme in the connection string against a sharded cluster, if the load balancer in front of the mongos instances takes an instance out of rotation and updates the SRV record, an app can get a "Cursor not found" error if it is in the middle of a multi-batch cursor read against that instance. These errors require special handling in the application code. This seems like a bug, because neither the app nor the operator of the mongos instances can do anything to prevent the errors from occurring.
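For illustration, here is a minimal sketch of the kind of special handling the app code currently needs, assuming the Go driver: it reruns the whole query when iteration fails with the server's CursorNotFound error (code 43). The SRV hostname, database, and collection names are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

const cursorNotFound = 43 // server error code when a getMore reaches a mongos that no longer owns the cursor

// readAll runs the query and retries it once from scratch if the cursor is
// lost mid-iteration, e.g. because its mongos was taken out of rotation.
func readAll(ctx context.Context, coll *mongo.Collection, filter bson.D) ([]bson.M, error) {
	for attempt := 0; attempt < 2; attempt++ {
		cur, err := coll.Find(ctx, filter)
		if err != nil {
			return nil, err
		}
		var docs []bson.M
		if err := cur.All(ctx, &docs); err != nil {
			var se mongo.ServerError
			if errors.As(err, &se) && se.HasErrorCode(cursorNotFound) {
				continue // cursor died on a retired mongos; rerun the whole query
			}
			return nil, err
		}
		return docs, nil
	}
	return nil, fmt.Errorf("query failed twice with CursorNotFound")
}

func main() {
	ctx := context.Background()
	// Hypothetical SRV URI served by the Kubernetes headless Service.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb+srv://mongos.example.svc.cluster.local/app"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	docs, err := readAll(ctx, client.Database("app").Collection("events"), bson.D{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", len(docs), "documents")
}
```

Retrying from scratch like this is wasteful for large result sets, which is why transparent handling in the driver or server would be preferable.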
As an app developer, I'd like this situation to be handled transparently by either the server or the driver. I'm unfamiliar with MongoDB internals, so the following suggestions/questions may be naive:
- Could mongos replicate live cursors to other mongos instances, so that a driver could continue iterating a cursor through any of them?
- Could cursor information be sent to the driver, so that the driver could detect this situation itself, hand the cursor state to another mongos instance, and have that instance resume iteration?
- Could the driver preserve old mongos instances in its local SRV record cache for some configurable duration? That duration could then be set to the maximum expected query runtime in the application, and the mongos deployment's shutdown grace period updated to match (so an instance doesn't shut down earlier than that).
- Could the driver transparently manage batching without resorting to stateful server-side cursors? I.e., could the driver ensure there is only one batch per server-side query, even if several single-batch queries are needed to satisfy the query the app is making? (A sketch of this idea follows the list.)
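On the last suggestion, here is a minimal sketch of what such batching could look like, written as application code against the Go driver under assumed names (collection, page size): each page is an independent query whose limit equals its batch size, so the entire result arrives in the first batch and no server-side cursor has to survive between round trips; iteration resumes via an _id range filter.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// findInSingleBatches emulates a multi-batch read as a series of independent,
// single-batch queries keyed on _id, so retiring a mongos between pages
// cannot lose cursor state.
func findInSingleBatches(ctx context.Context, coll *mongo.Collection, pageSize int64, visit func(bson.M) error) error {
	var lastID interface{} = primitive.MinKey{} // resume point; every _id sorts above MinKey
	for {
		opts := options.Find().
			SetSort(bson.D{{Key: "_id", Value: 1}}).
			SetLimit(pageSize).           // limit == batchSize, so the
			SetBatchSize(int32(pageSize)) // whole result fits in the first batch
		cur, err := coll.Find(ctx, bson.D{{Key: "_id", Value: bson.D{{Key: "$gt", Value: lastID}}}}, opts)
		if err != nil {
			return err
		}
		var page []bson.M
		if err := cur.All(ctx, &page); err != nil {
			return err
		}
		for _, doc := range page {
			if err := visit(doc); err != nil {
				return err
			}
			lastID = doc["_id"]
		}
		if int64(len(page)) < pageSize {
			return nil // short page means we've read everything
		}
	}
}

func main() {
	ctx := context.Background()
	// Hypothetical SRV URI served by the Kubernetes headless Service.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb+srv://mongos.example.svc.cluster.local/app"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	n := 0
	err = findInSingleBatches(ctx, client.Database("app").Collection("events"), 1000, func(doc bson.M) error {
		n++
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read", n, "documents")
}
```

The trade-offs are that the sort key must be indexed and totally ordered, and the pages no longer form a single consistent snapshot; in exchange, any mongos can serve any page.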