- Type: Bug
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: Configuration
We are trying to query a MongoDB Atlas database from an Azure Databricks cluster.
The Atlas database is hosted on an M10 cluster with its three electable nodes in AWS (the ones used by the transactional application) and an additional read-only node in Azure (the one we are trying to connect to).
We have already set up a peering connection between our VNet and the Atlas one, and whitelisted the appropriate IP range. We confirmed that we can ping the read-only node using its private DNS name from one of the Databricks worker nodes, and that we can telnet to port 27017. Moreover, using pymongo from one of the workers we are able to connect to the database and query the collections.
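For completeness, the pymongo check from a worker was roughly along these lines (a sketch rather than the exact script; the URI mirrors the SRV string shown further below, placeholders included):

from pymongo import MongoClient, ReadPreference

# SRV connection string with the nearest read preference and the Azure tags
uri = (
    "mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/"
    "?tls=true&readPreference=nearest"
    "&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY"
    "&readConcernLevel=local"
)
client = MongoClient(uri, serverSelectionTimeoutMS=10000)

# Database commands default to the primary, so we pass the read preference
# explicitly; the ping reaches the Azure read-only node.
client.admin.command("ping", read_preference=ReadPreference.NEAREST)

# Collection reads inherit the nearest/tagged preference from the URI and
# return documents as expected.
print(client["<database>"]["<collection>"].find_one())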
However, when we try to connect from Databricks through the connector, we get timeout errors that appear to be related to the mongo-spark-connector not honoring the readPreference configuration.
This is the URI we are trying to use (omitting sensitive details):
mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/<database>.<collection>?tls=true&readPreference=nearest&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY&readConcernLevel=local
Yet, when we try to load the data as a DataFrame and perform a simple show(), we get a connection timeout error.
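The read itself is roughly the following (a simplified sketch; the format and option names follow the v10.x connector documentation, and the URI is the one shown above with the database/collection moved into separate options):

# `uri` holds the SRV connection string shown above (placeholders included)
uri = (
    "mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/"
    "?tls=true&readPreference=nearest"
    "&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY"
    "&readConcernLevel=local"
)

df = (
    spark.read.format("mongodb")        # v10.x connector short name
    .option("connection.uri", uri)
    .option("database", "<database>")
    .option("collection", "<collection>")
    .load()
)

df.show()  # the connection timeout is raised here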
The stack trace of the exception shows that the driver was able to ping the desired node (while being unable to reach the AWS ones, as expected), but it refuses to connect to it because the node does not match the expected readPreference: primary.
We also tried specifying each of the parameters as individual options, both through the global cluster configuration and in code, and we tried both the v10.0 and the v3.0 versions of the connector. Nevertheless, no matter what we tried, we always got the same error.
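For example, one of the variants with the v3.0 connector passed the read preference as individual options instead of URI parameters (a sketch; the option keys and tag-set syntax follow our reading of the connector's input-configuration docs):

df = (
    spark.read.format("mongo")          # v3.x connector short name
    .option("uri", "mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/?tls=true")
    .option("database", "<database>")
    .option("collection", "<collection>")
    .option("readPreference.name", "nearest")
    .option("readPreference.tagSets",
            '[{"provider": "AZURE", "region": "US_EAST", "nodeType": "READ_ONLY"}]')
    .load()
)

df.show()  # same failure: no server matching readPreference primary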