Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Documentation
I run an aggregation pipeline and use the returned documents to build subsequent queries. Because the SparkContext is not available inside the distributed RDD tasks on the Executors, I need an alternative.
The documentation does not offer any information on how to run these subsequent queries. I was expecting some "best practice" advice, because this seems like a common use case.
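For context, a minimal sketch of the setup I am describing, using the RDD API of the MongoDB Spark Connector in Scala (the URI, database/collection names, and the pipeline contents are placeholders, not my actual values):

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.Document

val conf = new SparkConf()
  .setAppName("subsequent-queries")
  // Placeholder URI; points the connector at the source collection.
  .set("spark.mongodb.input.uri", "mongodb://host:27017/mydb.events")
val sc = new SparkContext(conf)

// Initial aggregation pipeline; each resulting document carries the
// keys needed to build a follow-up query.
val aggregated = MongoSpark.load(sc)
  .withPipeline(Seq(Document.parse("""{ $match: { status: "open" } }""")))
```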
My question is which option is better:
- In the RDD tasks (for example in map()), use the plain MongoDB driver to connect to the MongoDB cluster, fire the subsequent query, await the results, and return them as the result of the map() function (see the first sketch below the list)
- collect() the previously created queries to bring them to the driver, then create an aggregation pipeline RDD for each of these queries (second sketch below)
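A hedged sketch of the first option, using the synchronous MongoDB Java driver inside mapPartitions() (the URI, database/collection names, and the refId field are assumptions for illustration; `aggregated` is the RDD from the sketch above):

```scala
import com.mongodb.client.MongoClients
import org.bson.Document
import scala.collection.JavaConverters._

val followUpResults = aggregated.mapPartitions { docs =>
  // One client per partition rather than per document keeps
  // connection churn on the Executors manageable.
  val client = MongoClients.create("mongodb://host:27017") // placeholder URI
  val coll = client.getDatabase("mydb").getCollection("details")
  // Materialize the results before closing the client, because
  // find() returns a lazy cursor.
  val out = docs.flatMap { doc =>
    coll.find(new Document("refId", doc.get("_id"))).iterator().asScala
  }.toList
  client.close()
  out.iterator
}
```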
The second approach, however, is not asynchronous: the driver blocks on collect() until all results have been gathered. Calling collectAsync() does not block, but as far as I know, accessing intermediate results is not possible either.
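And a sketch of the second option, which stays entirely on the driver (the someKey/refId field names are assumptions; each follow-up RDD here reads from the collection configured in spark.mongodb.input.uri, though a ReadConfig could target a different one):

```scala
import com.mongodb.spark.MongoSpark
import org.bson.Document

// collect() blocks the driver until every query document has arrived.
val queryDocs = aggregated.collect()

// Build one follow-up aggregation pipeline RDD per collected document.
val followUps = queryDocs.map { doc =>
  val matchStage = new Document("$match",
    new Document("refId", doc.get("someKey"))) // assumed field names
  MongoSpark.load(sc).withPipeline(Seq(matchStage))
}
```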
Any hints on this problem? I wasn't sure whether to set the issue type to improvement or question; I chose improvement because I thought it was a bit of both, and the documentation could use a few words on this matter.