Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Documentation
I run an aggregation pipeline and use the returned documents to build subsequent queries. Because the SparkContext is not available inside the distributed RDD tasks on the Executors, I need an alternative.
The documentation does not offer any information on how to run these subsequent queries. I was expecting some "best practice" advice, because this seems like a common use case.
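For context, a minimal sketch of the setup I am describing, using the RDD API of the MongoDB Spark Connector in Scala (the URI, database/collection names, and the pipeline contents are placeholders, not my actual values):

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.Document

val conf = new SparkConf()
  .setAppName("subsequent-queries")
  // Placeholder URI; points the connector at the source collection.
  .set("spark.mongodb.input.uri", "mongodb://host:27017/mydb.events")
val sc = new SparkContext(conf)

// Initial aggregation pipeline; each resulting document carries the
// keys needed to build a follow-up query.
val aggregated = MongoSpark.load(sc)
  .withPipeline(Seq(Document.parse("""{ $match: { status: "open" } }""")))
```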
My question is which option is better:
- In the RDD tasks (for example in map()), use the plain MongoDB driver to connect to the MongoDB cluster, fire the subsequent query, await the results, and return them as the result of the map() function (see the first sketch below the list)
- collect() the previously created queries to bring them to the driver, then create an aggregation pipeline RDD for each of these queries (second sketch below)
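A hedged sketch of the first option, using the synchronous MongoDB Java driver inside mapPartitions() (the URI, database/collection names, and the refId field are assumptions for illustration; `aggregated` is the RDD from the sketch above):

```scala
import com.mongodb.client.MongoClients
import org.bson.Document
import scala.collection.JavaConverters._

val followUpResults = aggregated.mapPartitions { docs =>
  // One client per partition rather than per document keeps
  // connection churn on the Executors manageable.
  val client = MongoClients.create("mongodb://host:27017") // placeholder URI
  val coll = client.getDatabase("mydb").getCollection("details")
  // Materialize the results before closing the client, because
  // find() returns a lazy cursor.
  val out = docs.flatMap { doc =>
    coll.find(new Document("refId", doc.get("_id"))).iterator().asScala
  }.toList
  client.close()
  out.iterator
}
```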
The second approach, however, is not asynchronous: the driver blocks on collect() until all results have been gathered. Calling collectAsync() does not block, but as far as I know, accessing intermediate results is not possible either.
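And a sketch of the second option, which stays entirely on the driver (the someKey/refId field names are assumptions; each follow-up RDD here reads from the collection configured in spark.mongodb.input.uri, though a ReadConfig could target a different one):

```scala
import com.mongodb.spark.MongoSpark
import org.bson.Document

// collect() blocks the driver until every query document has arrived.
val queryDocs = aggregated.collect()

// Build one follow-up aggregation pipeline RDD per collected document.
val followUps = queryDocs.map { doc =>
  val matchStage = new Document("$match",
    new Document("refId", doc.get("someKey"))) // assumed field names
  MongoSpark.load(sc).withPipeline(Seq(matchStage))
}
```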
Any hints on this problem? I wasn't sure whether to set the issue type to improvement or question; I chose improvement because I thought it was a bit of both, and the documentation could use a few words on this matter.