Implement the ability to specify multiple collections in the 'MongoDB Spark Connector' connection in Streaming mode.
Currently, the MongoDB Spark Connector is limited to interacting with a single MongoDB collection per read or write operation. As a result, it does not natively support reading from or writing to multiple collections in a single operation. Working around this requires creating a new Spark Connector connection for each collection, which results in a poor developer experience and, when the number of collections is large, prevents developers from using streaming mode at all.
The workaround we are currently providing to customers involves looping over the list of collections they want to read from and, for each collection, using the MongoDB Spark Connector to read the data into Spark. This requires the customer to stand up a separate service to implement the looping logic and to manage concerns such as connection failures, timeouts, etc. A sketch of this workaround is shown below.
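A minimal PySpark sketch of the workaround described above, assuming batch reads with connector 10.x-style options; the connection URI, database name, and collection names are placeholders, and exact option keys may vary by connector version:

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-collection-workaround").getOrCreate()

# Hypothetical database and collection names, used only for illustration.
collections = ["orders", "invoices", "shipments"]

frames = []
for coll in collections:
    # One connector read per collection; each read sets up its own connection.
    df = (spark.read
          .format("mongodb")
          .option("connection.uri", "mongodb://localhost:27017")  # placeholder URI
          .option("database", "sales")
          .option("collection", coll)
          .load())
    frames.append(df)

# Merge the per-collection DataFrames; schemas must be compatible for the union.
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)
```

The customer must also handle retries, connection failures, and timeouts around this loop themselves, which is the operational burden this feature request aims to remove.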
Given that the Spark Connector in streaming mode uses MongoDB change streams, it should be possible to watch all the collections of a given database through a single Spark Connector connection. Furthermore, if the API were extended to allow connecting to multiple databases in the cluster when setting up a MongoDB Spark Connector connection in streaming mode, it would provide significant additional benefit. A hypothetical shape for such an API is sketched below.
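A hypothetical sketch of what a multi-collection streaming read could look like. Accepting a comma-separated list or a wildcard for the collection option is the proposed behavior, not an existing connector feature, and the option keys and names shown are illustrative placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-collection-stream").getOrCreate()

# Proposed usage (not current behavior): a single streaming read that watches
# several collections, or every collection in the database via a wildcard,
# backed by a single change stream connection.
stream_df = (spark.readStream
             .format("mongodb")
             .option("connection.uri", "mongodb://localhost:27017")  # placeholder URI
             .option("database", "sales")
             .option("collection", "orders,invoices,shipments")      # or "*" for all
             .load())

# Write the combined change events somewhere for inspection.
query = (stream_df.writeStream
         .format("console")
         .outputMode("append")
         .start())
```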
Example customer request: https://www.mongodb.com/community/forums/t/pyspark-get-list-of-collections/225327
The customer asks: "I would like to execute a query across multiple collections but avoid creating a new spark read session each time I do so."
- links to