Spark Connector / SPARK-94

Default partitioner that uses the aggregation pipeline to calculate partitions

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 1.1.0, 2.0.0
    • Component/s: None

      The default Partitioner strategy ignores the aggregation pipeline completely, causing a lot of empty partitions that slow down the Spark application.

      There should be a Partitioner strategy that can create partitions based on the aggregation pipeline that will be executed; a rough sketch of the idea follows.
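      For illustration, a pipeline-aware strategy might look roughly like the sketch below. The pipeline parameter on partitions() is precisely what this ticket proposes; the MongoPartitioner trait in 1.1.0/2.0.0 does not receive it, so the signature, the selectivity check, and the MongoPartition construction are illustrative, not the shipped API:

      {code:scala}
      import com.mongodb.spark.MongoConnector
      import com.mongodb.spark.config.ReadConfig
      import com.mongodb.spark.rdd.partitioner.{MongoPartition, MongoPartitioner}
      import org.bson.BsonDocument

      // Illustrative sketch: derive the partition layout from the pipeline
      // itself instead of from raw collection statistics.
      class PipelineAwarePartitioner extends MongoPartitioner {

        override def partitions(connector: MongoConnector,
                                readConfig: ReadConfig,
                                pipeline: Array[BsonDocument]): Array[MongoPartition] = {
          // Selective stages such as $match or $group can shrink the result
          // by orders of magnitude, so splitting by raw collection size
          // would only schedule empty tasks.
          val selective = pipeline.exists(stage =>
            stage.containsKey("$match") || stage.containsKey("$group"))

          if (selective) {
            // One partition over the full key range: a single task runs the
            // pipeline once and reads the (small) result.
            Array(MongoPartition(0, new BsonDocument(), Nil))
          } else {
            // A full-collection scan would fall back to the existing
            // size-based splitting; that path is omitted from this sketch.
            Array(MongoPartition(0, new BsonDocument(), Nil))
          }
        }
      }
      {code}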

      In the Spark WebUI of my application, I can see a lot of partitions being created and processed even though they are completely empty. My aggregation trims the expected number of documents from several hundred million down to just hundreds, but the current default partitioner seems to ignore this and builds 242 partitions. See the attached images for clarification.
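      A minimal sketch of the kind of job described above, using the connector's RDD API (the URI, database/collection, and the $match stage are placeholders for my actual setup):

      {code:scala}
      import com.mongodb.spark._
      import org.apache.spark.{SparkConf, SparkContext}
      import org.bson.Document

      val conf = new SparkConf()
        .setAppName("pipeline-partitions")
        .set("spark.mongodb.input.uri", "mongodb://host:27017/db.events") // placeholder URI

      val sc = new SparkContext(conf)

      // Attach the aggregation pipeline to the MongoRDD; this $match stands
      // in for the pipeline that cuts several hundred million documents
      // down to a few hundred.
      val rdd = MongoSpark.load(sc)
        .withPipeline(Seq(Document.parse("""{ "$match": { "status": "ERROR" } }""")))

      // The partition count is derived from the whole collection, not from
      // the pipeline output, so this reports a number sized to the raw
      // collection (242 in the run above) although most partitions are empty.
      println(rdd.partitions.length)
      {code}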

      After calling saveAsTextFile on the RDD, I can see that a lot of these partitions were in fact empty.

      I am using Spark Connector version 1.1.0 because we run Spark 1.6.1, but this feature is also not available in 2.0.0.
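      Continuing the sketch above, a workaround until partitioning takes the pipeline into account is to collapse the partitions with Spark's coalesce before writing; the aggregation has already reduced the data to a few hundred documents, so this is cheap (the target of 8 partitions and the output path are arbitrary):

      {code:scala}
      // coalesce merges existing partitions without a shuffle, so the
      // hundreds of empty tasks disappear before saveAsTextFile runs.
      rdd.coalesce(8)
        .saveAsTextFile("hdfs:///tmp/aggregated-output") // placeholder path
      {code}

      The connector documentation also lists alternative partitioners (for example MongoSamplePartitioner) selectable via the spark.mongodb.input.partitioner setting, but as far as I can tell none of them consults the pipeline, which is the gap this ticket describes.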

        1. partitions.PNG (82 kB)
        2. mongo-spark3.PNG (38 kB)

            Assignee:
            Unassigned
            Reporter:
            j9dy F H
            Votes:
            0
            Watchers:
            2
