Spark Connector / SPARK-73

What is the behaviour of data locality awareness and how do I configure it?

    • Type: Task
    • Resolution: Done
    • Priority: Critical - P2
    • Affects Version/s: 1.0.0
    • Environment:
      Windows Server 2008 R2 Standard

    • Description:

      I want to use the Spark Connector with a Spark cluster and a sharded MongoDB deployment. According to the Spark Connector documentation, it provides data locality awareness, described as "The Spark connector is aware of which MongoDB partitions are storing data."

      Question 1: What is the correct behaviour of this feature? In my test environment there are two machines; each hosts a MongoDB shard and also runs a Spark worker. When I create an RDD with JavaMongoRDD<Document> items = MongoSpark.load(sc, readConfig); I find that the partitions are not always computed on the worker running on the same machine as the MongoDB shard that holds the data. The placement appears random.
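      To see where Spark intends to run each partition, I used roughly the check below. This is a hypothetical fragment (not part of my original job) that continues from the items RDD above and prints the preferred locations the connector attaches to each partition, which is what Spark's scheduler uses for locality:

      // Hypothetical locality check; requires: import org.apache.spark.Partition;
      // Spark tries to schedule each task on one of the hosts listed for its partition.
      for (Partition p : items.rdd().partitions()) {
          System.out.println("partition " + p.index()
                  + " prefers " + items.rdd().preferredLocations(p));
      }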

      Question 2: How do I configure this feature? I have searched for solutions and tested the configurations below, but neither works (a full sketch of how I apply these overrides follows the list):
      1. readOverrides.put("readpreference.name", "nearest"); — this seems intended for failover/disaster recovery rather than locality.
      2. readOverrides.put("partitioner", "MongoShardedPartitioner"); — has no visible effect.

            Assignee:
            Unassigned
            Reporter:
            xiefeng
            Votes:
            0
            Watchers:
            3
