Spark Connector / SPARK-241

Ignore duplicates on Save

    • Type: Task
    • Resolution: Done
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: Writes

      I have a CSV file that I need to save to MongoDB. My collection already has some data (a couple million documents), and I need to save only the new records to the database, ignoring the ones that are already in the collection.

      How can I do that? I already have the code below:

       

      // Write overrides: target collection; replaceDocument=false updates only
      // the fields present in the Dataset instead of replacing whole documents;
      // ordered=false lets the batch continue past individual write failures.
      Map<String, String> writeOverrides = new HashMap<String, String>();
      writeOverrides.put("collection", this.collection);
      writeOverrides.put("replaceDocument", "false");
      writeOverrides.put("ordered", "false");
      WriteConfig writeConfig = WriteConfig.create(getJavaSparkContext()).withOptions(writeOverrides);
      MongoSpark.save(ds.write().mode(SaveMode.Ignore), writeConfig);
      

       

       

      I've already tried all the SaveModes and none of them worked the way I need.

       

      PS: I'm only using _id as the index.
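      One common workaround (a sketch, not from this ticket): read the _ids that already exist in the collection, subtract them from the incoming batch, and save only the remainder; in Spark terms this is a left-anti join on _id before calling MongoSpark.save. The minimal self-contained Java illustration below shows just the diff step, with a HashSet standing in for the collection's existing _ids; the class name DedupSketch and method filterNew are hypothetical, for illustration only.

      ```java
      import java.util.*;

      public class DedupSketch {
          // Return only the candidate _ids that are not already in the collection.
          // In a real job, existingIds would be loaded from MongoDB (e.g. a
          // projection on _id); here a plain HashSet stands in for it.
          static List<String> filterNew(Set<String> existingIds, List<String> candidateIds) {
              List<String> fresh = new ArrayList<>();
              for (String id : candidateIds) {
                  if (!existingIds.contains(id)) {
                      fresh.add(id);
                  }
              }
              return fresh;
          }

          public static void main(String[] args) {
              Set<String> existing = new HashSet<>(Arrays.asList("a", "b"));
              List<String> incoming = Arrays.asList("a", "c", "d");
              System.out.println(filterNew(existing, incoming)); // prints [c, d]
          }
      }
      ```

      The same idea scales in Spark: join the CSV Dataset against the collection's _ids with a "left_anti" join, then save only the surviving rows, so existing documents are never touched.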

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            pedro.dib Pedro Dib
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: