  1. Kafka Connector
  2. KAFKA-131

Copy existing configuration with pipeline

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 1.3.0
    • Affects Version/s: 1.2.0
    • Component/s: Source
    • None
    • Environment:
      Kafka Connector: 1.2.0
      MongoDb version: 3.6.17
    • Needed

      Added a new configuration:

      copy.existing.pipeline=[{"$match": {"closed": "false"}}]

      An inline JSON array with objects describing the pipeline operations to run when copying existing data. This can improve the use of indexes by the copying manager and make copying more efficient.

      Use this option when the `pipeline` configuration filters collection data, so that the same filtering is applied while copying and the copying process is faster.
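
      As a rough sketch of how the new option might sit alongside the existing ones, the fragment below pairs `copy.existing.pipeline` with a matching `pipeline` filter. The `connection.uri`, `database` and `collection` values are placeholders rather than anything from this issue, and the `fullDocument.closed` path assumes that `pipeline` sees change stream events while `copy.existing.pipeline` sees the collection documents directly.

      ```properties
      # Placeholder connection details (assumptions, not from this issue).
      connector.class=com.mongodb.kafka.connect.MongoSourceConnector
      connection.uri=mongodb://localhost:27017
      database=exampleDb
      collection=exampleCollection

      # Copy the existing documents before streaming changes.
      copy.existing=true

      # Assumed to run against the collection documents themselves,
      # so the field is referenced directly and can use an index.
      copy.existing.pipeline=[{"$match": {"closed": "false"}}]

      # Assumed to run against change stream events,
      # so the same field sits under fullDocument.
      pipeline=[{"$match": {"fullDocument.closed": "false"}}]
      ```

      Keeping the two filters equivalent means the copied data and the streamed changes should cover the same subset of documents.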


      We are trying to copy existing data from huge collections (around 6 million documents). We need only a specific subset of the data, not all of it, so in the configuration we provide a pipeline similar to:

      "pipeline": "[
        { $project: { "updateDescription":0 } }, 
        { $match: {"fullDocument.createdDate":{ "$gt": ISODate("2019-03-31T13:44:54.791Z"), "$lt": ISODate("2020-07-23T13:44:54.791Z")} } } 
      ]". 
      

      MongoDB logs show that the lookup is very expensive. From the connector code (https://github.com/mongodb/mongo-kafka/blob/master/src/main/java/com/mongodb/kafka/connect/source/MongoCopyDataManager.java#L147), the provided pipeline configuration appears to be added at the end of the aggregation, so the connector reads the entire collection and only then applies the filter. Is there an option or a way to add the provided pipeline configuration at the beginning of the list instead?

      Also, please let us know of any other configuration options that would make copying existing data more efficient. Thanks.
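
      For the date-range filter described above, a copy-time equivalent might look like the sketch below. It assumes `createdDate` is stored as a BSON date at the top level of each document, and that the connector's JSON parsing accepts the Extended JSON `$date` form. Neither assumption is confirmed in this issue.

      ```properties
      # Hypothetical copy-time filter mirroring the reporter's $match stage.
      # The field is referenced without the fullDocument prefix because the
      # stage is assumed to run on the collection documents, not on events.
      copy.existing=true
      copy.existing.pipeline=[{"$match": {"createdDate": {"$gt": {"$date": "2019-03-31T13:44:54.791Z"}, "$lt": {"$date": "2020-07-23T13:44:54.791Z"}}}}]
      ```

      An index on `createdDate` would be needed for such a leading `$match` stage to avoid a full collection scan.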

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            sabari.mgn@gmail.com Sabari Gandhi
            Votes:
            1
            Watchers:
            4
