Spark Connector / SPARK-393

Broken filter on column read from a MongoDB field with dash characters

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 10.2.0
    • Affects Version/s: 10.1.0
    • Component/s: None

      Hello,

      TL;DR: when loading a MongoDB collection in Spark with the Mongo Spark connector 10.1.0 or 10.1.1, I cannot filter on columns whose names contain one or more dashes (e.g. "my-column-name"): the filtering operation always leaves me with no data at all. This was working fine with the Mongo Spark connector 10.0.5.

      We recently moved to Mongo Spark connector 10.1.0 and had to revert to 10.0.5 as it caused our Spark job to behave abnormally.

      We're coding in Java 17, using Scala 2.12 and Spark 3.3.1, and reading data from MongoDB collections. To summarise our investigations, we observe the following behaviour (omitting unnecessary details like the "_id" field):

      // (this assumes a static import of org.apache.spark.sql.functions.col)
      var data = sparkSession.read()
                             .format("mongodb")
                             // ...
                             .load();
      data.show();
      // The data we're using...
      // +----+--------+
      // |okay|not-okay|
      // +----+--------+
      // |null|foo     |
      // |bar |    null|
      // +----+--------+
      data.filter(col("okay").isNotNull()).show();
      // This filters data as expected...
      // +----+--------+
      // |okay|not-okay|
      // +----+--------+
      // |bar |    null|
      // +----+--------+
      data.filter(col("not-okay").isNotNull()).show();
      // But not if the column name has a dash!
      // +----+--------+
      // |okay|not-okay|
      // +----+--------+
      // +----+--------+ 

      The behaviour is the same whether I use "filter" or "where", and expressing the condition as a Column or as a string leaves me with no data either way. Renaming the column with "withColumnRenamed" doesn't help. Creating a new column with "withColumn" does help, so it feels like the original column is somehow corrupted. Using "coalesce(col("not-okay"), lit(false))" also helps.
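
      For reference, here's a minimal sketch of the "withColumn" workaround mentioned above (the helper column name "not_okay_copy" is just an illustrative name of mine, and it assumes a static import of org.apache.spark.sql.functions.col):

      // Copy the dashed column under a dash-free name and filter on the copy;
      // unlike filtering on "not-okay" directly, this returns the expected row.
      var copied = data.withColumn("not_okay_copy", col("not-okay"));
      copied.filter(col("not_okay_copy").isNotNull()).show();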

      We have many MongoDB collections in which field keys contain one or more dashes, and it is really impractical to change the collections (we don't own them) or to hack our Spark jobs to avoid this.

      I kind of feel helpless after having spent a couple of days figuring out what was going on with our Spark jobs. I hope this issue will help!

      Regards,

      Sébastien

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            sebastien.burton@soprabanking.com Sébastien Burton
            Votes:
            0
            Watchers:
            4
