- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 10.1.0
- Component/s: None
Hello,
TL;DR: when loading a MongoDB collection in Spark with Mongo Spark connector 10.1.0 or 10.1.1, I cannot filter on columns whose names contain one or more dashes (e.g. "my-column-name"); the filtering operation always leaves me with no data at all. This was working fine with Mongo Spark connector 10.0.5.
We recently moved to Mongo Spark connector 10.1.0 and had to revert to 10.0.5 because the upgrade made our Spark jobs behave abnormally.
We're coding in Java 17, using Scala 2.12 and Spark 3.3.1, and reading data from MongoDB collections. To summarise our investigations, we observe the following behaviour (omitting unnecessary details like the "_id" field):
var data = sparkSession.read()
    .format("mongodb")
    // ...
    .load();

// The data we're using...
data.show();
// +----+--------+
// |okay|not-okay|
// +----+--------+
// |null|     foo|
// | bar|    null|
// +----+--------+

// This filters data as expected...
data.filter(col("okay").isNotNull()).show();
// +----+--------+
// |okay|not-okay|
// +----+--------+
// | bar|    null|
// +----+--------+

// But not if the column name has a dash!
data.filter(col("not-okay").isNotNull()).show();
// +----+--------+
// |okay|not-okay|
// +----+--------+
// +----+--------+
The behaviour is the same whether I use "filter" or "where", and whether the condition is expressed as a column or as a string: the result is always empty. Renaming the column with "withColumnRenamed" doesn't help, but creating a new column with "withColumn" does, so it feels as if the original column is somehow corrupted. Wrapping the column with "coalesce(col("not-okay"), lit(false))" also helps (see the sketch below).
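For reference, a minimal sketch of these observations, continuing from the "data" Dataset in the snippet above (the "notOkay"/"notOkayCopy" names and the equalTo("foo") predicate are illustrative assumptions, not part of the report):

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

// Renaming the dashed column does NOT help: the filter still returns no rows.
data.withColumnRenamed("not-okay", "notOkay")
    .filter(col("notOkay").isNotNull())
    .show(); // empty

// Recreating the column with withColumn DOES help.
data.withColumn("notOkayCopy", col("not-okay"))
    .filter(col("notOkayCopy").isNotNull())
    .show(); // rows are back

// Wrapping the column in coalesce also helps. The report uses
// coalesce(col("not-okay"), lit(false)); lit("") is substituted here so the
// types line up with the string column of the toy example.
data.filter(coalesce(col("not-okay"), lit("")).equalTo("foo")).show();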
We have many MongoDB collections whose field keys contain one or more dashes, and it is really impractical to change the collections (we don't own them) or to rework our Spark jobs to avoid this.
I kind of feel helpless after spending a couple of days figuring out what was going on with our Spark jobs. I hope this issue will help!
Regards,
Sébastien
- related to: SPARK-432 "Broken reading filter on structures field with dash character" (Closed)