Spark Connector / SPARK-327

Support for handling Corrupt/Bad records on spark read

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: 10.4.0
    • Affects Version/s: 3.0.1
    • Component/s: Reads, Schema
    • Documentation Changes: Needed

      1. What would you like to communicate to the user about this feature?

      Added two new read configuration options: `mode` and `columnNameOfCorruptRecord`.

      The `mode` configuration allows different parsing strategies when handling documents that don't match the expected schema during reads.

      The options are:

      • `FAILFAST` (default) throws an exception when parsing a document that doesn't match the schema.
      • `PERMISSIVE` sets any invalid fields to `null`.
        Combine with the `columnNameOfCorruptRecord` configuration to store invalid documents
        as an extended JSON string.
      • `DROPMALFORMED` ignores the whole document.
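
      A minimal PySpark sketch of the `mode` option. It assumes an existing `SparkSession` named `spark` that is already configured with the connector (e.g. via `spark.mongodb.read.connection.uri`); the database and collection names are hypothetical:

      ```python
      # Sketch only: read a MongoDB collection with an explicit schema,
      # dropping any document that fails to parse against it.
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      schema = StructType([
          StructField("name", StringType()),
          StructField("age", IntegerType()),
      ])

      df = (
          spark.read.format("mongodb")
          .option("database", "test")        # hypothetical database name
          .option("collection", "people")    # hypothetical collection name
          .option("mode", "DROPMALFORMED")   # silently skip non-matching documents
          .schema(schema)
          .load()
      )
      ```

      With `mode` left at its `FAILFAST` default, the same read would instead raise an exception on the first non-matching document.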

      Adds the `columnNameOfCorruptRecord` configuration, which extends the `PERMISSIVE` mode. When configured, it
      saves the whole invalid document as extended JSON in that column, as long as the column is defined in the schema. Inferred
      schemas will add the `columnNameOfCorruptRecord` column if it is set and the `mode` is `PERMISSIVE`.
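
      A hedged sketch of `PERMISSIVE` mode combined with `columnNameOfCorruptRecord`. Again it assumes a pre-configured `SparkSession` named `spark`; the `_corrupt` column name is a hypothetical choice, and note it must appear in the explicit schema for invalid documents to be captured:

      ```python
      # Sketch only: keep rows for invalid documents, storing the raw
      # document as extended JSON in the declared "_corrupt" column.
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      schema = StructType([
          StructField("name", StringType()),
          StructField("age", IntegerType()),
          StructField("_corrupt", StringType()),  # receives the invalid document
      ])

      df = (
          spark.read.format("mongodb")
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt")
          .schema(schema)
          .load()
      )

      # Rows that parsed cleanly have _corrupt == null; invalid rows carry
      # the original document as an extended JSON string.
      ```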

      Note: The names derive from the existing Spark JSON configurations, which inspired this feature.

      3. Which versions of the driver/connector does this apply to?
      10.4.0


      Summary

      A `MongoTypeConversionException` is thrown during a Spark read when bad/corrupt fields are present in a large collection. Adding support for modes like `PERMISSIVE` or `DROPMALFORMED` as Mongo Spark options will allow MongoSpark reads to complete successfully.

      Motivation

      Who is the affected end user?

      Big data management companies

      How does this affect the end user?

      The DataFrame read operation breaks in the presence of corrupt records.

      How likely is it that this problem or use case will occur?

      Any huge Mongo collection holding unstructured data, where scanning the entire collection to infer a schema results in performance overhead.

      Whenever an explicit schema is passed during a Spark DataFrame read.

      If the problem does occur, what are the consequences and how severe are they?

      Severe: the Spark read fails with `MongoTypeConversionException` even when only one corrupt record is present in a collection of thousands of rows.

      Is this issue urgent?

      Yes; it will prevent read operations from breaking.

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      Is this ticket only for tests?

      No

      Cast of Characters

      Engineering Lead:
      Document Author:
      POCers:
      Product Owner:
      Program Manager:
      Stakeholders:

      Channels & Docs

      Slack Channel

      [Scope Document|some.url]

      [Technical Design Document|some.url]

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            santhoshsuresh95@gmail.com Santhosh Suresh
            Votes:
            1
            Watchers:
            2

              Created:
              Updated:
              Resolved: