Spark Connector / SPARK-327

Support for handling Corrupt/Bad records on spark read

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: 10.4.0
    • Affects Version/s: 3.0.1
    • Component/s: Reads, Schema
    • Documentation Changes: Needed

      1. What would you like to communicate to the user about this feature?

      Added two new read configuration options: `mode` and `columnNameOfCorruptRecord`.

      The `mode` configuration allows different parsing strategies when handling documents that don't match the expected schema during reads.

      The options are:

      • `FAILFAST` (default) throws an exception when parsing a document that doesn't match the schema.
      • `PERMISSIVE` sets any invalid fields to `null`.
        Combine with the `columnNameOfCorruptRecord` configuration to store invalid documents
        as an extended JSON string.
      • `DROPMALFORMED` ignores the whole document.
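
      A minimal PySpark sketch of the `mode` option. It assumes an existing `SparkSession` named `spark` that is already configured with the connector (e.g. via `spark.mongodb.read.connection.uri`); the database and collection names are hypothetical:

      ```python
      # Sketch only: read a MongoDB collection with an explicit schema,
      # dropping any document that fails to parse against it.
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      schema = StructType([
          StructField("name", StringType()),
          StructField("age", IntegerType()),
      ])

      df = (
          spark.read.format("mongodb")
          .option("database", "test")        # hypothetical database name
          .option("collection", "people")    # hypothetical collection name
          .option("mode", "DROPMALFORMED")   # silently skip non-matching documents
          .schema(schema)
          .load()
      )
      ```

      With `mode` left at its `FAILFAST` default, the same read would instead raise an exception on the first non-matching document.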

      Adds the `columnNameOfCorruptRecord` configuration, which extends the `PERMISSIVE` mode. When configured, it
      saves the whole invalid document as extended JSON in that column, as long as the column is defined in the schema. Inferred
      schemas will add the `columnNameOfCorruptRecord` column if it is set and the `mode` is `PERMISSIVE`.
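
      A hedged sketch of `PERMISSIVE` mode combined with `columnNameOfCorruptRecord`. Again it assumes a pre-configured `SparkSession` named `spark`; the `_corrupt` column name is a hypothetical choice, and note it must appear in the explicit schema for invalid documents to be captured:

      ```python
      # Sketch only: keep rows for invalid documents, storing the raw
      # document as extended JSON in the declared "_corrupt" column.
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      schema = StructType([
          StructField("name", StringType()),
          StructField("age", IntegerType()),
          StructField("_corrupt", StringType()),  # receives the invalid document
      ])

      df = (
          spark.read.format("mongodb")
          .option("mode", "PERMISSIVE")
          .option("columnNameOfCorruptRecord", "_corrupt")
          .schema(schema)
          .load()
      )

      # Rows that parsed cleanly have _corrupt == null; invalid rows carry
      # the original document as an extended JSON string.
      ```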

      Note: The names derive from the existing Spark JSON configurations, which inspired this feature.

      3. Which versions of the driver/connector does this apply to?
      10.4.0


      Summary

      A `MongoTypeConversionException` is thrown during a Spark read when bad/corrupt fields are present in a large collection. Adding support for modes like `PERMISSIVE` or `DROPMALFORMED` as Mongo Spark options will allow MongoSpark reads to complete successfully.

      Motivation

      Who is the affected end user?

      Big data management companies

      How does this affect the end user?

      The DataFrame read operation breaks in the presence of corrupt records.

      How likely is it that this problem or use case will occur?

      Any huge Mongo collection holding unstructured data, where scanning the entire collection to infer a schema results in performance overhead.

      Whenever an explicit schema is passed during a Spark DataFrame read.

      If the problem does occur, what are the consequences and how severe are they?

      Severe: the Spark read fails with `MongoTypeConversionException` even when only one corrupt record is present in a collection of thousands of rows.

      Is this issue urgent?

      Yes; it will prevent read operations from breaking.

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      Is this ticket only for tests?

      No

      Cast of Characters

      Engineering Lead:
      Document Author:
      POCers:
      Product Owner:
      Program Manager:
      Stakeholders:

      Channels & Docs

      Slack Channel

      [Scope Document|some.url]

      [Technical Design Document|some.url]

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            santhoshsuresh95@gmail.com Santhosh Suresh
            Votes:
            1
            Watchers:
            2

              Created:
              Updated:
              Resolved: