  1. Spark Connector
  2. SPARK-375

Failed to infer the schema of an array field when some documents contain an empty array

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Unknown
    • 10.1.0
    • Affects Version/s: None
    • Component/s: None
    • None

      What did I do

      I use the MongoDB Spark Connector to dump data from MongoDB to Databricks.

      I have two records in MongoDB:

      properties
      [{kind: 234}, {value: "orange"}, {_id: "abc"}]
      []

      The schema of this column is inferred as an array of StringType.

      What do I want

      The schema of this column should be inferred as an array of StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true)).
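
The requested behavior can be sketched in plain Python. This is only an illustration of the idea, not the connector's actual implementation: the type encoding and the names `infer_type`, `merge_types`, and `infer_schema` are hypothetical. The key assumption is that an empty array yields an "unknown" element type (`None`) that gives way to whatever other documents reveal, and that falling back to string happens only on a genuine conflict.

```python
from typing import Any, Iterable

# Hypothetical type encoding, for illustration only:
#   None                       -> unknown (no evidence seen yet)
#   "int", "string", ...       -> scalar types
#   {"struct": {name: type}}   -> struct with named fields
#   {"array": element_type}    -> array of element_type


def infer_type(value: Any) -> Any:
    """Infer a type from a single value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {"struct": {k: infer_type(v) for k, v in value.items()}}
    if isinstance(value, list):
        element = None  # an empty list leaves the element type unknown
        for item in value:
            element = merge_types(element, infer_type(item))
        return {"array": element}
    return "string"  # assumed fallback for anything else


def merge_types(a: Any, b: Any) -> Any:
    """Merge two inferred types; an unknown type never overrides a known one."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, field_type in b["struct"].items():
                fields[name] = merge_types(fields.get(name), field_type)
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": merge_types(a["array"], b["array"])}
    return "string"  # assumed fallback when the types genuinely conflict


def infer_schema(values: Iterable[Any]) -> Any:
    """Merge the inferred types of all sampled values for one column."""
    merged = None
    for value in values:
        merged = merge_types(merged, infer_type(value))
    return merged


# The two `properties` values from this ticket.
sample = [
    [{"kind": 234}, {"value": "orange"}, {"_id": "abc"}],
    [],
]
schema = infer_schema(sample)
```

With this rule the empty array contributes nothing to the element type, so the result is an array of a struct with the fields `kind`, `value`, and `_id`, regardless of the order in which the two records are sampled.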

      Why do I need it

      I need to dump data from MongoDB into a Databricks table batch by batch.

      Currently the column is inferred as an array of strings in one batch but as an array of structs in another. As a result, I get an error when I try to merge the two batches:

      AnalysisException: Failed to merge fields 'xxx' and 'xxx'. Failed to merge incompatible data types StringType and StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true))
      

      I want a consistent schema between batches.

      Having https://jira.mongodb.org/projects/SPARK/issues/SPARK-365 may help resolve this issue.

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            me@kytse.com Kit Yam Tse
            Votes:
            0
            Watchers:
            2
