failed to infer schema of array field if there is data with empty array value


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Unknown
    • 10.1.0
    • Affects Version/s: None
    • Component/s: None

      What did I do

      I use the MongoDB Spark Connector to dump data from MongoDB to Databricks.

      I have two records in MongoDB, with the following values in the properties field:

      properties
      [{kind: 234}, {value: "orange"}, {_id: "abc"}]
      []

      The schema of this column is inferred as an array of StringType.

      What do I want

      The schema of this column should be inferred as an array of StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true)), even when some records hold an empty array.
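
      The expected behaviour can be sketched in plain Python. This is a simplified model and not the connector's implementation: the type descriptions ("int", "string", {"struct": ...}, {"array": ...}) and the function names infer_type, merge_types and infer_column_type are hypothetical. The point it illustrates is that an empty array carries no information about its element type, so it should not override the struct type inferred from other records.

```python
from typing import Any, Iterable, Optional

# Hypothetical, simplified type descriptions (not Spark or connector types):
#   "int", "string", "boolean", {"struct": {field: type}}, {"array": element_type}
# None means "unknown so far", e.g. the element type of an empty array.


def merge_types(a: Optional[Any], b: Optional[Any]) -> Optional[Any]:
    """Merge two inferred types, letting a known type win over an unknown one."""
    if a is None:
        return b
    if b is None:
        return a
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            # Union of the fields, merging the types of fields seen on both sides.
            fields = dict(a["struct"])
            for name, field_type in b["struct"].items():
                fields[name] = merge_types(fields.get(name), field_type)
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": merge_types(a["array"], b["array"])}
    # Assumption: genuinely conflicting types fall back to string.
    return "string"


def infer_type(value: Any) -> Optional[Any]:
    """Infer a simplified type description for a single value."""
    if value is None:
        return None
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {"struct": {k: infer_type(v) for k, v in value.items()}}
    if isinstance(value, list):
        element_type = None  # stays unknown for an empty array
        for item in value:
            element_type = merge_types(element_type, infer_type(item))
        return {"array": element_type}
    return "string"  # assumption: anything else is treated as a string


def infer_column_type(values: Iterable[Any]) -> Optional[Any]:
    """Infer one type for a column by merging the types of all sampled values."""
    merged = None
    for value in values:
        merged = merge_types(merged, infer_type(value))
    return merged


# The two records from this ticket: one populated array and one empty array.
properties = [
    [{"kind": 234}, {"value": "orange"}, {"_id": "abc"}],
    [],
]
inferred = infer_column_type(properties)
```

      In this model the empty array yields {"array": None}, which merges into the struct-based array type instead of forcing a fallback to string, regardless of the order in which the records are sampled.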

      Why do I need it

      I need to dump data from MongoDB into a Databricks table batch by batch.

      Currently the column is inferred as an array of strings in one batch but as an array of structs in another. As a result, I get an error when I try to merge the two batches:

      AnalysisException: Failed to merge fields 'xxx' and 'xxx'. Failed to merge incompatible data types StringType and StructType(StructField(kind,IntegerType,true),StructField(value,StringType,true),StructField(_id,StringType,true))
      

      I want to have a consistent schema between batches.

      Having https://jira.mongodb.org/projects/SPARK/issues/SPARK-365 may help resolve this issue.

            Assignee:
            Ross Lawley
            Reporter:
            Kit Yam Tse
