  1. Spark Connector
  2. SPARK-40

Schema inference on structs in an array doesn't merge schemas

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 0.2
    • Affects Version/s: None
    • Component/s: Schema
    • Labels: None

      This issue concerns schema inference on arrays of objects.

      If I have records in the form of:

      {a:1, b:[{c:2, d:3, e:4}]}
      {a:2, b:[{c:2, e:4, f:5}]}
      

      This works fine: schema inference creates a schema with nullable fields c, d, e, f.

      However, if a single array contains another object whose schema differs from the others, I get ConflictType,
      e.g.

      {a:2, b:[{c:2, f:5, g:6}, {c:3, h:3, z:1}]}
      

      I would expect the correct behavior in this case to still be an array type of structs with nullable fields c, d, e, f, g, h, z.

      I was able to fix this locally by changing

                  case false => {
                    val areEqual: Boolean = arrayType == previous
                    if (!areEqual) arrayType = Some(ConflictType) 
                    areEqual
                  }
      

      In getSchemaFromArray to:

                  case false => {
                    val areEqual: Boolean = arrayType == previous
                    if (!areEqual) arrayType = Some(compatibleType(arrayType.get, previous.get))
                    areEqual
                  }
      

      If I understand it correctly, this merges schemas not only across structs from different arrays, but also across structs within a single array.
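
      To illustrate the intended behavior outside the connector, here is a minimal standalone sketch. The types below (DataType, StructType, ConflictType and so on) are simplified stand-ins I made up for this example, not the Spark or connector classes, and this compatibleType is only an assumed approximation of what the real helper does. Nullability is not modeled. Array element types are folded pairwise, so differing structs are merged field by field instead of collapsing to ConflictType:

```scala
// Simplified stand-ins for illustration only; not the Spark SQL or connector classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
case object ConflictType extends DataType
final case class StructType(fields: Map[String, DataType]) extends DataType

object ArraySchemaMerge {
  // Hypothetical analogue of compatibleType: equal types are kept,
  // two structs are merged field by field, anything else is a conflict.
  def compatibleType(a: DataType, b: DataType): DataType = (a, b) match {
    case (x, y) if x == y => x
    case (StructType(f1), StructType(f2)) =>
      val names = f1.keySet ++ f2.keySet
      StructType(names.map { n =>
        // Each name comes from at least one side, so the list is never empty.
        n -> (f1.get(n).toList ++ f2.get(n).toList).reduce(compatibleType)
      }.toMap)
    case _ => ConflictType
  }

  // Element type of an array: fold all element types together instead of
  // giving up as soon as two elements differ. Empty arrays yield None.
  def elementType(elements: Seq[DataType]): Option[DataType] =
    elements.reduceOption(compatibleType)
}

object Example {
  // The two structs from the failing record {c, f, g} and {c, h, z}.
  val first: DataType =
    StructType(Map("c" -> IntegerType, "f" -> IntegerType, "g" -> IntegerType))
  val second: DataType =
    StructType(Map("c" -> IntegerType, "h" -> IntegerType, "z" -> IntegerType))

  // Merged element type containing the fields c, f, g, h, z.
  val merged: Option[DataType] = ArraySchemaMerge.elementType(Seq(first, second))
}
```

      In this sketch a genuine mismatch, such as an integer next to a string, still yields ConflictType; only structs are merged.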

            Assignee:
            Unassigned
            Reporter:
            lokm01 Jan Scherbaum
            Votes:
            0
            Watchers:
            1
