Spark Connector / SPARK-365

Add schemaHints option for inferring schema

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: 10.4.0
    • Affects Version/s: 10.0.4
    • Component/s: None
    • Labels: None
    • Documentation Changes: Needed

      1. What would you like to communicate to the user about this feature?

      Added a schema hints configuration for use when inferring schemas.

      Added a new read configuration: schemaHints.

      Users can now supply a partial schema to enforce the types of certain known fields when inferring the schema for a collection.

      2. Would you like the user to see examples of the syntax and/or executable code and its output?

      The schemaHints option supports the following Spark schema formats:

      DDL: value STRING,count INT
      SQL DDL: STRUCT<value: STRING, count: INT>
      JSON:

      {"type":"struct","fields":[
      {"name":"value","type":"string","nullable":true},
      {"name":"count","type":"integer","nullable":true}]}
      

      Users can create the DDL or JSON schema strings easily using the Spark shell:

      import org.apache.spark.sql.types._

      // Build the partial schema describing only the known fields.
      val mySchema = StructType(Seq(StructField("value", StringType), StructField("count", IntegerType)))

      mySchema.toDDL         // DDL
      mySchema.sql           // SQL DDL
      mySchema.simpleString
      mySchema.json          // JSON
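
      Any of these strings can then be passed directly as the schemaHints value, for example (a sketch reusing mySchema and the session from the example above):

      spark.read
        .format("mongodb")
        .option("schemaHints", mySchema.toDDL)
        .load()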
      

      Or in PySpark:

      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      # Build the partial schema describing only the known fields.
      mySchema = StructType([StructField('value', StringType(), True), StructField('count', IntegerType(), True)])

      mySchema.simpleString()
      mySchema.json()
      

      3. Which versions of the driver/connector does this apply to?

      The upcoming 10.4.0 version.


      Background

      I have a MongoDB collection that has 8M documents with a lot of fields. Half of the documents have a metadata field of string type, and the rest have metadata of object type.

      What is my issue

      When I dump the data from MongoDB to Databricks using the MongoDB Spark connector, sometimes it succeeds and sometimes I get

      com.mongodb.spark.sql.connector.exceptions.DataException: Invalid field: 'uriMetadata'. The dataType 'struct' is invalid for 'BsonString{value='xxxxxxx'}'.
      

      I think the failure is because the connector infers the schema only from documents that have an object-type metadata value. The metadata column then becomes a struct column in Databricks, and the job fails because we can't insert string data into a struct column.

      What do I want

      I would like to have something similar to schemaHints in the MongoDB Spark connector, so that I can provide a schema hint for only the metadata column, suggesting that it be a string column.
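
      With such an option, a minimal sketch of the fix for this case (connection settings omitted) would be:

      // Hint only the problematic field; every other field is still inferred.
      val df = spark.read
        .format("mongodb")
        .option("schemaHints", "metadata STRING")
        .load()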

      What have I considered

      1. Increase sampleSize
        I know I can set the sampleSize to increase the chance that the connector infers the schema from a sample containing a string metadata value. However, it still isn't guaranteed that a string metadata value will be included in the sample, and metadata can still be inferred as a struct column.
      2. Provide the full schema with .schema(my_schema)
        My collection has many fields and a complicated nested schema. We may also introduce new fields to the collection from time to time. It is difficult for me to define a full schema for the collection. As a result, I would like to define the schema only partially, for some fields.

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            me@kytse.com Kit Yam Tse
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved: