Background
I have a MongoDB collection with 8M documents and many fields. In half of the documents the metadata field is a string; in the rest it is an object.
What is my issue
When I dump the data from MongoDB to Databricks using the MongoDB Spark connector, it sometimes succeeds and sometimes fails with:
com.mongodb.spark.sql.connector.exceptions.DataException: Invalid field: 'uriMetadata'. The dataType 'struct' is invalid for 'BsonString{value='xxxxxxx'}'.
I believe the failure happens when the connector infers the schema from a sample that only contains documents whose metadata value is an object. The metadata column then becomes a struct column in Databricks, and the job fails because string data cannot be inserted into a struct column.
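The sampling problem can be illustrated with a small stdlib-only sketch (the infer_type helper and the docs data are hypothetical illustrations, not the connector's actual inference code):

```python
# Minimal illustration of sample-based type inference (hypothetical helper,
# not the connector's code): the inferred type depends entirely on which
# documents happen to land in the sample.

def infer_type(sampled_docs, field):
    """Infer a single Spark-like type name for `field` from a sample."""
    types = {("struct" if isinstance(d[field], dict) else "string")
             for d in sampled_docs if field in d}
    # A real inference pass would try to merge types; here we just report
    # a conflict so the problem is visible.
    return types.pop() if len(types) == 1 else "conflict"

# Half the collection stores metadata as a string, half as an object.
docs = [{"metadata": "plain string"}, {"metadata": {"uri": "xxxxxxx"}}]

# If the sample only sees object-valued docs, metadata is inferred as a
# struct, and the string-valued documents later fail to load.
print(infer_type(docs[1:], "metadata"))  # struct
print(infer_type(docs, "metadata"))      # conflict
```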
What do I want
I would like something similar to schemaHints in the MongoDB Spark connector, so that I can provide a schema hint for only the metadata column, telling the connector to treat it as a string column.
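For comparison, Databricks Auto Loader already exposes this kind of partial hint through cloudFiles.schemaHints; something analogous for the MongoDB connector is what I am after. The snippet below shows the Auto Loader API, not an existing connector option, and the paths are placeholders:

```python
# Databricks Auto Loader lets you pin the type of individual columns while
# the rest of the schema is still inferred -- the behaviour I would like
# from the MongoDB Spark connector. (Paths here are placeholders.)
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Only `metadata` is pinned; all other columns are still inferred.
      .option("cloudFiles.schemaHints", "metadata STRING")
      .option("cloudFiles.schemaLocation", "/tmp/schema-location")
      .load("/tmp/source-path"))
```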
What have I considered
- Increase sampleSize
I know I can set sampleSize higher to increase the chance that the connector infers the schema from a sample containing string metadata values. However, this still does not guarantee that a string metadata value is included in the sample, so metadata can still be inferred as a struct column.
- Provide a full schema with .schema(my_schema)
My collection has many fields and complicated nested schema. We may also introduce new fields to the collections from time to time. It is difficult for me to define a full schema of the collections. As a result, I would just like to partially define the schema for some fields only.
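Concretely, the two options I considered look like this with the v10 connector. This is a sketch rather than a fix; the database/collection names and the sampleSize value are placeholders:

```python
from pyspark.sql.types import StringType, StructField, StructType

# Option 1: raise sampleSize. This only improves the odds -- inference can
# still miss every string-valued metadata document.
df = (spark.read
      .format("mongodb")
      .option("database", "mydb")        # placeholder
      .option("collection", "mycoll")    # placeholder
      .option("sampleSize", 100000)      # connector default is 1000
      .load())

# Option 2: supply an explicit schema, which disables inference entirely.
# Pinning metadata to StringType works, but every other field must also be
# declared -- impractical for a wide, evolving collection.
explicit = StructType([
    StructField("metadata", StringType(), nullable=True),
    # ... every other field in the collection would have to go here ...
])
df = (spark.read
      .format("mongodb")
      .option("database", "mydb")
      .option("collection", "mycoll")
      .schema(explicit)
      .load())
```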
Related to: SPARK-375 "failed to infer schema of array field if there is data with empty array value" (Closed)