Background
I have a MongoDB collection with 8M documents and many fields. In half of the documents the metadata field is a string; in the rest it is an object.
What is my issue
When I dump the data from MongoDB to Databricks using the MongoDB Spark connector, it sometimes succeeds and sometimes fails with:
com.mongodb.spark.sql.connector.exceptions.DataException: Invalid field: 'uriMetadata'. The dataType 'struct' is invalid for 'BsonString{value='xxxxxxx'}'.
I believe the failure happens when the connector infers the schema from a sample that only contains documents whose metadata value is an object. The metadata column then becomes a struct column in Databricks, and the job fails because string data cannot be inserted into a struct column.
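The sampling problem can be illustrated with a small stdlib-only sketch (the infer_type helper and the docs data are hypothetical illustrations, not the connector's actual inference code):

```python
# Minimal illustration of sample-based type inference (hypothetical helper,
# not the connector's code): the inferred type depends entirely on which
# documents happen to land in the sample.

def infer_type(sampled_docs, field):
    """Infer a single Spark-like type name for `field` from a sample."""
    types = {("struct" if isinstance(d[field], dict) else "string")
             for d in sampled_docs if field in d}
    # A real inference pass would try to merge types; here we just report
    # a conflict so the problem is visible.
    return types.pop() if len(types) == 1 else "conflict"

# Half the collection stores metadata as a string, half as an object.
docs = [{"metadata": "plain string"}, {"metadata": {"uri": "xxxxxxx"}}]

# If the sample only sees object-valued docs, metadata is inferred as a
# struct, and the string-valued documents later fail to load.
print(infer_type(docs[1:], "metadata"))  # struct
print(infer_type(docs, "metadata"))      # conflict
```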
What do I want
I would like something similar to schemaHints in the MongoDB Spark connector, so that I can provide a schema hint for only the metadata column, telling the connector to treat it as a string column.
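For comparison, Databricks Auto Loader already exposes this kind of partial hint through cloudFiles.schemaHints; something analogous for the MongoDB connector is what I am after. The snippet below shows the Auto Loader API, not an existing connector option, and the paths are placeholders:

```python
# Databricks Auto Loader lets you pin the type of individual columns while
# the rest of the schema is still inferred -- the behaviour I would like
# from the MongoDB Spark connector. (Paths here are placeholders.)
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Only `metadata` is pinned; all other columns are still inferred.
      .option("cloudFiles.schemaHints", "metadata STRING")
      .option("cloudFiles.schemaLocation", "/tmp/schema-location")
      .load("/tmp/source-path"))
```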
What have I considered
- Increase sampleSize
I know I can set sampleSize higher to increase the chance that the connector infers the schema from a sample containing string metadata values. However, this still does not guarantee that a string metadata value is included in the sample, so metadata can still be inferred as a struct column.
- Provide a full schema with .schema(my_schema)
My collection has many fields and complicated nested schema. We may also introduce new fields to the collections from time to time. It is difficult for me to define a full schema of the collections. As a result, I would just like to partially define the schema for some fields only.
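Concretely, the two options I considered look like this with the v10 connector. This is a sketch rather than a fix; the database/collection names and the sampleSize value are placeholders:

```python
from pyspark.sql.types import StringType, StructField, StructType

# Option 1: raise sampleSize. This only improves the odds -- inference can
# still miss every string-valued metadata document.
df = (spark.read
      .format("mongodb")
      .option("database", "mydb")        # placeholder
      .option("collection", "mycoll")    # placeholder
      .option("sampleSize", 100000)      # connector default is 1000
      .load())

# Option 2: supply an explicit schema, which disables inference entirely.
# Pinning metadata to StringType works, but every other field must also be
# declared -- impractical for a wide, evolving collection.
explicit = StructType([
    StructField("metadata", StringType(), nullable=True),
    # ... every other field in the collection would have to go here ...
])
df = (spark.read
      .format("mongodb")
      .option("database", "mydb")
      .option("collection", "mycoll")
      .schema(explicit)
      .load())
```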
Related to: SPARK-375 "failed to infer schema of array field if there is data with empty array value" (Closed)