Type: Improvement
Resolution: Done
Priority: Minor - P4
Affects Version/s: None
Component/s: None
The Spark Catalyst engine supports a relatively small number of data types. Currently, ObjectIds are cast to strings, but when the DataFrame is saved back to MongoDB that type information is lost.
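For illustration, a minimal round trip that shows the behaviour, assuming the SparkSession-based API of the mongo-spark-connector (MongoSpark.load / MongoSpark.save); the URIs are placeholders only:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("objectid-roundtrip")
  .config("spark.mongodb.input.uri", "mongodb://localhost/test.coll")
  .config("spark.mongodb.output.uri", "mongodb://localhost/test.out")
  .getOrCreate()

// The inferred schema represents _id as StringType.
val df = MongoSpark.load(spark)
df.printSchema()

// Saving writes _id back as a plain BSON string, so the ObjectId type is lost.
MongoSpark.save(df)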
Was: Identify and wrap _id columns with ObjectId when writing the dataframe
When reading from Mongo, the _id attribute is represented as a string in the DataFrame. Given that you might do some transformations and later write back to Mongo, the _id attribute is written as a plain string. Would it be possible to detect whether a value is a valid ObjectId and wrap it when storing the DataFrame back into Mongo?
In the rowToDocument function, for example:

row.schema.fields.zipWithIndex.foreach({ case (field, i) =>
  val data = field.dataType match {
    case arrayField: ArrayType if !row.isNullAt(i) => arrayTypeToData(arrayField, row.getSeq(i))
    case subDocument: StructType if !row.isNullAt(i) => rowToDocument(row.getStruct(i))
    case _ =>
      // Wrap string-typed _id values in an ObjectId before appending them to the document.
      if (field.name == "_id" && field.dataType.typeName == "string") new ObjectId(row.getString(i))
      else row.get(i)
  }
  document.append(field.name, data)
})
I could imagine that a regex test could be used to make sure the value is a valid ObjectId, or alternatively the StructField metadata could mark the column as an ObjectId when inferring the schema; a sketch of both ideas follows.
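A hedged sketch of both approaches: ObjectId.isValid comes from the BSON driver, while the "objectId" metadata key is a hypothetical name for illustration, not an existing connector convention.

import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField}
import org.bson.types.ObjectId

// Option 1: only wrap values that are syntactically valid ObjectIds.
def wrapIfObjectId(value: String): Any =
  if (ObjectId.isValid(value)) new ObjectId(value) else value

// Option 2: tag the column when the schema is inferred, so the writer can
// check the metadata instead of guessing from the field name and value.
val objectIdMetadata: Metadata =
  new MetadataBuilder().putBoolean("objectId", true).build()

val idField = StructField("_id", StringType, nullable = false, objectIdMetadata)

// A writer could then test idField.metadata.contains("objectId") before wrapping.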
related to: SPARK-44 Mongo Date type breaks load (Closed)