Type: Improvement
Resolution: Done
Priority: Minor - P4
Affects Version/s: None
Component/s: None
The Spark Catalyst engine supports a relatively small number of data types. Currently, ObjectIds are cast to strings, but when the DataFrame is saved back to MongoDB that type information is lost.
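For illustration, a minimal round trip that shows the behaviour, assuming the SparkSession-based API of the mongo-spark-connector (MongoSpark.load / MongoSpark.save); the URIs are placeholders only:

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("objectid-roundtrip")
  .config("spark.mongodb.input.uri", "mongodb://localhost/test.coll")
  .config("spark.mongodb.output.uri", "mongodb://localhost/test.out")
  .getOrCreate()

// The inferred schema represents _id as StringType.
val df = MongoSpark.load(spark)
df.printSchema()

// Saving writes _id back as a plain BSON string, so the ObjectId type is lost.
MongoSpark.save(df)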
Was: Identify and wrap _id columns with ObjectId when writing the dataframe
When reading from Mongo, the _id attribute is represented as a string in the DataFrame. Given that you might do some transformations and later write back to Mongo, the _id attribute is written as a plain string. Would it be possible to detect whether a value is a valid ObjectId and wrap it when storing the DataFrame back into Mongo?
In the rowToDocument function, for example:

row.schema.fields.zipWithIndex.foreach({ case (field, i) =>
  val data = field.dataType match {
    case arrayField: ArrayType if !row.isNullAt(i) => arrayTypeToData(arrayField, row.getSeq(i))
    case subDocument: StructType if !row.isNullAt(i) => rowToDocument(row.getStruct(i))
    case _ =>
      // Wrap string-typed _id values in an ObjectId before appending them to the document.
      if (field.name == "_id" && field.dataType.typeName == "string") new ObjectId(row.getString(i))
      else row.get(i)
  }
  document.append(field.name, data)
})
I could imagine that a regex test could be used to make sure the value is a valid ObjectId, or alternatively the StructField metadata could mark the column as an ObjectId when inferring the schema; a sketch of both ideas follows.
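A hedged sketch of both approaches: ObjectId.isValid comes from the BSON driver, while the "objectId" metadata key is a hypothetical name for illustration, not an existing connector convention.

import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField}
import org.bson.types.ObjectId

// Option 1: only wrap values that are syntactically valid ObjectIds.
def wrapIfObjectId(value: String): Any =
  if (ObjectId.isValid(value)) new ObjectId(value) else value

// Option 2: tag the column when the schema is inferred, so the writer can
// check the metadata instead of guessing from the field name and value.
val objectIdMetadata: Metadata =
  new MetadataBuilder().putBoolean("objectId", true).build()

val idField = StructField("_id", StringType, nullable = false, objectIdMetadata)

// A writer could then test idField.metadata.contains("objectId") before wrapping.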
related to: SPARK-44 Mongo Date type breaks load (Closed)