Title was: "Can't write json RDD to Mongo which Mongo is famous for schema-less design"
The problem is that we cannot use the Python API of the Mongo Spark Connector to write an RDD to MongoDB.
Note: we have tried the latest version.
Business impact:
All Python users who are trying to write dynamic-schema data into MongoDB are affected.
MongoDB is known for its schema-less design, but only the Scala API of the Mongo Spark Connector can write an RDD with a dynamic schema back to MongoDB, so Python users are blocked.
Dynamic Schema Challenge
Spark has RDDs and DataFrames. By design, an RDD supports a dynamic schema, while a DataFrame supports only an explicit schema, in exchange for better performance.
The Mongo Spark Connector Scala API supports RDD read and write, but the Python API does not. The Python API supports only DataFrames, which by Spark's design do not support a dynamic schema.
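To make the constraint concrete, here is a minimal pure-Python sketch (no Spark involved) of why dynamic-schema documents do not fit one explicit schema. The sample documents and the helper name conflicting_fields are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def conflicting_fields(docs):
    """Return {field: sorted type names} for top-level fields seen with more than one type."""
    seen = defaultdict(set)
    for doc in docs:
        for key, value in doc.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


# Hypothetical documents: "category" is a list in one and a dict in the other.
sample_docs = [
    {"name": "part-a", "category": ["Semiconductors", "Logic ICs"]},
    {"name": "part-b", "category": {"0": "Semiconductors", "1": "Logic ICs"}},
]

conflicts = conflicting_fields(sample_docs)
print(conflicts)
```

In this sample, "category" is reported with both dict and list. A single explicit schema has to pick one type for that field, whereas an RDD of dictionaries can carry both documents unchanged.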
----Workaround for Read phase, completed
1. Read the Mongo documents into a DataFrame
2. Dump the data to a JSON string
3. Transfer it to the TD Spark application
----Blocking issue in Write phase, pending on Mongo Spark team
For the write, we parse the string into dynamic-schema dictionaries in an RDD; however, we can't push that RDD to the connector without converting it to a DataFrame.
I think we need to consult with the Mongo Spark team. Once Mongo Spark supports RDD writing, we can migrate all code to Python.
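Until the connector's Python API accepts RDDs, one possible stopgap is to bypass the connector for the write and insert each partition's dictionaries directly. The sketch below is not the connector's API. It assumes a collection object that exposes insert_many(list_of_dicts), as pymongo's Collection does. The names parse_documents, write_partition, batch_size and the in-memory stand-in are hypothetical.

```python
import json
from itertools import islice


def parse_documents(json_lines):
    """Parse JSON strings into dicts; each dict keeps its own shape."""
    for line in json_lines:
        yield json.loads(line)


def write_partition(docs, collection, batch_size=500):
    """Insert an iterator of dicts in batches and return the number written.

    `collection` is assumed to expose insert_many(list_of_dicts),
    as pymongo's Collection does.
    """
    docs = iter(docs)
    written = 0
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            return written
        collection.insert_many(batch)
        written += len(batch)


class InMemoryCollection:
    """Stand-in used only to exercise the sketch without a MongoDB server."""

    def __init__(self):
        self.batches = []

    def insert_many(self, batch):
        self.batches.append(list(batch))


# Hypothetical input: three JSON strings with three different shapes.
lines = ['{"a": 1}', '{"a": [1, 2]}', '{"b": {"c": 3}}']
target = InMemoryCollection()
count = write_partition(parse_documents(lines), target, batch_size=2)
print(count, target.batches)
```

With Spark, this could be called from rdd.foreachPartition, opening the MongoDB client inside the partition function so that the connection is created on the executor and not shipped from the driver. Whether this is acceptable depends on how much of the connector's behavior the job relies on.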
Issue History:
1. The RDD approach was deprecated in the mongo-hadoop project in March 2016.
The RDD method saveAsNewAPIHadoopFile, which was used to write data into MongoDB, has been deprecated.
rdd.saveAsNewAPIHadoopFile(
    path='file:///this-is-unused',
    outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf=
)
Announced at: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
2. ObjectId issue (resolved): converting to a DataFrame raised "TypeError: not supported type: <class 'bson.objectid.ObjectId'>"
Tracked by: https://jira.mongodb.org/browse/HADOOP-277
Schema related issues:
3. StructType issue: "com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ
4. Repartition issue:
"Cannot cast ARRAY into a StructType(StructField(0,StringType,true), StructField(1,StringType,true), StructField(2,StringType,true), StructField(3,StringType,true), StructField(4,StringType,true)) (value: BsonArray{values=[BsonString, BsonString{value='Logic ICs'}]})"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ