Type: Improvement
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.2.0
Component/s: API
Labels:
- python
Environment:
Spark with Mongo Connector

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Title was: "Can't write json RDD to Mongo which Mongo is famous for schema-less design"

The problem is that we can't use the Python API of the Mongo Spark Connector to write an RDD to MongoDB.
Note: we have tried the latest version.

Business impact:
All Python users who are trying to write dynamic-schema data into MongoDB.
MongoDB is known for its schema-less design, but only the Scala API of the Mongo Spark Connector can write an RDD with a dynamic schema back to MongoDB, so Python users are blocked.

Dynamic Schema Challenge
Spark has two abstractions, RDD and DataFrame. By design, an RDD supports dynamic schemas, while a DataFrame supports only an explicit schema, in exchange for better performance.
The Scala API of the Mongo Spark Connector supports RDD read and write, but the Python API does not. The Python API supports only DataFrames, which by Spark's design do not support dynamic schemas.
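
To make the dynamic-schema problem concrete, here is a minimal pure-Python sketch (no Spark or MongoDB needed). The sample documents and the helper names `field_types` and `conflicting_fields` are made up for illustration. It only shows how one field can carry different types across documents, which is the situation a single explicit DataFrame schema cannot describe.

```python
from collections import defaultdict

# Hypothetical documents: the same field ("category") holds different types.
docs = [
    {"part": "A1", "category": "Logic"},
    {"part": "B2", "category": ["Logic", "Logic ICs"]},
    {"part": "C3", "category": {"0": "Logic", "1": "Logic ICs"}, "stock": 12},
]


def field_types(documents):
    """Map each field name to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in documents:
        for key, value in doc.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


def conflicting_fields(documents):
    """Return the sorted field names that appear with more than one type."""
    types_by_field = field_types(documents)
    return sorted(k for k, types in types_by_field.items() if len(types) > 1)


print(conflicting_fields(docs))
```

A list of dicts (or an RDD of dicts) holds these three documents without complaint; a DataFrame would need one type for `category`.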

----Workaround for Read phase, completed
1. Read the Mongo documents into a DataFrame
2. Dump the data to JSON strings
3. Transfer the strings to the TD Spark application
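
Step 2 can be sketched in plain Python as below. This is an illustration under assumptions, not the code used in the ticket: the helper names `to_json_line` and `from_json_line` and the sample record are hypothetical, and `default=str` is just one way to cope with values that `json` cannot encode. In PySpark itself, `DataFrame.toJSON()` returns an RDD of JSON strings and would likely be the natural fit for this step.

```python
import json


def to_json_line(record):
    """Serialize one record (a dict) to a compact JSON string."""
    # default=str converts values json cannot encode into strings.
    # That choice is an assumption of this sketch.
    return json.dumps(record, sort_keys=True, default=str, separators=(",", ":"))


def from_json_line(line):
    """Parse a JSON string back into a dict; no fixed schema is required."""
    return json.loads(line)


# Hypothetical record with a nested array value.
record = {"part": "B2", "category": ["Logic", "Logic ICs"]}
line = to_json_line(record)
print(line)
print(from_json_line(line) == record)
```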

----Blocking issue in Write phase, pending on Mongo Spark team
For the write, we parse each string into a dynamic-schema dictionary and collect the dictionaries in an RDD, but we can't pass that RDD to the connector without first converting it to a DataFrame.
I think we need to consult the Mongo Spark team; once the Mongo Spark Connector supports RDD writes, we can migrate all of our code to Python.
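
For comparison, one possible way to write dictionaries without going through a DataFrame is to bypass the connector and insert each partition's documents directly. The sketch below is only an assumption-laden illustration, not something proposed in this ticket or by the connector team. `write_partition` is a hypothetical helper; it expects an object with an `insert_many` method, which is what a pymongo `Collection` provides. A stand-in class is used here so the example runs without a server. On a real cluster it would presumably be called from `rdd.foreachPartition`, with the client created inside the partition function.

```python
from itertools import islice


def write_partition(documents, collection, batch_size=500):
    """Insert an iterable of dicts in batches; return how many were written."""
    iterator = iter(documents)
    written = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # insert_many is the bulk-insert method on a pymongo Collection.
        collection.insert_many(batch)
        written += len(batch)
    return written


class FakeCollection:
    """Stand-in for a pymongo Collection (hypothetical, for this sketch only)."""

    def __init__(self):
        self.batches = []

    def insert_many(self, docs):
        self.batches.append(list(docs))


fake = FakeCollection()
count = write_partition(({"n": i} for i in range(5)), fake, batch_size=2)
print(count, [len(b) for b in fake.batches])
```

Batching keeps memory bounded per partition, since the iterator is never materialized in full.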

Issue History:
1. The RDD approach was deprecated in the mongo-hadoop project in March 2016.
The RDD method saveAsNewAPIHadoopFile, which used to write data into MongoDB, has been deprecated:
rdd.saveAsNewAPIHadoopFile(
    path='file:///this-is-unused',
    outputFormatClass='com.mongodb.hadoop.MongoOutputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf={'mongo.output.uri': 'mongodb://t2cUserQA:G05hark5@qa-t2c-node1.paradata.io:27017/t2c.JasonpartFromSpark2'}
)
Announced at: https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage
2. ObjectId issue (resolved): when converting to a DataFrame we found: "TypeError: not supported type: <class 'bson.objectid.ObjectId'>"
Tracked by: https://jira.mongodb.org/browse/HADOOP-277
Schema-related issues:
3. StructType issue: "com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ
4. Repartition issue:
"Cannot cast ARRAY into a StructType(StructField(0,StringType,true), StructField(1,StringType,true), StructField(2,StringType,true), StructField(3,StringType,true), StructField(4,StringType,true)) (value: BsonArray{values=[BsonString{value='Logic'}, BsonString{value='Logic ICs'}]})"
Tracked by: https://groups.google.com/forum/#!topic/mongodb-user/lQjppYa21mQ
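
The issues above all come from values that a single DataFrame column type cannot hold. As a rough illustration only, the sketch below turns every non-primitive value into a string before any DataFrame conversion, so each field ends up with one type. The helper `flatten_for_dataframe` and the `FakeObjectId` stand-in are hypothetical, and this is not claimed to be the fix applied in the linked threads. It also loses the nested structure, which may or may not be acceptable.

```python
import json

PRIMITIVES = (str, int, float, bool, type(None))


def flatten_for_dataframe(doc):
    """Return a copy of doc whose values are all primitives."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, PRIMITIVES):
            flat[key] = value
        elif isinstance(value, (list, dict)):
            # Nested arrays and objects become JSON text.
            flat[key] = json.dumps(value, sort_keys=True, default=str)
        else:
            # Anything else (for example an ObjectId-like value) becomes str().
            flat[key] = str(value)
    return flat


class FakeObjectId:
    """Stand-in for bson.objectid.ObjectId so the sketch needs no extra packages."""

    def __init__(self, hex_string):
        self.hex_string = hex_string

    def __str__(self):
        return self.hex_string


doc = {"_id": FakeObjectId("abc123"), "category": ["Logic", "Logic ICs"], "stock": 12}
print(flatten_for_dataframe(doc))
```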

Assignee: Ross Lawley
Reporter: Chao Zhang
Reviewers: None
Votes: 0
Watchers: 3

Created: Oct 10 2017 05:44:54 PM UTC
Updated: Oct 27 2023 11:54:01 AM UTC
Resolved: Dec 19 2017 02:35:54 PM UTC
