- Type: Improvement
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Performance, Schema
When the DataFrame API is used to load a MongoDB collection that contains a field with highly dynamic keys (keys that differ from document to document), the schema inference step generates a very large schema, which leads to long wait times or OutOfMemory errors.
My suggestion is to detect those fields and turn them into a MapType.
There would be two requirements for treating a field as a MapType:
1. Its keys and values are always of the same or a compatible type.
2. The field contains more than n keys (n should probably be configurable).
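The two requirements could be checked roughly as follows. This is only a sketch in plain Python, not the connector's implementation: the function name `should_infer_as_map`, the default threshold of 250, and the rule that `int` and `float` count as compatible are all assumptions made for illustration, and Python types stand in for BSON/Spark types. Keys are not checked separately because BSON field names are always strings.

```python
from collections.abc import Mapping
from typing import Any, Iterable


def _type_group(value: Any) -> str:
    """Map a value to a coarse type group (hypothetical compatibility rule)."""
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "bool"
    if isinstance(value, (int, float)):  # assumption: numeric types are compatible
        return "numeric"
    return type(value).__name__


def should_infer_as_map(samples: Iterable[Mapping], field: str, min_keys: int = 250) -> bool:
    """Return True if `field` looks like a map across the sampled documents.

    Requirement 1: all non-null values fall into one type group.
    Requirement 2: more than `min_keys` distinct keys were seen.
    """
    keys = set()
    value_groups = set()
    for doc in samples:
        sub = doc.get(field)
        if sub is None:  # field absent or null in this document
            continue
        if not isinstance(sub, Mapping):  # not a sub-document, so not a map
            return False
        for key, value in sub.items():
            keys.add(key)
            if value is not None:
                value_groups.add(_type_group(value))
    return len(value_groups) == 1 and len(keys) > min_keys


# Usage: ten documents, each with a different key under "counts".
samples = [{"counts": {f"user_{i}": i}} for i in range(10)]
print(should_infer_as_map(samples, "counts", min_keys=5))
```

With a check like this, a field that passes would get a single MapType entry in the schema instead of one struct field per distinct key.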
I will try to submit a pull request for this.