  1. Spark Connector
  2. SPARK-133

Detect MapType in Schema Infer Step

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 2.1.3, 2.2.4, 2.3.0
    • Affects Version/s: None
    • Component/s: Performance, Schema
    • Labels: None

      When the DataFrame API is used to load a MongoDB collection that contains a field with very dynamic keys, the schema inference step generates a very large schema, which leads to long wait times or OutOfMemory errors.

      My suggestion is to detect those fields and turn them into a MapType.
      A field would have to meet two requirements to be detected as a MapType:
      1. Its keys and its values are always of the same or a compatible type.
      2. It contains more than n keys, where n is probably configurable.
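
The two requirements above can be sketched as a small heuristic. This is only an illustration in plain Python, not the connector's implementation: the function names, the threshold of 250, and the rule that int widens to float are all assumptions made for the example. Because BSON document keys are always strings, the key half of requirement 1 holds trivially and only the value types are checked.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical default for "n" in requirement 2; the ticket does not name a value.
MIN_KEYS_FOR_MAP = 250

# Hypothetical compatibility rule: numeric types widen to the widest one seen.
_NUMERIC_ORDER = ["int", "float"]


def type_name(value: Any) -> str:
    """Map a sampled value to a simple type label."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    return type(value).__name__


def compatible_type(a: str, b: str) -> Optional[str]:
    """Return the common type of a and b, or None if they are incompatible."""
    if a == b:
        return a
    if a in _NUMERIC_ORDER and b in _NUMERIC_ORDER:
        widest = max(_NUMERIC_ORDER.index(a), _NUMERIC_ORDER.index(b))
        return _NUMERIC_ORDER[widest]
    return None


def infer_map_value_type(
    samples: Iterable[Dict[str, Any]],
    min_keys: int = MIN_KEYS_FOR_MAP,
) -> Optional[str]:
    """Return the map value type if the sampled sub-documents look like a map.

    `samples` holds the values of one field across the sampled documents.
    Returns None when the field should stay a regular struct.
    """
    keys = set()
    value_type: Optional[str] = None
    for doc in samples:
        for key, value in doc.items():
            keys.add(key)
            current = type_name(value)
            if value_type is None:
                value_type = current
            else:
                value_type = compatible_type(value_type, current)
                if value_type is None:
                    return None  # requirement 1 violated
    # Requirement 2: more than n distinct keys across the sample.
    return value_type if len(keys) > min_keys else None
```

A field for which this returns a type would be inferred as a MapType with string keys and that value type; any other field would keep the regular StructType inference. Counting distinct keys across the whole sample, rather than per document, is one possible reading of requirement 2.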

      I will try to submit a pull request for this.

            Assignee:
            ross@mongodb.com Ross Lawley
            Reporter:
            jniebuhr Jochen Niebuhr
            Votes:
            0
            Watchers:
            2

              Created:
              Updated:
              Resolved: