Spark Connector / SPARK-136

$limit stage produces more documents than expected

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Affects Version/s: 2.2.0
    • Component/s: API

      Case description

      When using the MongoDB Spark connector, the aggregation pipeline sometimes returns incorrect results at the $limit stage, regardless of the preceding stages: the number of returned documents exceeds the $limit value. This behavior was only observed on relatively large collections with "fat" documents.

      Code example

      Please consider the code snippet below, which consistently reproduces the issue for a single-document limit ($limit: 1).

      # Assumes spark.mongodb.input.uri and spark.mongodb.output.uri are set
      # in the Spark configuration (e.g. via --conf on spark-submit).
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # Write 100,000 documents, each holding an array of 100 integers.
      spark.createDataFrame([(i, [k for k in range(100)]) for i in range(100000)], ["seq", "data"]) \
          .write.format("com.mongodb.spark.sql.DefaultSource") \
          .mode("overwrite").save()

      # Read the collection back through a $limit pipeline; .option() passes
      # the list in its string form for the connector to parse.
      test2 = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
          .option("sampleSize", 100) \
          .option("pipeline", [{'$limit': 1}]) \
          .load()

      print('Test 2: Expected 1 row, got', test2.count(), 'rows:')
      test2.show()
      

      The following output is produced by the code:

      Test 2: Expected 1 row, got 3 rows:
      +--------------------+--------------------+-----+
      |                 _id|                data|  seq|
      +--------------------+--------------------+-----+
      |[598ceed7a751cc6b...|[0, 1, 2, 3, 4, 5...|    0|
      |[598ceed9a751cc6b...|[0, 1, 2, 3, 4, 5...|31697|
      |[598ceedaa751cc6b...|[0, 1, 2, 3, 4, 5...|66686|
      +--------------------+--------------------+-----+
      

      Obviously, the DataFrame is expected to have a single row.
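
      A plausible explanation, consistent with the "Works as Designed" resolution, is that the connector applies the supplied pipeline to each partition's query, so $limit caps every partition rather than the whole result. The sketch below (an assumption, reusing test2 from the snippet above) checks whether the row count tracks the partition count:

      # Hedged check: if the pipeline runs once per partition, the number of
      # returned rows should follow the partition count, not the $limit value.
      print('Partitions:', test2.rdd.getNumPartitions())
      print('Rows:', test2.count())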

      Please see the complete working example attached as app.py.

      The example consists of two tests: one for a small dataset (that passes) and one for a large dataset (that fails); a sketch of this layout follows below. Both tests execute queries against the same schema; only the number of documents differs.
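
      For reference, a minimal sketch of what such a two-test layout might look like (a hypothetical reconstruction; the actual app.py attachment may differ):

      def run_test(name, n_docs):
          # Hypothetical helper, not from the attachment: write n_docs documents
          # with the same schema, then read them back through a $limit pipeline.
          spark.createDataFrame([(i, [k for k in range(100)]) for i in range(n_docs)],
                                ["seq", "data"]) \
              .write.format("com.mongodb.spark.sql.DefaultSource") \
              .mode("overwrite").save()
          df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
              .option("pipeline", [{'$limit': 1}]).load()
          print(name, '- expected 1 row, got', df.count(), 'rows')

      run_test('Test 1 (small dataset)', 100)      # passes
      run_test('Test 2 (large dataset)', 100000)   # fails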

      The example was launched locally in standalone mode using the command:

      spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 app.py
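
      Not part of the original report, but a possible workaround sketch: apply the limit on the Spark side with DataFrame.limit, which Spark enforces across all partitions, instead of pushing $limit into the per-partition pipeline:

      # Workaround sketch (an assumption, not from the ticket): DataFrame.limit
      # is applied globally by Spark, however many partitions the read produces.
      limited = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
          .load() \
          .limit(1)

      print(limited.count())  # 1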
      

            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Illya Kokshenev (m0nzderr)
            Votes: 0
            Watchers: 3
