Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-37715

Use DISTINCT_SCAN for $unwind-$group pipelines

    • Query Optimization
    • Query Optimization 2021-05-17

      This is an extension of SERVER-9507, which allowed $group to be executed using a DISTINCT_SCAN stage. A distinct command over a multikey field can DISTINCT_SCAN a multikey index:

      > db.c.drop()
      true
      > db.c.createIndex({a: 1})
      {
      	"createdCollectionAutomatically" : true,
      	"numIndexesBefore" : 1,
      	"numIndexesAfter" : 2,
      	"ok" : 1
      }
      > db.c.insert({a: [1, 2, 3]})
      WriteResult({ "nInserted" : 1 })
      > db.c.insert({a: [2, 3, 4]})
      WriteResult({ "nInserted" : 1 })
      > db.c.explain().distinct("a")
      {
      	"queryPlanner" : {
      		"plannerVersion" : 1,
      		"namespace" : "test.c",
      		"indexFilterSet" : false,
      		"parsedQuery" : {
      
      		},
      		"queryHash" : "840B4B0E",
      		"winningPlan" : {
      			"stage" : "PROJECTION",
      			"transformBy" : {
      				"_id" : 0,
      				"a" : 1
      			},
      			"inputStage" : {
      				"stage" : "DISTINCT_SCAN",
      				"keyPattern" : {
      					"a" : 1
      				},
      				"indexName" : "a_1",
      				"isMultiKey" : true,
      				"multiKeyPaths" : {
      					"a" : [
      						"a"
      					]
      				},
      				"isUnique" : false,
      				"isSparse" : false,
      				"isPartial" : false,
      				"indexVersion" : 2,
      				"direction" : "forward",
      				"indexBounds" : {
      					"a" : [
      						"[MinKey, MaxKey]"
      					]
      				}
      			}
      		},
      		"rejectedPlans" : [ ]
      	},
      	"serverInfo" : {
      		"host" : "storchbox",
      		"port" : 27017,
      		"version" : "0.0.0",
      		"gitVersion" : "unknown"
      	},
      	"ok" : 1
      }
      

      When the same distinct operation is expressed using the aggregation framework, however, the $unwind inhibits the DISTINCT_SCAN optimization added in SERVER-9507:

      > db.c.explain().aggregate([{$unwind: {path: "$a", preserveNullAndEmptyArrays: true}}, {$group: {_id: "$a"}}])
      {
      	"stages" : [
      		{
      			"$cursor" : {
      				"query" : {
      
      				},
      				"fields" : {
      					"a" : 1,
      					"_id" : 0
      				},
      				"queryPlanner" : {
      					"plannerVersion" : 1,
      					"namespace" : "test.c",
      					"indexFilterSet" : false,
      					"parsedQuery" : {
      
      					},
      					"queryHash" : "8B3D4AB8",
      					"winningPlan" : {
      						"stage" : "COLLSCAN",
      						"direction" : "forward"
      					},
      					"rejectedPlans" : [ ]
      				}
      			}
      		},
      		{
      			"$unwind" : {
      				"path" : "$a",
      				"preserveNullAndEmptyArrays" : true
      			}
      		},
      		{
      			"$group" : {
      				"_id" : "$a"
      			}
      		}
      	],
      	"ok" : 1
      }
      

      This task is to add logic in order to optimize $unwind-$group pipelines to use DISTINCT_SCAN when appropriate. It is closely related to SERVER-27494, which describes how $unwind can sometimes be optimized away even without generating a DISTINCT_SCAN plan.

            Assignee:
            backlog-query-optimization [DO NOT USE] Backlog - Query Optimization
            Reporter:
            david.storch@mongodb.com David Storch
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated: