Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Sharding
Labels: None
Assigned Teams: Query Optimization
Case:
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

As described in the MongoDB Manual (http://docs.mongodb.org/manual/core/aggregation-pipeline-sharded-collections/), from 2.6 onwards aggregation pipelines are executed as follows in a sharded cluster:

When operating on a sharded collection, the aggregation pipeline is split into two parts. The first pipeline runs on each shard, or if an early $match can exclude shards through the use of the shard key in the predicate, the pipeline runs on only the relevant shards.

The second pipeline consists of the remaining pipeline stages and runs on the primary shard. The primary shard merges the cursors from the other shards and runs the second pipeline on these results. The primary shard forwards the final results to the mongos.
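The split described above can be illustrated with a small simulation. This is only a sketch in plain Python, not MongoDB code: it assumes a group-and-sum style aggregation, and the names `shard_part`, `merge_part`, and the sample documents are hypothetical.

```python
from collections import Counter


def shard_part(docs):
    """First half of the pipeline: runs on each shard and
    produces partial sums per group key (hypothetical helper)."""
    partial = Counter()
    for doc in docs:
        partial[doc["key"]] += doc["value"]
    return partial


def merge_part(partials):
    """Second half of the pipeline: runs on one merging shard and
    combines the partial sums from every shard (hypothetical helper)."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values per key
    return dict(total)


# Made-up documents held by three shards.
shards = [
    [{"key": "a", "value": 1}, {"key": "b", "value": 2}],
    [{"key": "a", "value": 3}],
    [{"key": "b", "value": 4}, {"key": "c", "value": 5}],
]

result = merge_part(shard_part(docs) for docs in shards)
print(result)
```

In this model the `shard_part` calls are independent and could run in parallel, while `merge_part` is a single sequential step.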

For long-running aggregation queries that process a lot of data, the second part of the pipeline (running on the primary shard) becomes a performance bottleneck when data access is fast enough that the query is bound by CPU rather than I/O, because that part of the pipeline runs in a single thread on the primary shard.

To improve performance for such long-running CPU-bound aggregations, it would be good to add multiple levels of merging so that the 'merger' role can be distributed across multiple shards. For example, in a sharded cluster with 16 shards, there could be 4 'first level' mergers, each responsible for merging the results from 4 shards, and then a 'second level' merger that merges the results from the first-level mergers and returns the result to mongos.
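The proposed multi-level merge could look roughly like the following. This is a minimal Python sketch of the idea, not an existing MongoDB feature: it assumes the merge step is associative (as with per-key sums), and `merge`, `tree_merge`, and the `fan_in` parameter are hypothetical names.

```python
from collections import Counter


def merge(partials):
    """Combine a group of partial per-key sums into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def tree_merge(partials, fan_in=4):
    """Merge partial results level by level.

    Each merger at a level handles at most `fan_in` inputs.
    Returns the final result and the number of merge levels used.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    levels = 0
    while len(level) > 1:
        # Each slice stands for one merger; in a real cluster these
        # could run on different shards at the same time.
        level = [merge(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return (level[0] if level else Counter()), levels


# 16 shards, each contributing a made-up partial count.
shard_partials = [Counter({"docs": n + 1}) for n in range(16)]
merged, levels = tree_merge(shard_partials, fan_in=4)
print(merged["docs"], levels)
```

With 16 inputs and a fan-in of 4, the sketch uses 4 first-level mergers and 1 second-level merger, matching the example in the description. The result equals a flat single-node merge only because summing is associative; stages that need the complete input in one place would not decompose this way.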

is duplicated by

SERVER-14985 Merge stages in aggregation should be distributed beyond primary shard

Closed

is related to

SERVER-18925 Merging part of aggregation pipeline should be performed on a random shard to distribute the load

Closed

Assignee: [DO NOT USE] Backlog - Query Optimization
Reporter: Jon Rangel (Inactive)
Participants: [DO NOT USE] Backlog - Query Optimization, Andy Schwerin, Jon Rangel, Kaloian Manassiev
Votes: 6
Watchers: 25

Created: Mar 25 2015 07:00:42 PM UTC
Updated: Dec 06 2022 04:54:05 AM UTC
