- Type: Improvement
- Resolution: Fixed
- Priority: Major - P3
- Component/s: Collection Management
We're planning to add new fields to command responses and to change the output of listIndexes. The Scope document will have more details.
Description of Linked Ticket
Summary
Without clustering, a collection is stored in a B-Tree by a RecordId that is not exposed to end users, and there is a primary key index (<primary key>, <RecordId>). With clustering, a collection is to be stored in a B-Tree by the collection’s primary key, and there is no primary key index. This project is a generalization of clustering for time series (PM-288), and will need to support upgrading existing collections to use clustering.
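The two layouts described above can be sketched as a toy model. This is an illustration only, not the server's storage code: plain Python dicts stand in for B-Trees, and the class and method names (`NonClusteredCollection`, `ClusteredCollection`, `find_by_pk`) are hypothetical.

```python
class NonClusteredCollection:
    """Documents keyed by a hidden RecordId, plus a primary key index."""

    def __init__(self):
        self._next_record_id = 1
        self.records = {}    # RecordId -> document (stand-in for the data B-Tree)
        self.pk_index = {}   # primary key -> RecordId (stand-in for the _id index)

    def insert(self, doc):
        record_id = self._next_record_id
        self._next_record_id += 1
        self.records[record_id] = doc
        self.pk_index[doc["_id"]] = record_id

    def find_by_pk(self, pk):
        record_id = self.pk_index[pk]    # first lookup: primary key index
        return self.records[record_id]   # second lookup: data table


class ClusteredCollection:
    """Documents keyed directly by primary key; no separate primary key index."""

    def __init__(self):
        self.records = {}    # primary key -> document

    def insert(self, doc):
        self.records[doc["_id"]] = doc

    def find_by_pk(self, pk):
        return self.records[pk]          # single lookup


plain = NonClusteredCollection()
clustered = ClusteredCollection()
for doc in ({"_id": 10, "x": "a"}, {"_id": 20, "x": "b"}):
    plain.insert(doc)
    clustered.insert(doc)
```

Both collections return the same document for a given `_id`; the clustered one reaches it without the intermediate RecordId hop.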
Motivation
Clustering by primary key is important for fast scale in/out in Serverless. This is largely because split and merge, which perform a physical copy (such as a file copy), will replace tenant migration/chunk migration, which performs a logical copy.
- If a tenant does not have local secondary indexes (e.g., only has global indexes), orphan cleanup can be done using truncate rather than individual document deletes. Orphan filtering is expensive, so fast orphan cleanup is particularly important when doing a physical copy. This is because with a logical copy, the recipient can only end up with orphans in the range being transferred, but with a physical copy, the recipient can end up with orphans outside the range being transferred (i.e., more orphans). Orphans also block the merge of two slices that were split from each other, since merge has to be on disjoint ranges.
- WT data tables for disjoint primary key ranges can be presented as a single table in constant time, for example by adding a root node above the two tables. This can significantly speed up merge, especially if combined with providing a union-view over any local secondary index tables. The tables can actually be merged into one file in the background.
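The constant-time merge idea in the last bullet can be sketched as follows. This is a simplified model under stated assumptions, not WiredTiger's implementation: `Table` and `RootNode` are hypothetical names, each table is assumed to be non-empty, and a sorted key list plus a dict stands in for a real B-Tree.

```python
class Table:
    """Toy stand-in for a data table clustered by primary key."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)   # assumed non-empty

    def min_key(self):
        return self.keys[0]

    def max_key(self):
        return self.keys[-1]

    def get(self, key):
        return self.rows.get(key)

    def scan(self):
        for key in self.keys:
            yield key, self.rows[key]


class RootNode:
    """Presents two tables with disjoint key ranges as one table.

    Construction copies no rows: it only records a separator key.
    """

    def __init__(self, left, right):
        # Merge requires disjoint ranges; an orphan left behind in `left`
        # whose key falls inside `right`'s range would fail this check.
        if left.max_key() >= right.min_key():
            raise ValueError("key ranges overlap; cannot present as one table")
        self.left = left
        self.right = right
        self.separator = right.min_key()

    def get(self, key):
        child = self.left if key < self.separator else self.right
        return child.get(key)

    def scan(self):
        # Ordered scan over the combined range, with no data movement.
        yield from self.left.scan()
        yield from self.right.scan()


low = Table({1: "a", 5: "b"})
high = Table({10: "c", 42: "d"})
merged = RootNode(low, high)
```

In this model the two underlying tables stay untouched, which mirrors the note above that the files can be physically merged later in the background.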
General benefits of clustering include:
- Faster lookup and range scans by primary key because you don't need to go through the primary key index.
- Faster orphan filtering for covered local index queries because local index entries contain the primary key.
One downside is that clustering may consume more space if there are local secondary indexes: without clustering, the primary key index reduces the number of copies of each primary key value.
Cast of Characters
- Product Owner: michael.gargiulo
- Project Lead:
- Program Manager: connie.chen
- Drivers Contact:
Documentation
Product Description
Scope Document
Technical Design Document
- is depended on by
  - MONGOSH-1172 Support creating collections with clustered indexes. (Closed)
  - VSCODE-330 Include the clusteredIndex option in the collection creation template. (Closed)
- related to
  - DRIVERS-2325 Add commandStartedEvent assertions to clustered index spec tests (Implementing)
- split to
  - JAVA-4576 Clustered Indexes for all Collections (Closed)
  - PHPLIB-843 Clustered Indexes for all Collections (Closed)
  - RUST-1271 Clustered Indexes for all Collections (Closed)
  - CDRIVER-4359 Clustered Indexes for all Collections (Closed)
  - CSHARP-4141 Clustered Indexes for all Collections (Closed)
  - CXX-2491 Clustered Indexes for all Collections (Closed)
  - GODRIVER-2383 Clustered Indexes for all Collections (Closed)
  - MOTOR-935 Clustered Indexes for all Collections (Closed)
  - NODE-4189 Clustered Indexes for all Collections (Closed)
  - PYTHON-3227 Clustered Indexes for all Collections (Closed)
  - RUBY-2959 Clustered Indexes for all Collections (Closed)