Overview
Currently, an aggregation can set only a single collation for the entire pipeline. This makes some sense for aggregations that originate from collections, but it is more problematic for change streams that span multiple collections (as, e.g., mongosync uses). A client can easily introduce a data consistency problem by overlooking that string comparisons in such a change stream use the simple collation, regardless of the respective collections’ default collations.
REP-3312 was such a problem. This prompted a Critical Advisory for mongosync, which led (in part) to the present [Migration & Backup Correctness|INIT-532] initiative, which includes mongosync’s current [collation-fixes epic|REP-3672].
This task proposes to facilitate a fix for this by exposing the server’s internal index key via an aggregation operator, which I’ll tentatively call $_internalIndexKey. This operator would look thus:
{
  $_internalIndexKey: {
    input: "abc",   // … but can be any arbitrary BSON value
    collation: { locale: "en", strength: 1 },
  }
}
… and would output, as a binary blob, the index key that the server would create for that string & collation.
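As a rough illustration of how a client might use the operator, the sketch below only builds the pipeline stage as a plain Python dictionary; it does not talk to a server. Note that $_internalIndexKey is the operator proposed in this ticket and does not exist today, and that the helper names, the field path fullDocument.name, and the output field _collatedKey are hypothetical choices made for this example.

```python
from typing import Any


def internal_index_key_expr(input_expr: Any, collation: dict) -> dict:
    """Build the PROPOSED $_internalIndexKey expression (not an existing operator)."""
    return {"$_internalIndexKey": {"input": input_expr, "collation": collation}}


def add_key_stage(field_path: str, collation: dict,
                  out_field: str = "_collatedKey") -> dict:
    """Build an $addFields stage that stores the key for one field.

    out_field is a hypothetical name chosen for this sketch.
    """
    expr = internal_index_key_expr(f"${field_path}", collation)
    return {"$addFields": {out_field: expr}}


# Example: key the (hypothetical) "name" field of each change event's
# fullDocument with the same collation shown in the ticket.
stage = add_key_stage("fullDocument.name", {"locale": "en", "strength": 1})
pipeline = [stage]
```

A client could then compare the resulting binary keys instead of the raw strings, which is the comparison the collection's collation would actually imply.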
This will facilitate REP-3312’s fix.
Numeric Types
As a convenience, this task also envisions that the $_internalIndexKey operator will normalize numeric types. Thus, mongosync will have an easy way to tell via aggregation that { $numberLong: 42 } and { $numberDouble: 42 } are, in fact, the same number. See comments and linked tickets for context on how this helps us.
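To make the motivation concrete, the following sketch shows why a byte-level comparison of the two values fails while a normalized comparison succeeds. It is purely illustrative: it uses the standard little-endian encodings of a BSON int64 and a BSON double, and the normalized_number helper is a hypothetical stand-in, not the server's index key format or algorithm.

```python
import struct
from decimal import Decimal


def bson_int64_bytes(n: int) -> bytes:
    """Payload bytes of a BSON int64: 8-byte little-endian signed integer."""
    return struct.pack("<q", n)


def bson_double_bytes(x: float) -> bytes:
    """Payload bytes of a BSON double: 8-byte little-endian IEEE 754."""
    return struct.pack("<d", x)


def normalized_number(v) -> Decimal:
    """Hypothetical normalization: map int and float to one exact numeric type."""
    if isinstance(v, bool) or not isinstance(v, (int, float)):
        raise TypeError("expected an int or a float")
    return Decimal(v)


# The raw encodings of 42 as a long and as a double differ ...
same_bytes = bson_int64_bytes(42) == bson_double_bytes(42.0)

# ... but after normalization the two values compare equal.
same_number = normalized_number(42) == normalized_number(42.0)
```

Here same_bytes is False and same_number is True, which is the distinction the proposed operator would let mongosync observe from within an aggregation.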
Rejected Alternatives
See REP-3672’s (in-progress) technical design for a list of considered alternative solutions.
See SERVER-84198 for an additional request to facilitate full collation support with document filtering in mongosync.
- related to: SERVER-84462 Consider offering a way for $toHashedIndexKey expression to apply a collation (Backlog)
- split to: SERVER-84198 Facilitate multiple collations within the same change stream. (Closed)