- Type: New Feature
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 1.6.3
- Component/s: Index Maintenance, Querying
- Query
- Fully Compatible
ISSUE SUMMARY
Version 3.3.11 of MongoDB introduces support for Unicode-aware string comparisons, allowing users to issue queries that sort and match UTF-8 encoded string data in a locale-aware fashion. The server will accept a collation document specifying the locale, along with other properties of the string comparator such as diacritic sensitivity and case sensitivity. The collation can be attached at the operation level to a particular query. Alternatively, a default collation can be specified at collection creation time, in which case it will be used by all operations over the collection.
TECHNICAL DETAILS
Syntax for specifying a collation
The collation is specified with a document of the following form:
collation: { locale: <string>, caseLevel: <bool>, caseFirst: <string>, strength: <int>, numericOrdering: <bool>, alternate: <string>, maxVariable: <string>, normalization: <bool>, backwards: <bool> }
All fields are optional, except for the locale field, which is required. The list of supported locales as well as documentation of all collation options is available here: Development Series 3.3.x Collation.
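As a rough illustration of the "only locale is required" rule, the following sketch checks the shape of a collation document on the client side. The checkCollation helper is hypothetical and not part of any MongoDB API; the field names are the ones listed in the collation document above, and the server performs its own validation.

```javascript
// Option names taken from the collation document shown above.
const COLLATION_FIELDS = [
  "locale", "caseLevel", "caseFirst", "strength", "numericOrdering",
  "alternate", "maxVariable", "normalization", "backwards",
];

// Hypothetical helper (not a MongoDB API): rejects a collation document
// that lacks a locale or that contains a field not listed above.
function checkCollation(spec) {
  if (typeof spec.locale !== "string" || spec.locale.length === 0) {
    throw new Error("collation requires a 'locale' string");
  }
  const unknown = Object.keys(spec).filter((k) => !COLLATION_FIELDS.includes(k));
  if (unknown.length > 0) {
    throw new Error("unknown collation field(s): " + unknown.join(", "));
  }
  return spec;
}

// Only locale is required; every other field may be left out.
const minimal = checkCollation({ locale: "fr_CA" });
const detailed = checkCollation({ locale: "fr", strength: 1, numericOrdering: true });
```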
Supported operations
A collation can be attached at the operation level to the following commands:
- aggregate
- count
- distinct
- find
- findAndModify
- geoNear
- group
- mapReduce
- remove
- update
If the collation is omitted, then the collection's default collation will be used.
An operation with a collation will use the collation for all string comparisons of stored data. If, for example, an aggregation is issued with a $match stage followed by a $sort stage with the diacritic-insensitive French collation, then the server will apply the diacritic-insensitive French semantics to both the match and the sort.
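The aggregation described above might be written as the following command document. This is a sketch under assumptions: the collection name myColl and the term field are borrowed from the example further down, and strength: 1 is assumed here to be the setting that makes the French comparison ignore diacritics (it also ignores case).

```javascript
// Assumed setting: primary strength compares base characters only.
const frenchNoDiacritics = { locale: "fr", strength: 1 };

// One collation for the whole operation: it applies to the $match
// comparison and to the $sort ordering alike.
const aggregateCmd = {
  aggregate: "myColl",
  pipeline: [
    { $match: { term: "cote" } },
    { $sort: { term: 1 } },
  ],
  cursor: {},
  collation: frenchNoDiacritics,
};

// In the mongo shell this could be issued as db.runCommand(aggregateCmd), or as
// db.myColl.aggregate(aggregateCmd.pipeline, { collation: frenchNoDiacritics }).
```

The collation is attached once, at the command level, rather than per stage.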
Index support
A collation can also be associated with an index at index creation time. An index with a collation can support string matching and string sorting operations only if the collation associated with the index is identical to the collation associated with the query. The following index types accept a collation at index build time:
- btree
- 2dsphere
Index builds issued against a collection with a default collation will inherit the collection default unless an overriding collation is specified explicitly on the createIndex command.
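The inheritance and matching rules above can be sketched in plain JavaScript. Both helpers are hypothetical illustrations rather than the server's implementation, and the field-by-field comparison is a simplification of what "identical" means on the server.

```javascript
// Hypothetical helper: an explicit collation on createIndex overrides the
// collection default; otherwise the default is inherited.
function effectiveIndexCollation(collectionDefault, explicit) {
  return explicit !== undefined ? explicit : collectionDefault;
}

// Simplified identity check: same set of fields with the same values.
function sameCollation(a, b) {
  const keysA = Object.keys(a).sort();
  const keysB = Object.keys(b).sort();
  return keysA.length === keysB.length &&
    keysA.every((k, i) => k === keysB[i] && a[k] === b[k]);
}

const collectionDefault = { locale: "fr_CA" };

// e.g. db.myColl.createIndex({ term: 1 })
const inherited = effectiveIndexCollation(collectionDefault, undefined);
// e.g. db.myColl.createIndex({ term: 1 }, { collation: { locale: "pl" } })
const overridden = effectiveIndexCollation(collectionDefault, { locale: "pl" });

// A query using fr_CA can use the inherited index but not the overriding one.
const queryCollation = { locale: "fr_CA" };
const inheritedUsable = sameCollation(inherited, queryCollation);   // true
const overriddenUsable = sameCollation(overridden, queryCollation); // false
```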
Example
The following example demonstrates how to use the mongo shell to sort strings using French Canadian comparison rules:
> db.myColl.insert([{_id: 1, "term": "cote"}, {_id: 2, "term": "coté"}, {_id: 3, "term" : "côte"}, {_id: 4, "term" : "côté"}]);
> db.myColl.find().sort({"term": -1}).collation({"locale": "fr_CA"});
{ "_id" : 4, "term" : "côté" }
{ "_id" : 2, "term" : "coté" }
{ "_id" : 3, "term" : "côte" }
{ "_id" : 1, "term" : "cote" }
Note that the order in which the result set is sorted would be different without the .collation() modifier, as the fr_CA locale includes the backwards option by default, enabling special French comparison rules for diacritical marks.
More details
For more thorough technical details, please refer to the MongoDB documentation.
IMPACT ON DOWNGRADE
Downgrade from 3.4 to 3.2 is illegal if the data files contain any collections or indices with a collation. Before downgrading, all collections and indices with an associated collation must be dropped.
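One way to find what must be dropped is to scan collection and index metadata for a collation field. The sketch below works on plain objects; the combined shape (each collection entry carrying its own indexes array) and the findCollationUsers helper are assumptions made for illustration. In the mongo shell, similar data could be gathered from, for instance, db.getCollectionInfos() and db.<collection>.getIndexes().

```javascript
// Assumed shapes: collections look like { name, options: { collation }, indexes }
// and index specs look like { name, key, collation }. Entries without a
// collation simply omit the field.
function findCollationUsers(collections) {
  const hits = [];
  for (const coll of collections) {
    if (coll.options && coll.options.collation) {
      hits.push({ collection: coll.name, index: null });
    }
    for (const idx of coll.indexes || []) {
      if (idx.collation) {
        hits.push({ collection: coll.name, index: idx.name });
      }
    }
  }
  return hits;
}

// Hypothetical metadata for two collections.
const sample = [
  { name: "plain", options: {}, indexes: [{ name: "_id_", key: { _id: 1 } }] },
  {
    name: "myColl",
    options: { collation: { locale: "fr_CA" } },
    indexes: [{ name: "term_1", key: { term: 1 }, collation: { locale: "fr_CA" } }],
  },
];

// Everything listed here would have to be dropped before downgrading.
const blockers = findCollationUsers(sample);
```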
FURTHER INFORMATION
Documentation for this feature is available in the 3.3.x development series release notes. To join our beta program for Collation Support in MongoDB, and suggest improvements to our implementation, please email beta@mongodb.com.
Original description
I need MongoDB to sort characters properly; they currently come out in the wrong order when sorting UTF-8 strings. MySQL has a "collation" option that lets us order a result list correctly by Polish characters, e.g. utf8_polish_ci.
is depended on by
- DRIVERS-291 Support providing collation per operation (Closed)
- SERVER-90 case insensitive index (Closed)
is related to
- SERVER-9367 toLowerCase() function does not work for Turkish char "İ" (Closed)
related to
- CXX-290 Problem with Query & hint (const string &jsonKeyPatt) with compound index in locale with comma as decimal point (Closed)