- Type: New Feature
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Aggregation Framework
- Fully Compatible
FEATURE DESCRIPTION
This feature adds three new expressions, $regexFind, $regexFindAll and $regexMatch, to the aggregation language. The $regexFind and $regexFindAll expressions allow regex matching and capturing. $regexMatch is syntactic sugar on top of $regexFind that can be used for regex matching.
VERSIONS
This feature is available in the 4.1.11 and newer development versions of MongoDB, and in the 4.2 and newer production releases.
RATIONALE
Regex search is a powerful feature of the match language, but does not exist within the aggregation framework. Adding it would unlock many string manipulation use cases and bring the two languages closer together. MongoDB Stitch would also be able to leverage these expressions to allow users to define visibility rules using regular expressions.
OPERATION
Syntax
Input
{$regexFind: { // returns the first match found
  input: <expression>,
  regex: <expression>,
  options: <expression> // optional
}}
{$regexFindAll: { // returns every match
  input: <expression>,
  regex: <expression>,
  options: <expression> // optional
}}
{$regexMatch: { // returns true/false
  input: <expression>,
  regex: <expression>,
  options: <expression> // optional
}}
input: string, or expression evaluating to a string
regex: /pattern/opts, or "string pattern", or expression resolving to a regex type. Does not support the Extended JSON regex syntax of {$regex: <string>, $options: <options>}.
options: "imsx", or expression resolving to a string
Note that this syntax is different from the syntax used to specify regexes and options elsewhere in the server. The $regex match expression may take the form {$regex: <pattern>, $options: <options>}. The important difference is that we are hoisting the 'regex' and 'options' fields into the top-level object. This lets us avoid repeating "regex" (e.g. {input: "x", regex: {$regex: "xyz", $options: "123"}}). Here are some examples:
{$regexFind: {input: "$text", regex: /pattern/opts}}
{$regexMatch: {input: "hello world", regex: "$pathToRegexField"}}
{$regexFindAll: {input: "$text", regex: "pattern", options: "mi"}}
options includes all the regex options currently supported in the match language:
'i' - case insensitive
'm' - newlines match ^ and $
'x' - extended mode (allows for comments, ignores whitespace in the regex, etc.)
's' - allows . to include newline characters
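As an illustration of the two ways to pass options, the sketch below builds two $project stages that, going by the syntax described above, should request the same case-insensitive match: one carries the flag on a regex literal, the other passes a string pattern plus a separate options string. The field names (text, isSimple) and the pattern are made up for this example; the stages are plain JavaScript objects that would be handed to db.coll.aggregate() in the shell.

```javascript
// Hypothetical field names ("text", "isSimple") and pattern, for illustration only.

// Variant 1: the 'i' flag is carried by the regex literal itself.
const withLiteralFlags = {
  $project: {
    isSimple: { $regexMatch: { input: "$text", regex: /^simple/i } }
  }
};

// Variant 2: the pattern is a string and the flag is passed in the
// top-level "options" field.
const withOptionsField = {
  $project: {
    isSimple: { $regexMatch: { input: "$text", regex: "^simple", options: "i" } }
  }
};

// In the mongo shell, either stage would be used like this:
//   db.coll.aggregate([withLiteralFlags])
//   db.coll.aggregate([withOptionsField])
```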
Output
$regexFind will return a single document with the format below, for the leftmost substring in input which matches the regex. If no such substring exists, it will return null. $regexFindAll will return an array of documents (one for each substring in input which matches the regex), each of which has the same format as below. If no matches are found, an empty array will be returned.
$regexFind
{
  match: <string>,
  captures: [<string>, <string>, ...],
  idx: <non-negative integer>
}
$regexFindAll
[{
  match: <string>,
  captures: [<string>, <string>, ...],
  idx: <non-negative integer>
}, ...]
match: the string that the pattern matched.
captures: an array of substrings within the match captured by parentheses in the regex pattern, ordered by appearance of the opening parentheses from left to right. This is an empty array if there were no captures.
idx: a zero-based index indicating where the first character of the match appears in the input being searched. The index is counted in code points, not bytes.
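To make the code point versus byte distinction concrete, here is a small sketch in plain JavaScript (not server code). It locates "mp" in a string that starts with a character outside the Basic Multilingual Plane and reports the position three ways. Going by the description above, idx would correspond to the code point count; the variable names are chosen for this example only.

```javascript
// Plain JavaScript, for illustration; this is not how the server computes idx.
// "\u{1F600}" is a single code point that takes 2 UTF-16 code units and 4 UTF-8 bytes.
const text = "\u{1F600} mp";
const m = /mp/.exec(text);

// Position as JavaScript reports it: counted in UTF-16 code units.
const utf16Index = m.index;

// Position counted in code points: Array.from iterates a string by code point.
const codePointIndex = Array.from(text.slice(0, m.index)).length;

// Position counted in UTF-8 bytes, for comparison.
const utf8ByteOffset = new TextEncoder().encode(text.slice(0, m.index)).length;

console.log({ utf16Index, codePointIndex, utf8ByteOffset });
// { utf16Index: 3, codePointIndex: 2, utf8ByteOffset: 5 }
```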
We will also provide an alias, $regexMatch, for checking whether any substring matches a regex.
$regexMatch is sugar for
{$ne: [ {$regexFind: { <arguments> } }, null ] }
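The relationship between the two expressions can be sketched in plain JavaScript. This is an emulation of the semantics described in this ticket, not the server implementation: regexFind and regexMatch are hypothetical helper names, JavaScript's RegExp stands in for the server's regex engine, and the 'x' option is not available because JavaScript does not support it.

```javascript
// Emulation only: hypothetical helpers built on JavaScript's RegExp.
// "regex" may be a RegExp or a string pattern; "options" is an optional flag string.
function regexFind({ input, regex, options }) {
  const re = new RegExp(regex, options);
  const m = re.exec(input);
  if (m === null) return null; // no match: null, as described for $regexFind
  return {
    match: m[0],
    // Capture groups in order of their opening parentheses.
    // Unmatched optional groups are left as undefined in this sketch.
    captures: m.slice(1),
    // Convert JavaScript's UTF-16 index to a code point index.
    idx: Array.from(input.slice(0, m.index)).length
  };
}

// $regexMatch as sugar: true exactly when regexFind returns a non-null result.
function regexMatch(args) {
  return regexFind(args) !== null;
}

console.log(regexFind({ input: "Simple example", regex: "(m(p))" }));
// { match: 'mp', captures: [ 'mp', 'p' ], idx: 2 }
console.log(regexMatch({ input: "212-456-7890", regex: "^212.*$" }));  // true
console.log(regexMatch({ input: "1-800-212-000", regex: "^212.*$" })); // false
```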
This expression won’t be collation aware, so string comparisons implied by the regex will not match the collation (for example if a collection has a case-insensitive collation, the regex will not “automatically” perform a case-insensitive comparison).
Examples
Basic search with captures
Collection
{_id: 0, text:"Simple example"}
Pipeline
db.coll.aggregate([{
  $project: {
    matches: {
      $regexFindAll: {
        input: "$text",
        regex: "(m(p))"
      }
    }
  }
}])
Output
{
  _id: 0,
  matches: [
    { match: "mp", captures: ["mp", "p"], idx: 2 },
    { match: "mp", captures: ["mp", "p"], idx: 10 }
  ]
}
Email extraction
Collection
{_id: 0, text:"Some field text with email norberto@mongodb.com"}
Pipeline
db.coll.aggregate([{
  $project: {
    match: {
      $regexFind: {
        input: "$text",
        regex: /([a-zA-Z0-9._-]+)@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/
      }
    }
  }
}])
Output
{
  _id: 0,
  match: { match: "norberto@mongodb.com", captures: ["norberto"], idx: 27 }
}
No matches ($regexFind)
Collection
{_id: 0, text: "Some text with no matches"}
Pipeline
db.coll.aggregate([{
  $project: {
    match: {
      $regexFind: {
        input: "$text",
        regex: /not present/
      }
    }
  }
}])
Output
{_id: 0, match: null}
No matches ($regexFindAll)
Collection
{_id: 0, text: "Some text with no matches"}
Pipeline
db.coll.aggregate([{
  $project: {
    matches: {
      $regexFindAll: {
        input: "$text",
        regex: /not present/
      }
    }
  }
}])
Output
{_id: 0, matches: []}
Using regex stored in the document
Collection
{_id: 0, text: "text with 02 digits", regexField: /[0-9]+/}
Pipeline
db.coll.aggregate([{
  $project: {
    match: {
      $regexFind: {
        input: "$text",
        regex: "$regexField"
      }
    }
  }
}])
Output
{_id: 0, match: {match: "02", captures: [], idx: 10}}
Using $regexMatch in a $cond
Collection
{_id: 0, phoneNumber: "212-456-7890"}
{_id: 1, phoneNumber: "1-800-212-000"}
Pipeline
db.coll.aggregate([{
  $project: {
    region: {
      $cond: {
        if: { $regexMatch: { input: "$phoneNumber", regex: "^212.*$" } },
        then: "New York",
        else: "Somewhere Else"
      }
    }
  }
}])
Output
{_id: 0, region: "New York"}
{_id: 1, region: "Somewhere Else"}
Non-overlapping captures
Collection
{_id: 0, text:"aaaaa"}
Pipeline
db.coll.aggregate([{
  $project: {
    matches: {
      $regexFindAll: {
        input: "$text",
        regex: "(a*)"
      }
    }
  }
}])
Output
{
  _id: 0,
  matches: [
    {
      match: "aaaaa",
      captures: ["aaaaa"],
      idx: 0
    }
  ]
}
The purpose of the above example is to show that after a match is found, the search for the next match starts at the end of the previous one (e.g. instead of returning matches for "a", "aa" and "aaa", a single match for "aaaaa" is returned). This matches the behavior provided by Python, JavaScript and other languages. If other behavior is required, the non-greedy ? operator can be used, e.g. /(a+?)/.
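The difference between greedy and non-greedy scanning can be checked in plain JavaScript, one of the languages the paragraph above compares against. This is only an illustration of the matching behavior, not server code, and it uses a+ rather than a* so that empty matches do not come into play.

```javascript
// Plain JavaScript illustration of non-overlapping matching.
const text = "aaaaa";

// Greedy: the first match consumes the whole run of "a", and the scan
// resumes at the end of that match, so there is only one result.
const greedy = Array.from(text.matchAll(/(a+)/g), m => ({ match: m[0], idx: m.index }));

// Non-greedy: each match is as short as possible, so the scan resumes
// one character later each time and yields five single-character matches.
const lazy = Array.from(text.matchAll(/(a+?)/g), m => ({ match: m[0], idx: m.index }));

console.log(greedy);                // [ { match: 'aaaaa', idx: 0 } ]
console.log(lazy.map(r => r.idx));  // [ 0, 1, 2, 3, 4 ]
```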
is duplicated by
- SERVER-13902 Reverse regex functionality for queries (Closed)
- SERVER-32470 Support for $regex operator in $filter of aggregation pipeline. (Closed)
- SERVER-34122 Pattern Search Support in $filter & $cond operator (Closed)
- SERVER-9159 Use Regex capture groups with projections (Closed)
- SERVER-8892 Use $regex as the expression in a $cond (Closed)
is related to
- SERVER-36261 Support field projection based on string inside of field name (Backlog)
- SERVER-33389 aggregate function similar to regex replace (Closed)
related to
- SERVER-39694 Implement $regexMatch as syntactic sugar on top of $regexFind (Closed)
- SERVER-9156 Projection by a substring match (Closed)
- SERVER-13902 Reverse regex functionality for queries (Closed)
- SERVER-22104 $instr function to locate position of a "pattern" within a "string" (Closed)
- SERVER-8951 Add $findChar or $indexOf operator for strings to find position of specific character (or substring) (Closed)
- SERVER-39695 Implement $regexFind (Closed)
- SERVER-39696 Implement $regexFindAll (Closed)