-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: 2.6.8
-
Component/s: Text Search
-
Query Integration
This might be a non-trivial request, in which I understand it might not be implemented soon.
Based on customer feedback and googling some forums, a common "search engine experience" that users often want from full text search is the following query:
db.ftscollection.find( { $text : { $search : "John Smith New York" } }, { score : { $meta : "textScore" } } ) .sort( { score : { $meta : "textScore" } } ) .limit(10)
Ie the user just wants to search for a bunch of words and is hoping for the weighted scores to efficiently come up with a top ten result set. In reality MongoDB will proceed to scan all matching entries from the FTS index, then sort them.
I'm not familiar with the index structure used by our FTS. It seems to me if the score for each indexed word is included in the index (and index is sorted by that) then this would be possible.
For example, say that an FTS index contains mens and womens names with different weighted scores:
{ '' : 'Alice', score : 3 } { '' : 'Alice', score : 2 } { '' : 'Alice', score : 1 } { '' : 'Ben', score : 5 } { '' : 'John', score : 6 } { '' : 'John', score : 4 } { '' : 'John', score : 3 } { '' : 'John', score : 3 } { '' : 'John', score : 2 } { '' : 'John', score : 1 }
Assume now that a user is executing this query on a collection with the above index:
db.ftscollection.find( { $text : { $search : "John Alice" } }, { score : { $meta : "textScore" } } ) .sort( { score : { $meta : "textScore" } } ) .limit(2)
I'm assuming that the total score is just a sum of the scores for each word (which is already weighted in the index).
A greedy index scan would now start by scanning for "John", which has a higher top score compared to Alice. Say that it finds 2 documents where the total score is (for John+Alice): 6+0=6 and 4+3=7. By looking at the next index entries we see that the remaining highest score for John is 3 and for Alice is 2. This means that the highest possible score we could still expect to find is 3+2=5, which is < 6. Therefore the index scan can be terminated and sorted results returned to client.
The statistics for this query would have been:
n = 2 nscanned = 5 nscannedObjects = 2
- is related to
-
SERVER-22528 Allow limit to be pushed down to text stage in query plans
- Backlog