- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: 3.1.6
- Component/s: Text Search
- Query Integration
- Fully Compatible
When doing a text search with phrase matching in text index v3, every phrase in the query is converted from UTF-8 to UTF-32 again for each document that is checked.
For example, if your search query is " \"hello world\" ", and the initial index scan returns 25000 documents, the string "hello world" will be converted from UTF-8 to UTF-32 25000 times.
This happens because the existing FTSPhraseMatcher interface is stateless and takes both the phrase and the haystack as const std::string& arguments on every call. That was not a problem for the non-Unicode phrase matcher, which does not manipulate its input, but the Unicode phrase matcher converts both the phrase and the haystack to a unicode::String each time.
To fix this, the FTSPhraseMatcher interface should be refactored so that it has state and two functions: setPhrase(const std::string& phrase) and phraseMatches(const std::string& haystack, Options options). This way, the Unicode phrase matcher implementation can convert the phrase from a UTF-8 std::string to a UTF-32 unicode::String just once and keep reusable buffers. The Unicode phrase matcher should also stop using unicode::String::substrMatch, because it makes excessive copies and allocations. It should implement substring matching itself using the toBuf methods in unicode::String. (In fact, String::substrMatch should be removed entirely after this fix.)
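As a rough illustration of the proposed shape, here is a minimal, standalone sketch of a stateful matcher. It is not the server code: StatefulPhraseMatcher and decodeUtf8 are hypothetical names, std::u32string stands in for unicode::String, the Options argument is left out, and case folding, diacritic handling and word-boundary checks are ignored. The point is only that the phrase is decoded once in setPhrase() and that the haystack buffer is reused between calls.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points, writing into
// `out` so that its capacity can be reused across calls.
// Assumes well-formed input; no validation is performed.
void decodeUtf8(const std::string& in, std::u32string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80) {
            cp = lead;
            len = 1;
        } else if ((lead >> 5) == 0x6) {
            cp = lead & 0x1F;
            len = 2;
        } else if ((lead >> 4) == 0xE) {
            cp = lead & 0x0F;
            len = 3;
        } else {
            cp = lead & 0x07;
            len = 4;
        }
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
}

// Hypothetical stateful matcher: the phrase is converted once, then matched
// against many haystacks.
class StatefulPhraseMatcher {
public:
    // Converts the phrase from UTF-8 to UTF-32 a single time.
    void setPhrase(const std::string& phrase) {
        decodeUtf8(phrase, _phrase);
    }

    // Converts only the haystack; the decoded phrase is reused as-is.
    bool phraseMatches(const std::string& haystack) {
        decodeUtf8(haystack, _haystack);  // reuses the buffer's capacity
        return _haystack.find(_phrase) != std::u32string::npos;
    }

private:
    std::u32string _phrase;
    std::u32string _haystack;
};
```

With this shape, a query such as " \"hello world\" " would call setPhrase("hello world") once and then phraseMatches(...) for each candidate document, so only the haystack is converted per document.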
Since FTSPhraseMatcher implementations can have state after this change, FTSLanguage's getPhraseMatcher() should also be renamed to createPhraseMatcher() and adopt functionality similar to createTokenizer().
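The rename could look roughly like the sketch below. It assumes that "similar to createTokenizer()" means returning a freshly constructed, caller-owned object per call; that reading is an assumption, not something stated in this ticket. PhraseMatcher, BasicPhraseMatcher and Language are hypothetical stand-ins for FTSPhraseMatcher, one of its implementations, and FTSLanguage, and the matching itself is a plain byte-wise search to keep the example short.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the refactored, stateful matcher interface.
class PhraseMatcher {
public:
    virtual ~PhraseMatcher() = default;
    virtual void setPhrase(const std::string& phrase) = 0;
    virtual bool phraseMatches(const std::string& haystack) const = 0;
};

// Hypothetical implementation that does a byte-wise substring search.
class BasicPhraseMatcher final : public PhraseMatcher {
public:
    void setPhrase(const std::string& phrase) override {
        _phrase = phrase;
    }

    bool phraseMatches(const std::string& haystack) const override {
        return haystack.find(_phrase) != std::string::npos;
    }

private:
    std::string _phrase;
};

// Hypothetical stand-in for the language object. Because matchers now hold
// per-query state, each call hands out a new instance owned by the caller
// instead of a reference to a shared, stateless one.
class Language {
public:
    std::unique_ptr<PhraseMatcher> createPhraseMatcher() const {
        return std::make_unique<BasicPhraseMatcher>();
    }
};
```

Handing out a new object per call means two queries running at the same time cannot overwrite each other's phrase, which a shared instance returned by a getter would not guarantee once the matcher carries state.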