Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.1.6
Component/s: Text Search
Labels:
- qi-text-search
- query-44-grooming

Assigned Teams:

Query Integration
Backwards Compatibility:
Fully Compatible

When doing a text search with phrase matching in text index v3, every phrase in the query is converted from UTF-8 to UTF-32 again for each document the search checks.

For example, if your search query is " \"hello world\" ", and the initial index scan returns 25000 documents, the string "hello world" will be converted from UTF-8 to UTF-32 25000 times.

This happens because the existing FTSPhraseMatcher interface is stateless and takes both the phrase and the haystack as const std::string& arguments on every call. This was not a problem for the original non-Unicode phrase matcher, since it did not manipulate its input, but the Unicode phrase matcher converts both the phrase and the haystack to a unicode::String each time.

To fix this, the FTSPhraseMatcher interface should be refactored so that it has state and two functions: setPhrase(const std::string& phrase) and phraseMatches(const std::string& haystack, Options options). This way, the Unicode phrase matcher implementation can convert the phrase from a UTF-8 std::string to a UTF-32 unicode::String just once, and can keep reusable buffers. The Unicode phrase matcher should also stop using unicode::String::substrMatch, because it makes excessive copies and allocations. It should implement substring matching itself using the toBuf methods in unicode::String. In fact, String::substrMatch should be removed entirely after this fix.
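A minimal sketch of the stateful shape proposed above, for illustration only. It does not use the server's unicode::String: std::u32string and the decodeUtf8 helper are stand-ins, the class name StatefulPhraseMatcher is hypothetical, and the Options argument (case and diacritic sensitivity) is left out. The point it shows is that the phrase is decoded once in setPhrase(), while phraseMatches() decodes only the haystack into a buffer that is reused across documents.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the server's UTF-8 -> UTF-32 conversion.
// Assumes well-formed UTF-8; reuses the capacity already held by `out`.
void decodeUtf8(const std::string& in, std::u32string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len =
            c < 0x80 ? 1 : (c >> 5) == 0x6 ? 2 : (c >> 4) == 0xE ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? c : c & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
}

// Hypothetical stateful matcher: not the actual FTSPhraseMatcher.
class StatefulPhraseMatcher {
public:
    // Called once per query phrase: the only place the phrase is decoded.
    void setPhrase(const std::string& phrase) {
        decodeUtf8(phrase, _phrase);
    }

    // Called once per candidate document: decodes only the haystack.
    bool phraseMatches(const std::string& haystack) {
        if (_phrase.empty()) {
            return true;
        }
        decodeUtf8(haystack, _haystackBuf);
        // Plain code-point search over the buffers, no temporary copies.
        return std::search(_haystackBuf.begin(), _haystackBuf.end(),
                           _phrase.begin(), _phrase.end()) != _haystackBuf.end();
    }

private:
    std::u32string _phrase;       // decoded once in setPhrase()
    std::u32string _haystackBuf;  // reused for every document
};
```

With this shape, a query that scans 25000 documents would call setPhrase("hello world") once and phraseMatches() 25000 times, so the phrase conversion no longer scales with the number of documents.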

Since FTSPhraseMatcher implementations can have state after this change, FTSLanguage's getPhraseMatcher() should also be renamed to createPhraseMatcher() and adopt functionality similar to createTokenizer().
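The rename can be pictured with the following sketch, under the assumption that createPhraseMatcher() hands each caller its own matcher instance through a std::unique_ptr. The PhraseMatcher, BasicPhraseMatcher, and Language types here are hypothetical placeholders, not the server's FTSPhraseMatcher or FTSLanguage classes, and the matcher does a simple byte-wise search only to keep the example short.

```cpp
#include <memory>
#include <string>

// Hypothetical stateful interface, mirroring the two proposed functions.
class PhraseMatcher {
public:
    virtual ~PhraseMatcher() = default;
    virtual void setPhrase(const std::string& phrase) = 0;
    virtual bool phraseMatches(const std::string& haystack) const = 0;
};

// Placeholder implementation: byte-wise substring search.
class BasicPhraseMatcher final : public PhraseMatcher {
public:
    void setPhrase(const std::string& phrase) override {
        _phrase = phrase;
    }
    bool phraseMatches(const std::string& haystack) const override {
        return haystack.find(_phrase) != std::string::npos;
    }

private:
    std::string _phrase;  // per-instance state, so instances cannot be shared
};

// Hypothetical language object: "create" signals that the caller owns
// a fresh matcher, instead of borrowing a shared stateless one.
class Language {
public:
    std::unique_ptr<PhraseMatcher> createPhraseMatcher() const {
        return std::make_unique<BasicPhraseMatcher>();
    }
};
```

Because each matcher holds its own phrase, two concurrent queries would each create their own instance instead of sharing one, which is the reason a "get" accessor returning a shared object would no longer be safe.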

Assignee: [DO NOT USE] Backlog - Query Integration

Reporter: Adam Chelminski (Inactive)

Participants: [DO NOT USE] Backlog - Query Integration, Adam Chelminski

Votes: 1

Watchers: 6

Created: Aug 14 2015 05:16:30 PM UTC

Updated: Dec 28 2023 06:28:24 PM UTC
