- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 3.2.0-rc4
- Component/s: Text Search
- None
- Fully Compatible
- ALL
- QuInt D (12/14/15)
MongoDB 3.2.0 RC4 appears to have a substantial performance regression with full text searching
Test Data
3000 books obtained from Project Gutenberg (http://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html) stored in MongoDB as follows:
{ author : "Abraham Lincoln", title : "Letters", body : "<full text of book>" }
This data was then indexed using an "all fields" index:
db.books.createIndex( { "$**" : "text" } );
This produces a test dataset of about 1.1 GB with a text index of 155 MB (measured with WiredTiger).
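For anyone reproducing the dataset, the sizes quoted above can probably be checked from the collection statistics. The helper below is only a sketch and is not part of the original report: `summarizeSizes` is a hypothetical name, and it assumes the document returned by `db.books.stats()` carries the usual `size`, `storageSize` and `indexSizes` fields. Which of those fields the 1.1 GB figure corresponds to is not stated in this report.

```javascript
// Hypothetical helper (not from the original report): summarise the
// document returned by db.books.stats() in whole megabytes.
function summarizeSizes(stats) {
  var MB = 1024 * 1024;
  var indexes = {};
  // indexSizes maps each index name to its size in bytes.
  Object.keys(stats.indexSizes || {}).forEach(function (name) {
    indexes[name] = Math.round(stats.indexSizes[name] / MB);
  });
  return {
    dataMB: Math.round(stats.size / MB),           // uncompressed data size
    storageMB: Math.round(stats.storageSize / MB), // on-disk size
    indexMB: indexes                               // per-index sizes
  };
}

// In the mongo shell:
//   printjson(summarizeSizes(db.books.stats()));
```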
Test process
This data was loaded into several versions of MongoDB, and simple searches were run using words and phrases with different occurrence frequencies in the dataset. Each search used the following simple query shape in an aggregation pipeline, with the goal of reporting the number of books per author that contain the search term:
db.books.aggregate([ { $match : { $text : { $search : "house" } } }, { $group : { _id : { author : "$author" }, count : { $sum : 1 } } }, { $sort : { count : -1 } } ]);
The words used are as follows:
- slaveholder
- hound
- "gigantic hound"
- cheese
- house
A simple test script ("testQuery_all.js") is attached to automate this process.
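The attached script is not reproduced here. As a rough illustration of the process described above, a timing harness for the mongo shell might look like the sketch below. The names `SEARCH_TERMS`, `buildPipeline` and `timeSearches` are hypothetical, the escaped quotes around "gigantic hound" assume that a phrase search was intended, and only the duration of the last run is kept so that the figure reflects warm data.

```javascript
// Hypothetical harness; NOT the attached testQuery_all.js.
var SEARCH_TERMS = [
  "slaveholder",
  "hound",
  "\"gigantic hound\"", // assumption: quoted so $text treats it as a phrase
  "cheese",
  "house"
];

// Build the query shape from this report for a given search term.
function buildPipeline(term) {
  return [
    { $match: { $text: { $search: term } } },
    { $group: { _id: { author: "$author" }, count: { $sum: 1 } } },
    { $sort: { count: -1 } }
  ];
}

// Run every term `runs` times and keep the wall-clock time of the last run.
function timeSearches(coll, terms, runs) {
  var results = [];
  terms.forEach(function (term) {
    var lastMs = 0;
    for (var i = 0; i < runs; i++) {
      var start = Date.now();
      coll.aggregate(buildPipeline(term)).toArray(); // drain the cursor
      lastMs = Date.now() - start;
    }
    results.push({ term: term, ms: lastMs });
  });
  return results;
}

// In the mongo shell:
//   timeSearches(db.books, SEARCH_TERMS, 3).forEach(function (r) {
//     print(r.term + ": " + r.ms + " ms");
//   });
```

Passing the collection in as an argument keeps the functions usable against any database name and lets them be exercised with a stub object when no server is available.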
Test results
All of these results were taken from the third run, so that the data was as warm as possible. In the 3.2 results, mongod kept one core fully busy for the entire query duration.
Version | Engine | Total Query Duration (ms) |
---|---|---|
2.6.11 | MMAPv1 | 5308 |
3.0.7 | MMAPv1 | 5306 |
3.0.7 | WT Snappy | 6625 |
3.2.0 RC4 | MMAPv1 | 26157 |
3.2.0 RC4 | WT Snappy | 639862 |
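
To put the table in relative terms, the durations can be divided by a baseline for the same storage engine. The snippet below is a small sketch of that arithmetic; taking 3.0.7 as the baseline is an assumption made here for illustration, and `slowdownVs` is a hypothetical helper name.

```javascript
// Durations (ms) copied from the table above.
var results = [
  { version: "2.6.11",    engine: "MMAPv1",    ms: 5308 },
  { version: "3.0.7",     engine: "MMAPv1",    ms: 5306 },
  { version: "3.0.7",     engine: "WT Snappy", ms: 6625 },
  { version: "3.2.0 RC4", engine: "MMAPv1",    ms: 26157 },
  { version: "3.2.0 RC4", engine: "WT Snappy", ms: 639862 }
];

// Ratio of each row to the baseline version on the same engine.
// Rows whose engine has no baseline measurement are skipped.
function slowdownVs(rows, baselineVersion) {
  var baseline = {};
  rows.forEach(function (r) {
    if (r.version === baselineVersion) baseline[r.engine] = r.ms;
  });
  return rows
    .filter(function (r) { return baseline[r.engine] !== undefined; })
    .map(function (r) {
      return {
        version: r.version,
        engine: r.engine,
        ratio: r.ms / baseline[r.engine]
      };
    });
}

slowdownVs(results, "3.0.7").forEach(function (r) {
  console.log(r.version + " / " + r.engine + ": " + r.ratio.toFixed(2) + "x");
});
```

With these figures, 3.2.0 RC4 comes out at roughly 4.9 times slower than 3.0.7 on MMAPv1 and roughly 96.6 times slower on WiredTiger with Snappy. In the mongo shell, `print` can be used in place of `console.log`.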
Full results are available here:
https://goo.gl/s4pU9j
Source data is here:
https://dl.dropboxusercontent.com/u/6076108/books.bson.gz
Note: the text index must be created manually on this data:
db.books.createIndex( { "$**" : "text" } );
- is related to
  - SERVER-19936 Performance pass on unicode-aware text processing logic (text index v3) (Closed)