-
Type: Improvement
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Sharding
-
None
In a sharded cluster it is known that counts do not return the correct values. If there are orphaned documents then even a count with a query will return an inflated value.
It is possible to get around this by using the (nigh undocumented) itcount() function on a cursor. This is very slow however.
A faster solution would be to use the explain function on a find. The code in explain already exists to provide the correct count and in our tests it provides this count through explain four times faster than itcount(). A much more detailed explanation of the problem and the proposed solution to follow :
I am using our current production system, where this discreptancy was identified, as the example case here. All of these examples are run on the same sharded cluster which is currently in production.
In a sharded collection where there exists orphaned documents a count returns the following
mongos> db["message.blast.20140617"].find({_id : {$regex : "^2749336"}}).count() 1001904
However if you run through the find cursor and count every document you get 999999, this is obviously the correct number. Running an explain on the find cursor gets the following:
mongos> db["message.blast.20140617"].find({_id : {$regex : "^2749336"}}).explain() { "clusteredType" : "ParallelSort", "shards" : { "messageA/HOSTNAME:27017,HOSTNAME:27017" : [ { "cursor" : "BtreeCursor _id_ multi", "isMultiKey" : false, "n" : 344267, "nscannedObjects" : 344267, "nscanned" : 344268, "nscannedObjectsAllPlans" : 344267, "nscannedAllPlans" : 344268, "scanAndOrder" : false, "indexOnly" : false, "nYields" : 1406, "nChunkSkips" : 0, "millis" : 4235, "indexBounds" : { "_id" : [ [ "2749336", "2749337" ], [ /^2749336/, /^2749336/ ] ] }, "server" : "HOSTNAME:27017" } ], "messageB/HOSTNAME:27017,HOSTNAME:27017" : [ { "cursor" : "BtreeCursor _id_ multi", "isMultiKey" : false, "n" : 337542, "nscannedObjects" : 339447, "nscanned" : 339448, "nscannedObjectsAllPlans" : 339447, "nscannedAllPlans" : 339448, "scanAndOrder" : false, "indexOnly" : false, "nYields" : 312, "nChunkSkips" : 1905, "millis" : 2197, "indexBounds" : { "_id" : [ [ "2749336", "2749337" ], [ /^2749336/, /^2749336/ ] ] }, "server" : "HOSTNAME:27017" } ], "messageC/HOSTNAME:27017,HOSTNAME:27017" : [ { "cursor" : "BtreeCursor _id_ multi", "isMultiKey" : false, "n" : 318190, "nscannedObjects" : 318190, "nscanned" : 318191, "nscannedObjectsAllPlans" : 318190, "nscannedAllPlans" : 318191, "scanAndOrder" : false, "indexOnly" : false, "nYields" : 201, "nChunkSkips" : 0, "millis" : 1904, "indexBounds" : { "_id" : [ [ "2749336", "2749337" ], [ /^2749336/, /^2749336/ ] ] }, "server" : "HOSTNAME:27017" } ] }, "cursor" : "BtreeCursor _id_ multi", "n" : 999999, "nChunkSkips" : 1905, "nYields" : 1919, "nscanned" : 1001907, "nscannedAllPlans" : 1001907, "nscannedObjects" : 1001904, "nscannedObjectsAllPlans" : 1001904, "millisShardTotal" : 8336, "millisShardAvg" : 2778, "numQueries" : 3, "numShards" : 3, "millis" : 4236 }
The explain has the results that we want, namely
"n" : 999999 "nChunkSkips" : 1905
n is the correct count, and if one adds the n and the nChunkSkips you get the value that the count replies. The count does not exclude these orphaned documents causing an inflated count. This can be gotten around by using itcount() on the find cursor, this was timed versus how long it takes for an explain to return, as follows:
$ time mongo blast --port 17018 --eval 'db["message.blast.20140617"].find({_id : {$regex : "^2749336"}}).itcount();'
MongoDB shell version: 2.4.8
connecting to: 127.0.0.1:17018/blast
999999
real 0m17.725s
user 0m5.248s
sys 0m0.060s
$ time mongo blast --port 17018 --eval 'db["message.blast.20140617"].find({_id : {$regex : "^2749336"}}).explain().n;'
MongoDB shell version: 2.4.8
connecting to: 127.0.0.1:17018/blast
999999
real 0m3.353s
user 0m0.020s
sys 0m0.008s
while a count currently runs faster than any of these but it returns the wrong value
$ time mongo blast --port 17018 --eval 'db["message.blast.20140617"].find({_id : {$regex : "^2749336"}}).count();'
MongoDB shell version: 2.4.8
connecting to: 127.0.0.1:17018/blast
1001904
real 0m0.952s
user 0m0.024s
sys 0m0.008s
A faster way to implement count on a sharded cluster that would return the correct value is to use the same code that is in explain to return the n and use that to return the count if the count has a query.
Solution : For sharded clusters any count that is implemented with a query should return the same value, in the same method, that explain().n does. This will be faster than iterating through the cursor as well as providing correct results during migrations and in the presance of orphaned documents.
- duplicates
-
SERVER-3645 Sharded collection counts (on primary) can report too many results
- Closed