ISSUE SUMMARY
The mongoS routing of scatter/gather operations iterates over the boundaries for the collection’s chunks sorted in ascending order. As the list is iterated, shards that own a chunk for the collection are discovered. If all shards in the cluster that own a chunk for the collection are discovered partway through the iteration, the iteration early-exits and the mongoS targets all of these shards.
This logic is suboptimal if a large number of the sorted chunks are initially owned by a subset of shards. This can result in a large number of iterations needed to discover all shards. As an example, on a 2-shard cluster with a collection where the first 100k sorted chunks are owned on shard 1, the mongoS iterates over the boundaries for 100k chunks when routing a scatter/gather operation.
USER IMPACT
On a sharded collection where a large number of the sorted chunks (e.g. 100k) are initially owned by a subset of shards, a scatter/gather operation can display increased latency (e.g. milliseconds) or increased mongoS CPU utilization.
This chunk distribution can result from adding a shard to the cluster, as contiguous low chunk ranges get migrated into the new shard.
WORKAROUND
Rearrange the chunks in a collection so that the initial chunks in the sorted list discover all the shards. The script below can be used to verify the number of iterations required to route a scatter/gather operation.
var ns = 'mydb.mycoll' var nShards = db.getSiblingDB('config').shards.count(); var count = 0; var shards = []; print('Iterating ' + ns + ' chunks sorted in ascending order...') var cursor = db.getSiblingDB('config').chunks.aggregate([{$match : {ns: ns}}, {$sort : { min : 1 }}]); while (cursor.hasNext() && shards.length != nShards) { let nextShard = cursor.next().shard; if (!shards.includes(nextShard)) { shards.push(nextShard); print(" discovered shard " + nextShard + " at chunk iteration " + count) }; ++count; } print('Done. ' + count + ' chunk iterations to route a scatter/gather operation.')
- is duplicated by
-
SERVER-47222 Mongos high cpu usage on getShardIdsForRange while dealing shard key range query
- Closed