When reading data that suffers from high network and I/O latency, a common technique to improve throughput is to parallelize the workload.
For mongodump, parallelizing a single collection scan could work as follows (a rough Go sketch appears after the list):
- $sample the collection and record a set of _id values to use as split points.
- Sort the sampled _id values and add [start _id, end _id] pairs to a work queue.
- Have a configurable number of worker threads pop work items, perform range queries, and write the results.
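As a rough illustration only (not mongodump's actual implementation), here is a minimal Go sketch of the three steps above using the mongo-go-driver. The connection string, the test.mycoll namespace, and the numSplits/numWorkers knobs are all hypothetical, and it assumes _id values are ObjectIDs so they can be sorted bytewise:

```go
// Hypothetical sketch, not mongodump code. Assumes ObjectID _ids and a
// made-up test.mycoll collection; numSplits and numWorkers are invented knobs.
package main

import (
	"bytes"
	"context"
	"fmt"
	"log"
	"sort"
	"sync"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/bson/primitive"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// idRange is one work item: dump documents with start <= _id < end.
// A nil bound leaves that side of the range open.
type idRange struct {
	start, end *primitive.ObjectID
}

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)
	coll := client.Database("test").Collection("mycoll")

	// Step 1: $sample the collection to pick split points.
	const numSplits = 15
	cur, err := coll.Aggregate(ctx, mongo.Pipeline{
		{{Key: "$sample", Value: bson.D{{Key: "size", Value: numSplits}}}},
		{{Key: "$project", Value: bson.D{{Key: "_id", Value: 1}}}},
	})
	if err != nil {
		log.Fatal(err)
	}
	var ids []primitive.ObjectID
	for cur.Next(ctx) {
		var doc struct {
			ID primitive.ObjectID `bson:"_id"`
		}
		if err := cur.Decode(&doc); err != nil {
			log.Fatal(err)
		}
		ids = append(ids, doc.ID)
	}

	// Step 2: sort the sampled _ids and enqueue [start, end) ranges,
	// with open-ended ranges at both extremes.
	sort.Slice(ids, func(i, j int) bool { return bytes.Compare(ids[i][:], ids[j][:]) < 0 })
	work := make(chan idRange, len(ids)+1)
	var prev *primitive.ObjectID
	for i := range ids {
		work <- idRange{start: prev, end: &ids[i]}
		prev = &ids[i]
	}
	work <- idRange{start: prev}
	close(work)

	// Step 3: worker threads pop work items and run range queries.
	const numWorkers = 4
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for r := range work {
				idCond := bson.M{}
				if r.start != nil {
					idCond["$gte"] = *r.start
				}
				if r.end != nil {
					idCond["$lt"] = *r.end
				}
				filter := bson.M{}
				if len(idCond) > 0 {
					filter["_id"] = idCond
				}
				rc, err := coll.Find(ctx, filter)
				if err != nil {
					log.Fatal(err)
				}
				n := 0
				for rc.Next(ctx) {
					n++ // a real dump would write rc.Current to the output
				}
				fmt.Printf("worker %d dumped %d docs\n", w, n)
			}
		}(w)
	}
	wg.Wait()
}
```

A real implementation would also have to cope with collections whose _id values are of mixed BSON types, which compare by BSON type order rather than bytewise.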
There are trade-offs to this algorithm: sequential scans are replaced with random-access lookups, so I wouldn't recommend making this the default.
Additionally, the feature is less valuable when mongodump is already distributing work by scanning multiple collections in parallel. It's not clear to me if/how this should interact with the existing --numParallelCollections option.
Is related to: TOOLS-3047 Document-level parallelisation for mongorestore (Accepted)