mongodump is single-threaded and therefore very slow when dumping large databases.
The simplest way to parallelize mongodump is to assign one thread per collection, up to a limit that is either user-defined or arrived at adaptively. The limit matters because running too many threads could easily overload either the source server or the client machine.
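To illustrate the one-thread-per-collection idea with a cap, here is a rough sketch in Python rather than in mongodump's own code. The names dump_collections_in_parallel, dump_one and max_workers are hypothetical and are not part of mongodump. dump_one stands in for whatever routine dumps a single collection, and the cap here is a fixed user-supplied number, not an adaptive one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def dump_collections_in_parallel(
    collection_names: Iterable[str],
    dump_one: Callable[[str], int],
    max_workers: int = 4,
) -> Dict[str, int]:
    """Run dump_one(name) for every collection, at most max_workers at a time.

    dump_one is a hypothetical callback that dumps one collection and
    returns the number of documents it wrote.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")

    results: Dict[str, int] = {}
    # The pool size is the thread limit: extra collections wait in the queue
    # until a worker becomes free.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(dump_one, name): name for name in collection_names}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results
```

With this shape, a single very large collection still occupies one worker for the whole run, so the speedup is bounded by the largest collection.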
Going further, it may be possible to partition individual collections by range. This could be tricky in sharded clusters.
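As a rough sketch of what range partitioning could look like, the Python below turns a list of split points into contiguous half-open ranges and builds one query filter per range, so each range could be handed to its own worker. The helpers split_ranges and range_filter are hypothetical names, partitioning on _id is an assumption, and so is the premise that all values of that field are of one comparable type. How the split points are chosen is left open here.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

# A half-open range [lo, hi); None means "unbounded" on that side.
Range = Tuple[Optional[Any], Optional[Any]]


def split_ranges(split_points: Sequence[Any]) -> List[Range]:
    """Turn split points into contiguous, non-overlapping ranges."""
    points = sorted(set(split_points))
    bounds = [None, *points, None]
    return list(zip(bounds[:-1], bounds[1:]))


def range_filter(lo: Optional[Any], hi: Optional[Any], field: str = "_id") -> Dict[str, Any]:
    """Build a query filter matching lo <= field < hi."""
    cond: Dict[str, Any] = {}
    if lo is not None:
        cond["$gte"] = lo
    if hi is not None:
        cond["$lt"] = hi
    # With no bounds at all, an empty filter matches every document.
    return {field: cond} if cond else {}


if __name__ == "__main__":
    for lo, hi in split_ranges([100, 200]):
        print(range_filter(lo, hi))
```

Because every range is closed on the left and open on the right, each document falls into exactly one range, so no document is dumped twice or skipped as long as the field values do not change during the dump.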