We allocate a 16 MB buffer to hold the results of every "distinct" command, regardless of how large the result actually is. This can create a performance problem: each buffer is allocated directly from, and returned directly to, the tcmalloc pageheap, which is moderately expensive and is a serial bottleneck.
Also, tcmalloc decommits memory (returns it to the OS) at a rate more or less proportional to the rate at which memory is returned to the pageheap, so this high rate of returns to the pageheap can cause a high rate of decommitting and then recommitting memory, which is a very expensive operation.
We could consider pre-allocating a much smaller buffer here and allowing it to grow as needed.
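
As a rough illustration of that idea, here is a minimal sketch of a buffer that starts small and doubles on demand up to a fixed cap. The class name GrowableBuffer, its methods, and the 64-byte starting size are hypothetical and chosen only for the example; this is not the server's actual buffer builder, and the right initial size would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical growable byte buffer (illustration only, not the server's class).
// It starts at a small capacity and doubles as needed, so small results never
// trigger a large allocation.
class GrowableBuffer {
public:
    GrowableBuffer(std::size_t initialCapacity, std::size_t maxCapacity)
        : _data(new char[initialCapacity]),
          _size(0),
          _capacity(initialCapacity),
          _max(maxCapacity) {
        assert(initialCapacity > 0 && initialCapacity <= maxCapacity);
    }

    // Appends len bytes, growing the buffer first if they do not fit.
    void append(const void* src, std::size_t len) {
        if (len == 0) return;
        if (len > _max - _size) {
            throw std::length_error("result exceeds maximum buffer size");
        }
        ensureCapacity(_size + len);
        std::memcpy(_data.get() + _size, src, len);
        _size += len;
    }

    std::size_t size() const { return _size; }
    std::size_t capacity() const { return _capacity; }
    const char* data() const { return _data.get(); }

private:
    // Doubles the capacity until it covers `needed`, clamped to the maximum.
    // The caller guarantees needed <= _max, so the loop terminates.
    void ensureCapacity(std::size_t needed) {
        if (needed <= _capacity) return;
        std::size_t newCapacity = _capacity;
        while (newCapacity < needed) {
            newCapacity = (newCapacity > _max / 2) ? _max : newCapacity * 2;
        }
        std::unique_ptr<char[]> bigger(new char[newCapacity]);
        std::memcpy(bigger.get(), _data.get(), _size);
        _data = std::move(bigger);
        _capacity = newCapacity;
    }

    std::unique_ptr<char[]> _data;
    std::size_t _size;
    std::size_t _capacity;
    std::size_t _max;
};
```

With something like GrowableBuffer buf(64, 16 * 1024 * 1024), a small "distinct" result would stay in a small allocation that tcmalloc can typically serve from its per-thread caches instead of the pageheap, and only large results would pay for the bigger allocations. The trade-off is the extra copying each time the buffer grows.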