There's a reference to stream compression in Abadi's paper "Integrating Compression and Execution in Column-Oriented Database Systems", where they say:
4.5 Heavyweight Compression Schemes

Lempel-Ziv Encoding. Lempel-Ziv ([35, 36]) compression is the most widely used technique for lossless file compression. This is the algorithm upon which the UNIX command gzip is based. Lempel-Ziv takes variable sized patterns and replaces them with fixed length codes. This is in contrast to Huffman encoding which produces variable sized codes. Lempel-Ziv encoding does not require knowledge about pattern frequencies in advance; it builds the pattern table dynamically as it encodes the data. The basic idea is to parse the input sequence into non-overlapping blocks of different lengths while constructing a dictionary of blocks seen thus far. Subsequent appearances of these blocks are replaced by a pointer to an earlier occurrence of the same block. We refer the reader to [35, 36] for more details. For our experiments, we used a freely available version of the Lempel-Ziv algorithm [2] that is optimized for decompression performance (we found it to be much faster than UNIX gzip). We experimented with several other heavyweight compression schemes, including Huffman and Arithmetic encoding, but found that their decompression costs were prohibitively expensive for use inside of a database system.
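To make the mechanism they describe concrete, here's a toy LZ78-style encoder that parses the input into blocks, keeps a dictionary of blocks seen so far, and emits (dictionary-index, next-byte) pairs. This is an illustration of the dictionary idea only, nothing like what gzip or LZO actually ship:

    /* Toy LZ78-style encoder: parse the input into blocks, keep a dictionary
     * of blocks seen so far, and emit (dictionary-index, next-byte) pairs,
     * where index 0 means "no earlier block matched".  Illustration only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_ENTRIES 4096

    struct entry {
        char *block;            /* a previously seen block */
        size_t len;
    };

    int
    main(void)
    {
        const char *in = "abababcabababc";
        size_t in_len = strlen(in), pos = 0, dict_size = 0, i;
        struct entry dict[MAX_ENTRIES];

        while (pos < in_len) {
            /* Find the longest previously seen block that matches here. */
            size_t best = 0, best_len = 0;
            for (i = 0; i < dict_size; i++)
                if (dict[i].len > best_len && pos + dict[i].len < in_len &&
                    memcmp(in + pos, dict[i].block, dict[i].len) == 0) {
                    best = i + 1;
                    best_len = dict[i].len;
                }

            /* Emit a pointer to the earlier block plus the byte extending it. */
            printf("(%zu, '%c')\n", best, in[pos + best_len]);

            /* Add the extended block to the dictionary. */
            if (dict_size < MAX_ENTRIES) {
                dict[dict_size].len = best_len + 1;
                dict[dict_size].block = malloc(best_len + 1);
                memcpy(dict[dict_size].block, in + pos, best_len + 1);
                dict_size++;
            }
            pos += best_len + 1;
        }
        return (0);
    }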
Reference [2] is to the LZOP web page at [http://www.lzop.org/].
That page links to the Linux Journal article "Compression Tools Compared" http://www.linuxjournal.com/node/8051/print and to "Huge Unix File Compresser Shootout with Tons of Data/Graphs" http://aliver.wordpress.com/2010/06/22/huge-unix-file-compresser-shootout-with-tons-of-datagraphs/.
I think the tuning question is whether we want better compression in our "default" engine, versus speed of compression, versus speed of decompression.
In other words, are we using stream compression primarily to decrease our on-disk footprint and make any I/O more useful, or do we anticipate serving data sets that normally require disk I/O, so that decompression & compression speeds matter to us? (It's worth noting that in a read-mostly workload we'd do far more decompression than compression, so decompression speed would be more important than compression speed.)
There's also the point that we're going to be focused on benchmarks for a while, and perhaps we should assume that a typical benchmark, which does read from disk, is part of our design criteria.
Looking at the "Huge Unix File Compresser Shootout with Tons of Data/Graphs", I'm inclined to go with either lzop (relatively good compression with really fast compression/decompression) or lzma (the best compression on binary data, but with slow run-times).
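For a sense of how small the integration surface is on the lzop side, here's a rough round-trip sketch against minilzo (the embeddable subset of the liblzo2 library behind lzop). Buffer sizing follows the worst-case bound in the LZO documentation; any real integration would obviously look different:

    /* Compress/decompress round trip with the minilzo subset of liblzo2. */
    #include <stdio.h>
    #include <string.h>

    #include "minilzo.h"

    int
    main(void)
    {
        /* Scratch memory for the compressor, aligned per the LZO docs. */
        static lzo_align_t wrkmem[
            (LZO1X_1_MEM_COMPRESS + sizeof(lzo_align_t) - 1) /
            sizeof(lzo_align_t)];
        /* LZO can expand incompressible input slightly, hence the padding. */
        unsigned char in[4096], out[4096 + 4096 / 16 + 64 + 3], back[4096];
        lzo_uint in_len = sizeof(in), out_len, back_len;

        if (lzo_init() != LZO_E_OK)
            return (1);

        memset(in, 'x', sizeof(in));    /* trivially compressible input */

        if (lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) != LZO_E_OK)
            return (1);

        back_len = sizeof(back);
        if (lzo1x_decompress(out, out_len, back, &back_len, NULL) != LZO_E_OK)
            return (1);

        printf("%lu bytes in, %lu compressed, %lu back out\n",
            (unsigned long)in_len, (unsigned long)out_len,
            (unsigned long)back_len);
        return (0);
    }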
Depending on the difficulty of plugging in these engines, we could probably offer more than one by default: a recommended fast compression/decompression engine and a small-footprint engine, and you pick the one you like. (I'm saying that because I'm guessing that building and linking these engines behind WiredTiger isn't going to be a big deal for Don.)
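To be clear about what I mean by "plugging in", the seam could be as small as something like the following; these names are invented for illustration and are not an existing WiredTiger interface:

    /* Hypothetical shape of a pluggable compression engine -- names are
     * made up for illustration; this is not an existing WiredTiger API. */
    #include <stddef.h>

    typedef struct {
        const char *name;               /* e.g. "lzo", "lzma" */
        /* Compress src into dst; return 0 on success, set *result_lenp. */
        int (*compress)(const void *src, size_t src_len,
            void *dst, size_t dst_len, size_t *result_lenp);
        /* Decompress src into dst; return 0 on success, set *result_lenp. */
        int (*decompress)(const void *src, size_t src_len,
            void *dst, size_t dst_len, size_t *result_lenp);
    } COMPRESS_ENGINE;

    /* Both a fast engine and a small-footprint engine could be built in,
     * with the application choosing one by name at configuration time. */
    extern const COMPRESS_ENGINE engine_lzo;    /* fast compress/decompress */
    extern const COMPRESS_ENGINE engine_lzma;   /* best compression ratio */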
Michael, can you please weigh in on this?