Type: Task
Resolution: Done
Affects Version/s: None
Component/s: None
@agorrod, @michaelcahill, @sueloverso:
Here's a writeup of the current situation with prefix compression on leaf pages.
Every key cell on a leaf page has an immediately trailing prefix-compression byte: even if prefix compression is configured off, the byte still appears, with a value of 0.
A few points:
- As Alex discovered in internal pages, instantiating a key has a significant cost (ignoring allocation overhead, instantiating a key requires 8B plus the total length of the key).
- The trade-off of instantiating the key vs. reading it from the page is not as straightforward as it was in internal pages: we only instantiate leaf-page keys if the key is at a "strategic position" or we encounter it as part of a random search.
By "strategic position", I mean the key lives at a binary-search choice point. When a leaf page is first read in, we fake a binary search of the page to find the places we'll hit as part of a random search, and instantiate those keys until we're guaranteed gaps on the page no larger than the configured "key_gap" value.
By "part of a random search", I want to emphasize we don't instantiate keys just because a cursor goes through the page, only if the key is encountered as part of a random search.
(An idea that won't work: it would be great if we could figure out, as part of page reconciliation, where the btree search choice points for a page are, and not prefix-compress the keys at those locations. But that doesn't work: by the time we figure out how many keys will live on a page, the page is already laid out, so this would at best make reconciliation a two-pass algorithm.)
- Obviously, we're wasting a byte per key if prefix-compression is configured off or prefix-compression isn't possible for two adjacent keys. (Although, the latter is unlikely: the probability we won't compress out at least 1 byte on any two adjacent keys in a large tree is effectively 0, there are only 256 possible values.)
A possible fix is to change things so we flag in each key cell's metadata whether or not there's a prefix-compression byte.
There are two interesting key cell types with respect to prefix compression: WT_CELL_KEY and WT_CELL_KEY_SHORT.
We use WT_CELL_KEY_SHORT when a key is less than 64B in length. In this case, the cell's metadata consists of a single bit, and 6 of the other 7 bits of the byte are the length of the key. (Ignore the 8th bit; it's used for "short data" objects.) WT_CELL_KEY is used for keys 64B and larger, and those cells are followed by a packed key length.
WT_CELL_KEY isn't hard to fix: create WT_CELL_KEY_PREFIX. Keys of type WT_CELL_KEY_PREFIX have a prefix-compression byte, WT_CELL_KEY keys do not; otherwise the two are identical, and we're done.
WT_CELL_KEY_SHORT is difficult: we don't have any spare bits to encode whether or not it's followed by a prefix-compression byte.
We could:
- Change WT_CELL_KEY_SHORT to only apply to keys 32B or less, and use the freed-up bit to flag whether the short key has a prefix-compression byte. That means we're now storing (packed) key lengths for keys between 32B and 64B. (On the other hand, those lengths will fit into a single byte.)
- Change WT_CELL_KEY_SHORT to imply no prefix-compression byte; in other words, sufficiently short keys always appear on the page without a prefix-compression byte. (The downside of this choice is that if prefix compression shortens the key significantly, to the point where we could store it as a short key with a prefix-compression byte, we can no longer do that. On data sets with long common prefixes, this choice will hurt.)
- Leave WT_CELL_KEY_SHORT as is, where it always implies a prefix-compression byte, or, perhaps create a hybrid, a short-key with length of more than N bytes has a prefix-compression byte, but one smaller than N bytes does not.
I don't see a reason to prefer any of these schemes: it's easy to imagine a workload where any one of the choices isn't optimal.
I do think we should add WT_CELL_KEY_PREFIX: we'll save a byte per key when prefix compression is turned off, it's a relatively simple change, and I don't see any downside. Note, however, this only helps keys that are large enough, >= 64B.
I don't have a sense of how many applications will turn off prefix compression. Squirreling around to figure out the key's real value is pretty expensive (and done inside the btree leaf-page search loop!), but that cost is amortized over the number of times the key is accessed.
I can't think of any way to trade-off prefix compression vs. anticipated instantiation cost: that implies some way of knowing the likelihood a key will be instantiated, and that's work-load dependent.
I don't know what to do with WT_CELL_KEY_SHORT: I'm looking for a convincing argument or a good idea here.