Type: Task
Resolution: Done
Fix Version/s: WT1.0
Affects Version/s: None
Component/s: None
Labels:
- Bug

We need to review how we define page sizes in WiredTiger, and write some documentation on it for tuning.

Here's the situation right now – you can specify 5 things:

an initial (minimum) page size, set separately for leaf and internal pages,
a maximum page size, also set separately for leaf and internal pages, and
an allocation unit, that is, the block size we use for allocating from the underlying file.

Here's what these 5 sizes do:

The allocation unit is the smallest piece we'll allocate from the underlying file. So, when you pick an allocation unit, you're saying how big the file can get, and you're deciding how much space gets wasted, on average, for an overflow item. The default is a 512B allocation unit, and since we use 32-bit block offsets, you can create a file up to 2TB (2^9 * 2^32). The maximum allocation unit is 512MB, which allows files up to 2EB. Obviously, if you have 512MB allocation units, overflow items could waste a big chunk of space (if you ever had one).
The maximum page sizes tell us when we're going to split during page reconciliation. When an in-memory page is reconciled, we allocate a maximum page size chunk of memory, and then we reconcile the in-memory page into it. When we fill that maximum page size, it causes a split of the page, and a new page is inserted into the tree.
The minimum (initial) page size is largely unused; the only thing we use it for is to figure out the overflow size, that is, the largest item we keep on-page. It's easier to show you that code than to explain it:

        /*
         * Internal pages are also usually small, we want it to fit into the
         * L1 cache.   We try and put at least 40 keys on each internal page
         * (40 because that results in 100M keys in a level 5 Btree).  But,
         * if it's a small page, push anything bigger than about 50 bytes
         * off-page.   Here's the table:
         *      Pagesize        Largest key retained on-page:
         *      512B             50 bytes
         *      1K               50 bytes
         *      2K               51 bytes
         *      4K              102 bytes
         *      8K              204 bytes
         * and so on, roughly doubling for each power-of-two.
         */
        btree->intlitemsize = btree->intlmin <= 1024 ? 50 : btree->intlmin / 40;

        /*
         * Leaf pages are larger to amortize I/O across a large chunk of the
         * data space.  We only require 20 key/data pairs fit onto a leaf page.
         * Again, if it's a small page, push anything bigger than about 80
         * bytes off-page.  Here's the table:
         *      Pagesize        Largest key or data item retained on-page:
         *      512B             80 bytes
         *       1K              80 bytes
         *       2K              80 bytes
         *       4K              80 bytes
         *       8K             409 bytes
         *      16K             819 bytes
         * and so on, roughly doubling for each power-of-two.
         */
        btree->leafitemsize = btree->leafmin <= 4096 ? 80 : btree->leafmin / 20;

In other words, we take the minimum page sizes, hit them with a guess at how deep we want a tree to go, and that determines the overflow size.
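
To make the numbers easy to check, here is a minimal sketch that evaluates the two expressions exactly as quoted above. The helper names `intl_item_size` and `leaf_item_size` are hypothetical stand-ins for the assignments to `btree->intlitemsize` and `btree->leafitemsize`; nothing else about the real code is assumed.

```c
#include <stdint.h>

/*
 * intl_item_size --
 *	Hypothetical helper mirroring the quoted intlitemsize expression:
 *	largest key kept on an internal page for a given minimum page size.
 */
static uint32_t
intl_item_size(uint32_t intlmin)
{
	return (intlmin <= 1024 ? 50 : intlmin / 40);
}

/*
 * leaf_item_size --
 *	Hypothetical helper mirroring the quoted leafitemsize expression:
 *	largest key or data item kept on a leaf page for a given minimum
 *	page size.
 */
static uint32_t
leaf_item_size(uint32_t leafmin)
{
	return (leafmin <= 4096 ? 80 : leafmin / 20);
}
```

For example, `intl_item_size(4096)` is 4096 / 40 = 102 bytes, and `leaf_item_size(8192)` is 8192 / 20 = 409 bytes; anything larger than that would go off-page as an overflow item.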

I've been thinking it would be better to replace the minimum page sizes with explicit overflow sizes. I think that will be easier to talk about and understand for tuning purposes.

In that design, here are the 5 knobs and what they mean:

1. allocation unit: the unit of allocation from the file; if you keep it small, the maximum file size is limited, but you don't waste as much room on overflow items (unchanged from before)
2. maximum leaf page size: the size at which we split leaf pages, that is, no leaf page grows larger than this (unchanged)
3. maximum internal page size: the size at which we split internal pages, that is, no internal page grows larger than this (unchanged)
4. internal overflow size: any key that's larger than this size gets stored as an overflow item
5. leaf overflow size: any key or data item that's larger than this size gets stored as an overflow item
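
The trade-off behind knob 1 is just arithmetic, so here is a rough sketch of it. It assumes the 32-bit block offsets mentioned above and ignores any per-block header or other overhead; the function names `max_file_size` and `overflow_waste` are made up for illustration and are not WiredTiger functions.

```c
#include <stdint.h>

/* Width of a block offset, per the description above. */
#define	BLOCK_OFFSET_BITS	32

/*
 * max_file_size --
 *	Largest file, in bytes, addressable with the given allocation unit:
 *	allocation_unit * 2^32.
 */
static uint64_t
max_file_size(uint64_t allocation_unit)
{
	return (allocation_unit << BLOCK_OFFSET_BITS);
}

/*
 * overflow_waste --
 *	Bytes left unused when an overflow item of item_size bytes is rounded
 *	up to a whole number of allocation units (overhead ignored).
 */
static uint64_t
overflow_waste(uint64_t item_size, uint64_t allocation_unit)
{
	uint64_t remainder;

	remainder = item_size % allocation_unit;
	return (remainder == 0 ? 0 : allocation_unit - remainder);
}
```

With a 512B allocation unit, `max_file_size(512)` is 2^41 bytes (2TB) and a 1000-byte overflow item wastes 24 bytes; with a 512MB unit the limit rises to 2^61 bytes (2EB), but the same item would leave nearly the whole 512MB unit unused.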

We'll leave the code that figures out an overflow item size as it currently is; if the application doesn't specify an overflow item size, that code will give us a number to use.
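
A minimal sketch of that fallback, under some loud assumptions: the struct `page_config`, its field names, and the convention that 0 means "not specified" are all hypothetical, and the page size fed to the existing heuristic is left as a plain input because the description doesn't say which size it would use once the minimum sizes are gone.

```c
#include <stdint.h>

/* Hypothetical configuration; an overflow size of 0 means "not specified". */
struct page_config {
	uint32_t intl_page_size;	/* input to the existing heuristic */
	uint32_t leaf_page_size;	/* input to the existing heuristic */
	uint32_t intl_overflow;		/* explicit internal overflow size, or 0 */
	uint32_t leaf_overflow;		/* explicit leaf overflow size, or 0 */
};

/*
 * effective_intl_overflow --
 *	Use the explicit internal overflow size if given, otherwise fall back
 *	to the existing heuristic.
 */
static uint32_t
effective_intl_overflow(const struct page_config *cfg)
{
	if (cfg->intl_overflow != 0)
		return (cfg->intl_overflow);
	return (cfg->intl_page_size <= 1024 ? 50 : cfg->intl_page_size / 40);
}

/*
 * effective_leaf_overflow --
 *	Use the explicit leaf overflow size if given, otherwise fall back to
 *	the existing heuristic.
 */
static uint32_t
effective_leaf_overflow(const struct page_config *cfg)
{
	if (cfg->leaf_overflow != 0)
		return (cfg->leaf_overflow);
	return (cfg->leaf_page_size <= 4096 ? 80 : cfg->leaf_page_size / 20);
}
```

The point of the shape is that an application that never touches knobs 4 and 5 sees the same overflow behavior it gets today.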

Assignee: Keith Bostic (Inactive)

Reporter: WiredTiger

Votes: 0

Watchers: 0

Created: Jan 31 2012 05:29:43 PM UTC

Updated: Apr 16 2015 06:31:42 PM UTC

Resolved: Apr 09 2015 01:06:15 AM UTC
