On Wed, Sep 28, 2011 at 6:15 AM, Michael Cahill <mjc@wiredtiger.com> wrote:
> In our current design, there is a point during sync where the root page is written, and reads would be blocked while that one operation is in progress. We expect that operation to be fast, but it is difficult to quantify what "fast" means without knowing more about the data and the platform. It should not involve any physical I/O, for example: just in-memory operations. If that sounds like a problem, let us know and we'll discuss alternatives.
I took a look at the reconciliation code this morning, and I'm pretty sure we could make pages stay active during reconciliation. As you know, right now we lock the page down, re-write its contents to disk, then logically re-read the disk image and create a new in-memory page, if we want to keep the page in-memory after reconciliation (for example, during a sync, as opposed to a close). The change would be to make the reconciliation code just another reader of the page, that is, we can be happily reading, and even writing, the page during reconciliation.
I actually had reconciliation working this way at one point, but I removed it in order to simplify the problem and get rid of some bugs. There are two tricky parts and one design question. Tricky part #1 is that underlying objects are a problem. Imagine a page with key X and overflow value Y:
1. The first time we re-write the page, we don't have to do anything, the underlying object doesn't move, and there's just a new reference to it.
2. Then the application deletes X: the next re-write has to delete the underlying overflow value Y.
3. Then the application updates X to have a new value, and that value happens to be an overflow item: the next re-write has to create a new overflow object.
And so on, through all the possible stages a page can be in.
My guess is we'll have to add in-memory information for every overflow item that indicates its state, that is, whether it's currently on disk or not, and track that through the item's life-cycle. I didn't do that originally, and I can't remember why; looking at it now, it doesn't seem terribly difficult.
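To make the life-cycle above concrete, here is a minimal sketch of per-item overflow tracking. Everything in it (the `ovfl_track_t` struct, the `ovfl_*` functions, the action flags) is a hypothetical name invented for illustration, not WiredTiger's actual code, and it assumes a single thread; a real version would need to coordinate with concurrent readers and writers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actions a re-write may need for one overflow item. */
enum {
    OVFL_REUSE       = 1 << 0, /* reference the existing on-disk object */
    OVFL_DISCARD_OLD = 1 << 1, /* free the existing on-disk object */
    OVFL_CREATE_NEW  = 1 << 2  /* write a new overflow object */
};

/* Hypothetical in-memory state kept for each overflow item. */
typedef struct {
    bool on_disk;     /* an overflow object from an earlier write exists */
    bool cur_is_ovfl; /* the key's current value is an overflow value */
    bool cur_is_new;  /* the current value has not been written yet */
} ovfl_track_t;

/* The application deletes the key (or sets a non-overflow value). */
static void ovfl_app_delete(ovfl_track_t *t)
{
    t->cur_is_ovfl = false;
    t->cur_is_new = false;
}

/* The application stores a new overflow value for the key. */
static void ovfl_app_set_overflow(ovfl_track_t *t)
{
    t->cur_is_ovfl = true;
    t->cur_is_new = true;
}

/* Decide what the next re-write must do, and advance the item's state. */
static int ovfl_reconcile(ovfl_track_t *t)
{
    int actions = 0;

    assert(t != NULL);

    /* The old object is stale if the value was deleted or replaced. */
    if (t->on_disk && (!t->cur_is_ovfl || t->cur_is_new)) {
        actions |= OVFL_DISCARD_OLD;
        t->on_disk = false;
    }

    if (t->cur_is_ovfl && t->cur_is_new) {
        /* A new overflow value has to be written out. */
        actions |= OVFL_CREATE_NEW;
        t->on_disk = true;
        t->cur_is_new = false;
    } else if (t->cur_is_ovfl && t->on_disk) {
        /* Unchanged: the object doesn't move, just reference it again. */
        actions |= OVFL_REUSE;
    }
    return actions;
}
```

Starting from `{ on_disk = true, cur_is_ovfl = true, cur_is_new = false }`, the three numbered steps map to `OVFL_REUSE`, then `OVFL_DISCARD_OLD` after the delete, then `OVFL_CREATE_NEW` after the update. Replacing a value that is already on disk yields both discard and create in one re-write.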
Tricky part #2 is figuring out if the page is dirty. I don't think that's too hard – we just have to maintain a "disk-version" of the page, that is, the write-generation of the page immediately before we read the page to write its contents to disk.
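A rough sketch of that dirty check, assuming a per-page write-generation counter that every modification increments. The `page_gen_t` struct and `page_*` functions are hypothetical names for illustration only, and the counters are plain integers here; with concurrent writers they would presumably need atomic access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical generation counters for one in-memory page. */
typedef struct {
    uint64_t write_gen; /* bumped on every modification of the page */
    uint64_t disk_gen;  /* write_gen as of the last image written to disk */
} page_gen_t;

/* A writer modifies the page. */
static void page_modify(page_gen_t *p)
{
    p->write_gen++;
}

/* Snapshot the generation immediately before reconciliation reads the page. */
static uint64_t page_reconcile_begin(const page_gen_t *p)
{
    return p->write_gen;
}

/* After the write completes, record which generation the disk image reflects. */
static void page_reconcile_end(page_gen_t *p, uint64_t snap)
{
    assert(snap <= p->write_gen);
    p->disk_gen = snap;
}

/* The page is dirty if it changed after the snapshot that went to disk. */
static bool page_is_dirty(const page_gen_t *p)
{
    return p->write_gen != p->disk_gen;
}
```

The point of taking the snapshot before the read is that any update landing while reconciliation is in progress leaves `write_gen` ahead of `disk_gen`, so the page stays dirty and gets picked up by the next re-write.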
The design question is cleaning up big skiplists. Reconciliation is how we clean up the in-memory page. So, if we've inserted a large number of items into the page, at some point we need to split up the skiplists into Btree structures, and we probably don't want to maintain lots of information in skiplists over a long period.
Anyway, this change would allow readers and writers to do whatever they want, while the page is being reconciled – we'd only have to lock out readers and writers if we're logically re-reading the page in order to clean up the big skiplists.