-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
5
-
Asparagus-StorEng - 2023-10-31
Recently as part of WT-11652 code was added to the __col_append_serial_func which included an __wt_atomic_cas64. I'd like to follow up and understand why that change was necessary, since the serial functions are the place in WiredTiger where insertions into a tree are protected by a lock, and so we want to ensure that lock is held for the least amount of time necessary.
The motivation for the change was:
The trouble here is that the btree could be accessed or updated by multiple threads (it's a property of the dhandle), so concurrent appends could leave it in a bad state
I believe there are other mechanisms in place that should prevent a problem with updating the last_recno. It is a combination of characteristics of the column store and the btree implementation.
For column stores - an append operation by definition is inserting a record to the right-most page in a tree. The key (recno) is assigned during the append operation itself as part of the serial function. If there are multiple threads appending to the tree at the same time, they will race to add an entry, but they should all be trying to append to the same page (otherwise they will not be inserting to the end of the tree, and we would see content inserted out-of-order because the last_recno defines the key, and is shared across all threads inserting, if an insertion happens on not-the-right-most page using the highest last_recno it would have a key larger than entries that should sort higher). We can explore the mechanisms by which the requirement holds true, but it's probably too much depth for the description here.
The next piece of the puzzle is that the functions in serial_inline.h are the critical path for adding content into trees. They take a lock on the page being updated to co-ordinate amongst concurrent insertions, which means that code inside the serial functions can generally assume that no other thread will be changing shared (page based) content.
The above two constraints give us:
- A pre-condition that all threads concurrently inserting into a column store will be adding their value to the same (right-most) page in a tree.
- Any code inside a serial function will be protected by a lock from other threads.
So, while the analysis that the btree->last_recno is shared amongst all access to the btree is correct, since column store appends are always to the same page, updates to the last_recno are protected by the WT_PAGE_LOCK that is acquired before entering col_append_serial_func.
I don't think the atomic_cas is necessary.
If it is necessary, the pattern we would generally use in such a critical performance path is to setup everything prior to acquiring the lock, attempt a CAS to put those changes into place and if it fails, release the lock and set everything up to try again. Spinning on a CAS while holding a lock is best avoided.
cc will.korteland@mongodb.com, ruby.chen@mongodb.com, luke.pearson@mongodb.com