Saw a core dump last night, when running against the WT-638 branch – it might be connected to that branch, but I don't think so.
(gdb) where #0 0x000000000044dc67 in __bit_set ( bitf=0x8044222a8 "wivk??e?WoVn???}\226???\227Uf_?gz\233k?\233?ۦ宿g????f?{zw????m?\236?\231?yZo?\225???ݷiyj??i?v~??U?V????f\233\236\237?i??i???ݥ\225\225w?w?u?\237???k?vo?o??ޯ??\236?W??W??^Y??\237??\225?\177[\177\235?ݾy???w??[knZ??^?ߺ]\231\233{\225\231????[g~\231?o~u}??ݺ?\226?????????"..., bit=18446744073709551612) at bitstring.i:99 WT-1 0x000000000044dbef in __bit_setv ( bitf=0x8044222a8 "wivk??e?WoVn???}\226???\227Uf_?gz\233k?\233?ۦ宿g????f?{zw????m?\236?\231?yZo?\225???ݷiyj??i?v~??U?V????f\233\236\237?i??i???ݥ\225\225w?w?u?\237???k?vo?o??ޯ??\236?W??W??^Y??\237??\225?\177[\177\235?ݾy???w??[knZ??^?ߺ]\231\233{\225\231????[g~\231?o~u}??ݺ?\226?????????"..., entry=18446744073709551614, width=2 '\002', value=3 '\003') at bitstring.i:303 WT-2 0x000000000044e741 in __bit_setv_recno (page=0x8044208e0, recno=18446744073709551615, width=2 '\002', value=3 '\003') at bitstring.i:315 WT-3 0x000000000044e3c5 in __rec_col_fix (session=0x80268b660, r=0x804603400, page=0x8044208e0) at ../src/btree/rec_write.c:2331 WT-4 0x00000000004494fc in __wt_rec_write (session=0x80268b660, page=0x8044208e0, salvage=0x0, flags=0) at ../src/btree/rec_write.c:317
What's happening is that we're reconciling a page, and there's a record on the page's update list that has an invalid record number.
Inside the page->modify->update skip list, I see:
$1775 = {upd = 0x80243b200, u = {recno = 646, key = {offset = 646, size = 0}}, next = 0x80243b1f0} $1776 = {upd = 0x805844380, u = {recno = 651, key = {offset = 651, size = 0}}, next = 0x805844370} $1777 = {upd = 0x805c3b540, u = {recno = 18446744073709551615, key = { offset = 4294967295, size = 4294967295}}, next = 0x80385a5b0} $1778 = {upd = 0x80243b740, u = {recno = 657, key = {offset = 657, size = 0}}, next = 0x80381f050} $1779 = {upd = 0x80243ba80, u = {recno = 658, key = {offset = 658, size = 0}}, next = 0x8024401d0}
$1777 has a recno of UINT64_MAX, and offset/size values of UINT32_MAX, but the WT_UPDATE structure looks reasonable:
(gdb) p *(WT_UPDATE *)0x805c3b540 $1791 = {size = 1, txnid = 4007, next = 0x0}
Entry $1777 isn't at the head of any of the skip-list chains, it's right in the middle of all of them. (That's why I don't think this has anything to do with the WT-638 branch, this entry was pushed into the list long after the list itself was allocated).
I would think that a recno of UINT64_MAX would be a smoking gun, but I can't find it. There's a little dance in __wt_btcur_insert():
/* * If WT_CURSTD_APPEND is set, insert a new record (ignoring * the application's record number). First we search for the * maximum possible record number so the search ends on the * last page. The real record number is assigned by the * serialized append operation. */ if (F_ISSET(cursor, WT_CURSTD_APPEND)) cbt->iface.recno = UINT64_MAX; WT_ERR(__wt_col_search(session, cbt, 1)); if (F_ISSET(cursor, WT_CURSTD_APPEND)) cbt->iface.recno = 0;
but I don't see how it's possible for the recno to be set incorrectly – imagine that we get WT_RESTART from __wt_col_search(), the next time through the loop we should still have WT_CURSTD_APPEND set.
Here's the config that saw this error:
############################################ # RUN PARAMETERS ############################################ bitcnt=2 cache=40 compression=snappy data_extend=0 data_source=file delete_pct=0 # dictionary not applicable to this run file_type=fixed-length column-store hot_backups=0 # huffman_key not applicable to this run # huffman_value not applicable to this run insert_pct=41 internal_key_truncation=0 internal_page_max=17 key_gap=17 # key_max not applicable to this run # key_min not applicable to this run leaf_page_max=12 ops=5000 # prefix not applicable to this run # repeat_data_pct not applicable to this run reverse=0 rows=1000 runs=0 split_pct=43 threads=31 # value_max not applicable to this run # value_min not applicable to this run # wiredtiger_config not applicable to this run write_pct=20 ############################################
@michaelcahill, any ideas?
- is related to
-
WT-656 test/format failure: checkpoint ranges never verified
- Closed
- related to
-
WT-1 placeholder WT-1
- Closed
-
WT-2 What does metadata look like?
- Closed
-
WT-3 What file formats are required?
- Closed
-
WT-4 Flexible cursor traversals
- Closed
-
WT-638 Modify update atomic
- Closed
-
WT-643 test/format failure: illegal cell and page type combination
- Closed