test/format failed on tinderbox a couple days ago.
http://mjc.homeunix.org:8180/job/wiredtiger-test-format-stress/5419/
This console says:
[1412512381:36088][42905:00577afe927f0000], t, file:wt, eviction-worker: discarded page has hazard pointer: (0x7f92dc209810: ../src/btree/col_srch.c, line 95) [1412512381:36142][42905:00577afe927f0000], t, file:wt, eviction-worker: ../src/btree/bt_discard.c, 67: hp == NULL [1412512381:36148][42905:00577afe927f0000], t, file:wt, eviction-worker: aborting WiredTiger library
Although the hp->page is listed in the output, in the core file it is already NULL:
(gdb) p hp $1 = (WT_HAZARD *) 0x2788130 (gdb) p *hp $2 = {page = 0x0, file = 0x67364d "../src/btree/col_srch.c", line = 95} (gdb) p *(WT_PAGE*)0x7f92dc209810 $3 = {u = {intl = {recno = 28400, parent_ref = 0x7f92dc209858, __index = 0x7f92dc040a10}, row = {d = 0x6ef0, ins = 0x7f92dc209858, upd = 0x7f92dc040a10, entries = 91}, col_fix = {recno = 28400, bitf = 0x7f92dc209858 "(", entries = 3691252240}, col_var = {recno = 28400, d = 0x7f92dc209858, repeats = 0x7f92dc040a10, nrepeats = 91, entries = 28231}}, dsk = 0x7f92dc252bb0, modify = 0x0, read_gen = 6334137, memory_footprint = 1180930, type = 4 '\004', flags_atomic = 2 '\002'}
Here are the stacks of all interesting threads (4 of 8). Thread 1 is the one that took the assertion. Thread 5 shows the file/line in the hazard pointer.
(gdb) thread apply all bt Thread 5 (Thread 0x7f92ff7a7700 (LWP 43594)): #0 0x000000339e0daf87 in sched_yield () at ../sysdeps/unix/syscall-template.S:81 WT-1 0x000000000042e67d in __wt_yield () at ../src/os_posix/os_yield.c:17 WT-2 0x00000000004c0f0e in __wt_page_in_func (session=0x2792e10, ref=0x7f92bc1d2770, flags=0, file=0x67364d "../src/btree/col_srch.c", line=95) at ../src/btree/bt_page.c:152 WT-3 0x00000000004c5eb0 in __wt_page_swap_func (session=0x2792e10, held=0x27cc4e8, want=0x7f92bc1d2770, flags=0, file=0x67364d "../src/btree/col_srch.c", line=95) at ../src/include/btree.i:1008 WT-4 0x00000000004c66c0 in __wt_col_search (session=0x2792e10, recno=56723, leaf=0x0, cbt=0x7f92dc041780) at ../src/btree/col_srch.c:95 WT-5 0x00000000004b2f67 in __cursor_col_search (session=0x2792e10, cbt=0x7f92dc041780) at ../src/btree/bt_cursor.c:223 WT-6 0x00000000004b3292 in __wt_btcur_search (cbt=0x7f92dc041780) at ../src/btree/bt_cursor.c:308 WT-7 0x000000000048927f in __curfile_search (cursor=0x7f92dc041780) at ../src/cursor/cur_file.c:165 WT-8 0x00000000004088d9 in col_remove (cursor=0x7f92dc041780, key=0x7f92ff7a6e60, keyno=56723, notfoundp=0x7f92ff7a6e14) at ../../../test/format/ops.c:1031 WT-9 0x0000000000406edd in ops (arg=0x27cce80) at ../../../test/format/ops.c:345 WT-10 0x000000339e807c53 in start_thread (arg=0x7f92ff7a7700) at pthread_create.c:308 WT-11 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Thread 4 (Thread 0x7f92fd1a5700 (LWP 43593)): #0 0x0000000000473bb0 in __rec_col_var (session=0x2793090, r=0x7f92b421f660, page=0x7f92b427fa00, salvage=0x0) at ../src/btree/rec_write.c:3581 WT-1 0x000000000046d6e9 in __wt_rec_write (session=0x2793090, ref=0x27cc720, salvage=0x0, flags=5) at ../src/btree/rec_write.c:405 WT-2 0x0000000000467232 in __rec_review (session=0x2793090, ref=0x27cc720, exclusive=0, top=1, istree=0x7f92fd1a4a34) at ../src/btree/rec_evict.c:390 WT-3 0x0000000000466737 in __wt_rec_evict (session=0x2793090, ref=0x27cc720, exclusive=0) at ../src/btree/rec_evict.c:54 WT-4 0x0000000000457f6d in __wt_evict_page (session=0x2793090, ref=0x27cc720) at ../src/btree/bt_evict.c:618 WT-5 0x0000000000459096 in __wt_evict_lru_page (session=0x2793090, is_app=1) at ../src/btree/bt_evict.c:1220 WT-6 0x00000000004b271c in __wt_cache_full_check (session=0x2793090) at ../src/include/cache.i:133 WT-7 0x00000000004b28fb in __cursor_enter (session=0x2793090) at ../src/include/cursor.i:62 WT-8 0x00000000004b29d4 in __curfile_enter (cbt=0x7f92b44a5c10) at ../src/include/cursor.i:99 WT-9 0x00000000004b2b01 in __cursor_func_init (cbt=0x7f92b44a5c10, reenter=1) at ../src/include/cursor.i:145 WT-10 0x00000000004b3248 in __wt_btcur_search (cbt=0x7f92b44a5c10) at ../src/btree/bt_cursor.c:306 WT-11 0x000000000048927f in __curfile_search (cursor=0x7f92b44a5c10) at ../src/cursor/cur_file.c:165 WT-12 0x00000000004075de in read_row (cursor=0x7f92b44a5c10, key=0x7f92fd1a4e60, keyno=70321) at ../../../test/format/ops.c:542 WT-13 0x000000000040714e in ops (arg=0x27cce30) at ../../../test/format/ops.c:416 WT-14 0x000000339e807c53 in start_thread (arg=0x7f92fd1a5700) at pthread_create.c:308 WT-15 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Thread 2 (Thread 0x7f92fdfa4700 (LWP 52680)): #0 0x000000000046ba4a in __wt_cell_unpack_safe (cell=0x7f92a416685c, unpack=0x7f92fdfa3a00, end=0x0) at ../src/include/cell.i:599 WT-1 0x000000000046bd87 in __wt_cell_unpack (cell=0x7f92a416685c, unpack=0x7f92fdfa3a00) at ../src/include/cell.i:709 WT-2 0x0000000000470343 in __rec_split_raw_worker (session=0x2792410, r=0x7f92a412ad40, no_more_rows=1) at ../src/btree/rec_write.c:2118 WT-3 0x000000000047138c in __rec_split_finish_raw (session=0x2792410, r=0x7f92a412ad40) at ../src/btree/rec_write.c:2548 WT-4 0x00000000004713dd in __rec_split_finish (session=0x2792410, r=0x7f92a412ad40) at ../src/btree/rec_write.c:2563 WT-5 0x0000000000474345 in __rec_col_var (session=0x2792410, r=0x7f92a412ad40, page=0x7f92b41e0000, salvage=0x0) at ../src/btree/rec_write.c:3878 WT-6 0x000000000046d6e9 in __wt_rec_write (session=0x2792410, ref=0x7f92bc1d0830, salvage=0x0, flags=5) at ../src/btree/rec_write.c:405 WT-7 0x0000000000467232 in __rec_review (session=0x2792410, ref=0x7f92bc1d0830, exclusive=0, top=1, istree=0x7f92fdfa3dc4) at ../src/btree/rec_evict.c:390 WT-8 0x0000000000466737 in __wt_rec_evict (session=0x2792410, ref=0x7f92bc1d0830, exclusive=0) at ../src/btree/rec_evict.c:54 WT-9 0x0000000000457f6d in __wt_evict_page (session=0x2792410, ref=0x7f92bc1d0830) at ../src/btree/bt_evict.c:618 WT-10 0x0000000000459096 in __wt_evict_lru_page (session=0x2792410, is_app=1) at ../src/btree/bt_evict.c:1220 WT-11 0x0000000000458157 in __evict_lru_pages (session=0x2792410, is_app=1) at ../src/btree/bt_evict.c:703 WT-12 0x000000000045779f in __evict_worker (arg=0x27b8a50) at ../src/btree/bt_evict.c:365 WT-13 0x000000339e807c53 in start_thread (arg=0x7f92fdfa4700) at pthread_create.c:308 WT-14 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Thread 1 (Thread 0x7f92fe7a5700 (LWP 52679)): #0 0x000000339e0359e9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 WT-1 0x000000339e0370f8 in __GI_abort () at abort.c:90 WT-2 0x00000000004a49bb in __wt_abort (session=0x2792190) at ../src/os_posix/os_abort.c:24 WT-3 0x0000000000443dff in __wt_assert (session=0x2792190, error=0, file_name=0x6724f8 "../src/btree/bt_discard.c", line_number=67, fmt=0x6724b7 "%s") at ../src/support/err.c:452 WT-4 0x00000000004b8e0e in __wt_page_out (session=0x2792190, pagep=0x7f92bc1d2770) at ../src/btree/bt_discard.c:67 WT-5 0x00000000004b8cae in __wt_ref_out (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/bt_discard.c:32 WT-6 0x0000000000466ae9 in __rec_page_clean_update (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/rec_evict.c:126 WT-7 0x00000000004668a9 in __wt_rec_evict (session=0x2792190, ref=0x7f92bc1d2770, exclusive=0) at ../src/btree/rec_evict.c:80 WT-8 0x0000000000457f6d in __wt_evict_page (session=0x2792190, ref=0x7f92bc1d2770) at ../src/btree/bt_evict.c:618 WT-9 0x0000000000459096 in __wt_evict_lru_page (session=0x2792190, is_app=1) at ../src/btree/bt_evict.c:1220 WT-10 0x0000000000458157 in __evict_lru_pages (session=0x2792190, is_app=1) at ../src/btree/bt_evict.c:703 WT-11 0x000000000045779f in __evict_worker (arg=0x27b8a30) at ../src/btree/bt_evict.c:365 WT-12 0x000000339e807c53 in start_thread (arg=0x7f92fe7a5700) at pthread_create.c:308 WT-13 0x000000339e0f5dbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
So we have a race with page in/out and hazard pointers. Thread 5 is trying to set the hazard pointer in *wt_hazard_set. After setting the page, file and line, we do a WT_FULL_BARRIER. Then it decides that the page is not available, setting hp->page = NULL but without publishing that. At the same time, thread 1, an eviction thread, is doing its last-minute checks on page out. It is in *wt_page_hazard_check and happens to be walking that session's hazard pointers and finds the same hp->page.
In __wt_hazard_set we say:
* We don't bother publishing this update: the worst case is we * prevent some random page from being evicted.
Perhaps we do need to publish this update. Although it seems like that just narrows a window anyway. We set the page and then could be descheduled before deciding it's busy and clearing the page.
- related to
-
WT-1 placeholder WT-1
- Closed
-
WT-2 What does metadata look like?
- Closed
-
WT-3 What file formats are required?
- Closed
-
WT-4 Flexible cursor traversals
- Closed
-
WT-5 How does pget work: is it necessary?
- Closed
-
WT-6 Complex schema example
- Closed
-
WT-7 Do we need the handle->err/errx methods?
- Closed
-
WT-8 Do we need table load, bulk-load and/or dump methods?
- Closed
-
WT-9 Does adding schema need to be transactional?
- Closed
-
WT-10 Basic "getting started" tutorial
- Closed
-
WT-11 placeholder #11
- Closed
-
WT-12 Write more examples
- Closed
-
WT-13 Define supported platforms
- Closed
-
WT-14 Windows build
- Closed
-
WT-15 Automated build/test infrastructure
- Closed
-
WT-1039 discarded page has hazard pointer
- Closed