- Type: Bug
- Resolution: Fixed
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
Background
While testing WT-7267, I saw this assert fire once:
(gdb) bt
#0 0x00007f21ef6ece1f in __GI___select (nfds=0, readfds=0x0, writefds=0x0, exceptfds=0x0, timeout=0x7f21b9c011a0) at ../sysdeps/unix/sysv/linux/select.c:41
#1 0x00000000007e7ed6 in __wt_sleep (seconds=100, micro_seconds=0) at ../src/os_posix/os_sleep.c:30
#2 0x00000000007c303e in __wt_abort (session=0x7f21f0c236c0) at ../src/os_common/os_abort.c:26
#3 0x000000000068cfa6 in __curhs_search_near (cursor=0x614000088240, exactp=0x7f21b9c01bc0) at ../src/cursor/cur_hs.c:761
#4 0x000000000052c260 in hs_cursor (arg=0x0) at ../../../test/format/hs.c:83
#5 0x00000000004e8edf in __asan::AsanThread::ThreadStart(unsigned long, __sanitizer::atomic_uintptr_t*) ()
#6 0x00007f21f05c56db in start_thread (arg=0x7f21b9c02700) at pthread_create.c:463
#7 0x00007f21ef6f771f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) f 3
#3 0x000000000068cfa6 in __curhs_search_near (cursor=0x614000088240, exactp=0x7f21b9c01bc0) at ../src/cursor/cur_hs.c:761
761 WT_ASSERT(
(gdb) list
756 }
757 }
758
759 #ifdef HAVE_DIAGNOSTIC
760 WT_ERR(__wt_compare(session, NULL, &file_cursor->key, srch_key, &cmp));
761 WT_ASSERT(
762 session, (cmp == 0 && *exactp == 0) || (cmp < 0 && *exactp < 0) || (cmp > 0 && *exactp > 0));
763 #endif
764
765 __curhs_set_key_ptr(cursor, file_cursor);
(gdb) p cmp
$15 = -1
(gdb) p *exactp
$16 = 1
(gdb) p (char*)srch_key->data
$17 = 0x632000468800 "\202\234\060\060\060\060\063\071\061\060\064\063.00/opqrstuvwxyzab\343\002\327\233\200"
(gdb) p (char*)file_cursor->key.data
$18 = 0x6060001065f8 "\202\234\060\060\060\060\063\071\061\060\064\063.00/opqrstuvwxyzab\200\200"
(gdb) p (char*)datastore_key.data
$19 = 0x6060001065fa "0000391043.00/opqrstuvwxyzab\200\200"
(gdb) p datastore_key.size
$20 = 28
If we land ahead of our search key, we first look ahead to see if anything is visible. If nothing is visible in our key range, we then go backwards and see if anything is visible there. However, we assume that as soon as we do a prev and see something in our key range, it must be on the OTHER side of the search key, and we adjust exactp accordingly.
This is normally true because readers don't read the history store while concurrent inserts are happening for that particular key range. In the case of eviction, the page is locked against readers. In the case of checkpoint, the inserts written to the history store are still in the update chain, so readers don't need to go to the history store to find them anyway.
The history store thread in format violates this assumption: it opens a history store cursor within format and performs random operations and search_near calls across the table. The search_near implementation doesn't expect that, which led to this assertion firing.
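To make the failure mode concrete, here is a minimal toy model in C. It is not the WiredTiger code: the integer keys, the toy_entry type and the toy_* function names are all made up for illustration, and the concurrent insert is simulated by handing the backward walk a different snapshot of the table than the forward walk saw.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a history store record: an integer key plus a visibility flag. */
typedef struct {
    int key;
    bool visible;
} toy_entry;

/* Mirrors the diagnostic check in the listing: the sign of cmp must match the sign of exact. */
static bool
toy_exact_consistent(int found_key, int srch_key, int exact)
{
    int cmp = (found_key > srch_key) - (found_key < srch_key);

    return ((cmp == 0 && exact == 0) || (cmp < 0 && exact < 0) || (cmp > 0 && exact > 0));
}

/*
 * Simplified search_near over a sorted table. fwd is the table as seen by the forward walk,
 * bwd is the table as seen by the backward walk. Passing two different snapshots simulates
 * an insert that happens between the two walks.
 */
static bool
toy_search_near(const toy_entry *fwd, size_t nfwd, const toy_entry *bwd, size_t nbwd,
  int srch_key, int *foundp, int *exactp)
{
    size_t i;

    /* Forward walk: the first visible entry at or after the search key. */
    for (i = 0; i < nfwd; i++)
        if (fwd[i].key >= srch_key && fwd[i].visible) {
            *foundp = fwd[i].key;
            *exactp = fwd[i].key == srch_key ? 0 : 1;
            return (true);
        }

    /* Backward walk: take the first visible entry and ASSUME it sorts before the search key. */
    for (i = nbwd; i > 0; i--)
        if (bwd[i - 1].visible) {
            *foundp = bwd[i - 1].key;
            *exactp = -1; /* Unchecked assumption, only valid if nothing was inserted. */
            return (true);
        }
    return (false);
}
```

With the same snapshot for both walks, the backward walk can only return an entry smaller than the search key, so the check holds. If a visible entry larger than the search key appears between the walks, the backward walk returns it with exact set to -1 and toy_exact_consistent reports the mismatch, which is the same kind of sign disagreement between cmp and *exactp seen in the dump above.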
Scope
Figure out the best way to resolve this failure while keeping as much coverage as possible (so ideally not just removing the thread entirely). We could harden search_near to handle concurrent inserts, or stop calling search_near from the format history store thread, since just iterating seems to be OK.
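One possible shape for the hardening option, sketched under the assumption that an extra key comparison on this path is acceptable: derive exact from an actual comparison of the key we ended up on, rather than from the direction we walked. The names below (toy_compare, toy_set_exact) and the integer keys are hypothetical and are not the WiredTiger API.

```c
#include <assert.h>

/* Hypothetical three-way comparison on integer keys, returning -1, 0 or 1. */
static int
toy_compare(int a, int b)
{
    return ((a > b) - (a < b));
}

/*
 * Set exact from where the cursor actually is relative to the search key, so the result
 * stays correct even if an entry was inserted between the forward and backward walks.
 */
static void
toy_set_exact(int found_key, int srch_key, int *exactp)
{
    *exactp = toy_compare(found_key, srch_key);
}
```

This only makes the returned exact value truthful. Whether returning a key from the unexpected side is acceptable to callers is a separate question for this ticket.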