-
Type: Task
-
Resolution: Done
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
When I run against LinkBench, I occasionally see a hang:
34 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__wt_log_slot_wait,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction 15 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__log_release,__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction 1 slot=0x27dd890),__wt_log_write,__wt_txn_log_commit,__wt_txn_commit,__session_commit_transaction
There are 50 application threads. 34 of those threads are waiting for their slots to be finalized. 15 of those threads are waiting for their chance to release their slot. 1 of those threads is attempting to find a new slot in the slot pool.
If I capture the hang in GDB I can see:
Thread 67 (Thread 0x7ff98827d700 (LWP 12971)): #0 0x00007ff9aa10c7cc in __wt_log_slot_close ( session=session@entry=0x1bf4118, slot=0x1c06670) at ../src/log/log_slot.c:197 WT-1 0x00007ff9aa10b440 in __wt_log_write (session=session@entry=0x1bf4118, record=0x7ff57c011940, lsnp=lsnp@entry=0x0, flags=2) at ../src/log/log.c:1018 WT-2 0x00007ff9aa13ad0d in __wt_txn_log_commit ( session=session@entry=0x1bf4118, cfg=cfg@entry=0x7ff98827abe0) at ../src/txn/txn_log.c:129 WT-3 0x00007ff9aa1391a6 in __wt_txn_commit (session=session@entry=0x1bf4118, cfg=cfg@entry=0x7ff98827abe0) at ../src/txn/txn.c:352 WT-4 0x00007ff9aa1317d5 in __session_commit_transaction (wt_session=0x1bf4118, config=<optimised out>) at ../src/session/session_api.c:649 (More stack frames follow...) (gdb) thread 67 [Switching to thread 67 (Thread 0x7ff98827d700 (LWP 12971))] #0 0x00007ff9aa10c7cc in __wt_log_slot_close ( session=session@entry=0x1bf4118, slot=0x1c06670) at ../src/log/log_slot.c:197 197 if (++log->pool_index >= SLOT_POOL) (gdb) p slot->slot_release_lsn $1 = {file = 155, offset = 42647808} (gdb) p log->write_lsn $2 = {file = 155, offset = 42648192} (gdb) p *slot $3 = {slot_state = 25347, slot_group_size = 384, slot_error = 0, slot_index = 0, slot_start_offset = 42647808, slot_release_lsn = { file = 155, offset = 42647808}, slot_start_lsn = {file = 155, offset = 42647808}, slot_end_lsn = {file = 155, offset = 42648192}, slot_fh = 0x7ff4d1e3b680, slot_buf = {data = 0x7ff89bddf010, size = 0, flags = 0, mem = 0x7ff89bddf010, memsize = 40706048}, slot_done_cond = 0x1c06d20, flags = 2} (gdb) p slot->slot_end_lsn $4 = {file = 155, offset = 42648192}
So.. The thread in
Unable to find source-code formatter for language: __wt_slot_close```. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml
has a
slot_release_lsn
that is one behind the current
log->write_lsn
. That should not happen. I think there is a real issue in this code, if there are two or more active slots, then a slot may own the oldest slot_release_lsn, and get stuck waiting for an available slot in
__wt_slot_close{{`}} which would lead to a hang.