When I ran the close-debug branch on the AWS HD box, it hung overnight during the populate phase (medium-multi-lsm). The appq is growing very large. I put a breakpoint in __wt_lsm_manager_pop_entry after acquiring the app spin lock and executed this gdb function. I killed it after a couple minutes while typing this up, as it hadn't finished yet, and the queue length was over 400K! [EDIT: I ran it a second time, intending to let it finish and ran out of patience and that one was over 400K, not the 200K below]
(gdb) define qcount >set $c = 0 >set $p = (&manager->appqh)->tqh_first >while ($p != 0) >set $c = $c + 1 >set $p = ($p)->q.tqe_next >end >printf "App Q units queued: %d", $c >end (gdb) qcount ^CQuit (gdb) print $c $9 = 203489
Two of the three LSM worker threads are waiting on the evict_waiter_cond. Here are the stacks of some interesting threads. Thread 5 is the LSM worker "stuck" in the app queue for loop. Thread 10 is the eviction thread. Threads 3 and 4 are the other LSM workers stuck in __wt_cache_full_check and Thread 2 is an application thread similarly stuck:
Thread 10 (Thread 0x7f2e9ec72700 (LWP 30474)): #0 __wt_tree_walk (session=0x1ef3990, refp=0x7f2e6c068c38, flags=736) at ../src/btree/bt_walk.c:205 WT-1 0x0000000000459fdd in __evict_walk_file (session=0x1ef3990, slotp=0x7f2e9ec71d0c, flags=3) at ../src/btree/bt_evict.c:988 WT-2 0x0000000000459a86 in __evict_walk (session=0x1ef3990, entriesp=0x7f2e9ec71d7c, flags=3) at ../src/btree/bt_evict.c:895 WT-3 0x0000000000459494 in __evict_lru (session=0x1ef3990, flags=3) at ../src/btree/bt_evict.c:722 WT-4 0x0000000000458e97 in __evict_pass (session=0x1ef3990) at ../src/btree/bt_evict.c:484 WT-5 0x0000000000458082 in __evict_server (arg=0x1ef3990) at ../src/btree/bt_evict.c:162 Thread 5 (Thread 0x7f2e9c46d700 (LWP 30479)): #0 __wt_lsm_manager_pop_entry (session=0x1ef4890, type=18, entryp=0x7f2e9c46ce38) at ../src/lsm/lsm_manager.c:502 WT-1 0x0000000000427f36 in __lsm_worker_general_op (session=0x1ef4890, cookie=0x1f2cf30, completed=0x7f2e9c46ce70) at ../src/lsm/lsm_worker.c:43 WT-2 0x00000000004281dd in __lsm_worker (arg=0x1f2cf30) at ../src/lsm/lsm_worker.c:111 Thread 4 (Thread 0x7f2e9bc6c700 (LWP 30480)): #0 0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 WT-1 0x000000000042c98c in __wt_cond_wait (session=0x1ef4b10, cond=0x1f0ade0, usecs=100000) at ../src/os_posix/os_mtx.c:77 WT-2 0x00000000004aefca in __wt_cache_full_check (session=0x1ef4b10) at ../src/include/cache.i:166 WT-3 0x00000000004af0ff in __cursor_enter (session=0x1ef4b10) at ../src/include/cursor.i:62 WT-4 0x00000000004af1d8 in __curfile_enter (cbt=0x7f2e700ae310) at ../src/include/cursor.i:99 WT-5 0x00000000004af305 in __cursor_func_init (cbt=0x7f2e700ae310, reenter=0) at ../src/include/cursor.i:145 WT-6 0x00000000004b0b59 in __wt_btcur_next (cbt=0x7f2e700ae310, truncating=0) at ../src/btree/bt_curnext.c:399 WT-7 0x000000000048ade1 in __curfile_next (cursor=0x7f2e700ae310) at ../src/cursor/cur_file.c:79 WT-8 0x000000000049d39b in __clsm_next (cursor=0x7f2e70dc8d40) at ../src/lsm/lsm_cursor.c:794 WT-9 0x00000000004a50b6 in __lsm_bloom_create (session=0x1ef4b10, lsm_tree=0x1f6e2f0, chunk=0x7f2e6c066f20, chunk_off=2) at ../src/lsm/lsm_work_unit.c:376 WT-10 0x00000000004a479f in __wt_lsm_work_bloom (session=0x1ef4b10, lsm_tree=0x1f6e2f0) at ../src/lsm/lsm_work_unit.c:191 WT-11 0x00000000004280cf in __lsm_worker_general_op (session=0x1ef4b10, cookie=0x1f2cf50, completed=0x7f2e9bc6be70) at ../src/lsm/lsm_worker.c:69 ---Type <return> to continue, or q <return> to quit--- WT-12 0x00000000004281dd in __lsm_worker (arg=0x1f2cf50) at ../src/lsm/lsm_worker.c:111 Thread 3 (Thread 0x7f2e9b46b700 (LWP 30481)): #0 0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 WT-1 0x000000000042c98c in __wt_cond_wait (session=0x1ef4d90, cond=0x1f0ade0, usecs=100000) at ../src/os_posix/os_mtx.c:77 WT-2 0x00000000004aefca in __wt_cache_full_check (session=0x1ef4d90) at ../src/include/cache.i:166 WT-3 0x00000000004af0ff in __cursor_enter (session=0x1ef4d90) at ../src/include/cursor.i:62 WT-4 0x00000000004af1d8 in __curfile_enter (cbt=0x7f2e78fc1700) at ../src/include/cursor.i:99 WT-5 0x00000000004af305 in __cursor_func_init (cbt=0x7f2e78fc1700, reenter=0) at ../src/include/cursor.i:145 WT-6 0x00000000004b0b59 in __wt_btcur_next (cbt=0x7f2e78fc1700, truncating=0) at ../src/btree/bt_curnext.c:399 WT-7 0x000000000048ade1 in __curfile_next (cursor=0x7f2e78fc1700) at ../src/cursor/cur_file.c:79 WT-8 0x000000000049d39b in __clsm_next (cursor=0x7f2e78357b20) at ../src/lsm/lsm_cursor.c:794 WT-9 0x00000000004a50b6 in __lsm_bloom_create (session=0x1ef4d90, lsm_tree=0x1f6e2f0, chunk=0x7f2e6c01ec20, chunk_off=3) at ../src/lsm/lsm_work_unit.c:376 WT-10 0x00000000004a479f in __wt_lsm_work_bloom (session=0x1ef4d90, lsm_tree=0x1f6e2f0) at ../src/lsm/lsm_work_unit.c:191 WT-11 0x00000000004280cf in __lsm_worker_general_op (session=0x1ef4d90, cookie=0x1f2cf70, completed=0x7f2e9b46ae70) at ../src/lsm/lsm_worker.c:69 WT-12 0x00000000004281dd in __lsm_worker (arg=0x1f2cf70) at ../src/lsm/lsm_worker.c:111 Thread 2 (Thread 0x7f2e9ac6a700 (LWP 30482)): #0 0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 WT-1 0x000000000042c98c in __wt_cond_wait (session=0x1ef4390, cond=0x1f0ade0, usecs=100000) at ../src/os_posix/os_mtx.c:77 WT-2 0x000000000049ae53 in __wt_cache_full_check (session=0x1ef4390) at ../src/include/cache.i:166 WT-3 0x000000000049b423 in __clsm_enter (clsm=0x7f2e8c001ae0, reset=0, update=1) at ../src/lsm/lsm_cursor.c:164 WT-4 0x000000000049f450 in __clsm_insert (cursor=0x7f2e8c001ae0) at ../src/lsm/lsm_cursor.c:1330 WT-5 0x0000000000407195 in populate_thread (arg=0x1f54050) at ../../../bench/wtperf/wtperf.c:818
It isn't clear to me that the queue length is the real issue, or whether eviction is not making enough progress to unplug the other two threads so that they can help with the work queue. The first time I attached to the process, the eviction server was sleeping, while all Threads 2, 3, 4 were still waiting on the condition:
Thread 10 (Thread 0x7f2e9ec72700 (LWP 30474)): #0 0x00007f2e9fb55e2e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 WT-1 0x000000000042c98c in __wt_cond_wait (session=0x1ef3990, cond=0x1f0ad70, usecs=100000) at ../src/os_posix/os_mtx.c:77 WT-2 0x00000000004582ab in __evict_server (arg=0x1ef3990) at ../src/btree/bt_evict.c:194
- is related to
-
WT-1201 medium-lsm-multi failure in WT_CONNECTION::close
- Closed
- related to
-
WT-2 What does metadata look like?
- Closed
-
WT-3 What file formats are required?
- Closed
-
WT-4 Flexible cursor traversals
- Closed
-
WT-5 How does pget work: is it necessary?
- Closed
-
WT-6 Complex schema example
- Closed
-
WT-7 Do we need the handle->err/errx methods?
- Closed
-
WT-8 Do we need table load, bulk-load and/or dump methods?
- Closed
-
WT-9 Does adding schema need to be transactional?
- Closed
-
WT-10 Basic "getting started" tutorial
- Closed
-
WT-11 placeholder #11
- Closed
-
WT-1231 Lsm cleanup
- Closed