Type: Task
Resolution: Done
Affects Version/s: None
Component/s: None
Michael and I ran into a hang today when running test/format on the LSM code.
It turns out the problem occurs when a checkpoint runs while a bulk cursor is being closed. It isn't related to LSM.
I've made some changes to the fops test application that demonstrate the problem, and pushed them to a new branch, fops-bulk (https://github.com/wiredtiger/wiredtiger/tree/fops-bulk).
If I run fops with:
./t -n 1000 -r 1 -t 2
it regularly hangs. When I capture the state in a debugger, I see:
Thread 4 (process 6614):
#0 0x00007fff8df83122 in __psynch_mutexwait ()
#1 0x00007fff8f23cddd in pthread_mutex_lock ()
#2 0x000000010005595c in __wt_spin_lock (session=0x100804c30, t=0x1008044f0) at mutex.i:81
#3 0x0000000100055852 in __curbulk_close (cursor=0x101800500) at cur_bulk.c:53
#4 0x00000001000013ac in obj_bulk () at file.c:31
#5 0x0000000100001ca3 in fop (arg=0x1) at fops.c:134
#6 0x00007fff8f237782 in _pthread_start ()
#7 0x00007fff8f2241c1 in thread_start ()

Thread 3 (process 6614):
#0 0x00007fff8df8315e in __psynch_rw_rdlock ()
#1 0x00007fff8f23d915 in pthread_rwlock_rdlock ()
#2 0x0000000100067a43 in __wt_readlock (session=0x100804e48, rwlock=0x100600500) at os_mtx.c:176
#3 0x0000000100051a41 in __conn_btree_open_lock (session=0x100804e48, flags=0) at conn_btree.c:36
#4 0x0000000100051c8d in __conn_btree_get (session=0x100804e48, name=0x1018002f0 "file:__wt", ckpt=0x0, flags=0) at conn_btree.c:106
#5 0x000000010005249d in __wt_conn_btree_get (session=0x100804e48, name=0x1018002f0 "file:__wt", ckpt=0x0, cfg=0x0, flags=0) at conn_btree.c:254
#6 0x000000010007e507 in __wt_session_get_btree (session=0x100804e48, uri=0x1018002f0 "file:__wt", checkpoint=0x0, cfg=0x0, flags=0) at session_btree.c:244
#7 0x00000001000624c6 in __wt_meta_btree_apply (session=0x100804e48, func=0x100086c30 <__wt_checkpoint>, cfg=0x100480e48, flags=0) at meta_apply.c:37
#8 0x0000000100086673 in __wt_txn_checkpoint (session=0x100804e48, cfg=0x100480e48) at txn_ckpt.c:100
#9 0x000000010007d76b in __session_checkpoint (wt_session=0x100804e48, config=0x100087c12 "name=fops") at session_api.c:509
#10 0x000000010000169e in obj_checkpoint () at file.c:84
#11 0x0000000100001c5d in fop (arg=0x0) at fops.c:122
#12 0x00007fff8f237782 in _pthread_start ()
#13 0x00007fff8f2241c1 in thread_start ()
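The bulk close (thread 4) is trying to acquire the schema lock while holding the handle lock. The checkpoint (thread 3) is trying to acquire the handle lock while holding the schema lock. The two threads take the same pair of locks in opposite orders, so each waits on the other and neither can make progress.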
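To make the lock-order inversion concrete, here is a minimal standalone sketch that reproduces the pattern outside WiredTiger. Everything in it is hypothetical: `schema_lock` and `handle_lock` are plain pthread mutexes standing in for the real locks (the real handle lock is a read/write lock), and `count_blocked_by_inversion` is a name invented for this illustration. It uses `pthread_mutex_trylock` for the second acquisition so that the demo reports the conflict instead of hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two locks named in the report. */
static pthread_mutex_t schema_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t handle_lock = PTHREAD_MUTEX_INITIALIZER;

static atomic_int holding; /* threads that hold their first lock */
static atomic_int tried;   /* threads that have probed their second lock */
static atomic_int blocked; /* probes that returned EBUSY */

struct lock_order {
    pthread_mutex_t *first;
    pthread_mutex_t *second;
};

static void *
worker(void *arg)
{
    struct lock_order *o = arg;
    int ret;

    pthread_mutex_lock(o->first);
    atomic_fetch_add(&holding, 1);
    /* Wait until both threads hold their first lock. */
    while (atomic_load(&holding) < 2)
        sched_yield();

    /* A blocking lock here would hang forever; trylock just reports it. */
    ret = pthread_mutex_trylock(o->second);
    if (ret == EBUSY)
        atomic_fetch_add(&blocked, 1);

    /* Keep the first lock until both threads have probed. */
    atomic_fetch_add(&tried, 1);
    while (atomic_load(&tried) < 2)
        sched_yield();

    if (ret == 0)
        pthread_mutex_unlock(o->second);
    pthread_mutex_unlock(o->first);
    return (NULL);
}

/* Returns how many of the two threads would have blocked on the second lock. */
int
count_blocked_by_inversion(void)
{
    /* Bulk close: handle lock, then schema lock. */
    struct lock_order bulk_close = { &handle_lock, &schema_lock };
    /* Checkpoint: schema lock, then handle lock. */
    struct lock_order checkpoint = { &schema_lock, &handle_lock };
    pthread_t a, b;
    int ret;

    atomic_store(&holding, 0);
    atomic_store(&tried, 0);
    atomic_store(&blocked, 0);

    ret = pthread_create(&a, NULL, worker, &bulk_close);
    assert(ret == 0);
    ret = pthread_create(&b, NULL, worker, &checkpoint);
    assert(ret == 0);
    (void)ret;

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return (atomic_load(&blocked));
}
```

Both threads find their second lock busy, which is exactly the state the debugger shows above. If both paths acquired the locks in the same order, at most one of them would ever wait.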
I'm wondering if checkpoint should skip files that are being used for bulk load. Do you think that is a reasonable approach? I guess the consequence is that a checkpoint would skip an empty file when the file was opened before the checkpoint started and the bulk cursor was opened after it.
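The proposal could look roughly like the sketch below. This is not WiredTiger code: `struct demo_handle`, its `bulk_active` flag, and `demo_checkpoint_all` are hypothetical names used only to illustrate the idea of a checkpoint pass that steps over any file currently owned by a bulk cursor, instead of waiting for its handle lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handle; not WiredTiger's real data-handle structure. */
struct demo_handle {
    const char *uri;
    bool bulk_active; /* set while a bulk cursor owns the file */
    int checkpoints;  /* number of checkpoints taken of this file */
};

/*
 * Checkpoint every handle that is not being bulk-loaded.
 * Returns the number of handles that were skipped.
 */
size_t
demo_checkpoint_all(struct demo_handle *handles, size_t n)
{
    size_t i, skipped;

    assert(handles != NULL || n == 0);

    skipped = 0;
    for (i = 0; i < n; i++) {
        if (handles[i].bulk_active) {
            /* Skip rather than block on the bulk cursor's handle. */
            skipped++;
            continue;
        }
        handles[i].checkpoints++;
    }
    return (skipped);
}
```

In a real multi-threaded implementation the flag would have to be read under whatever lock protects the handle list; the sketch leaves that out to keep the skip logic visible.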