WiredTiger / WT-7529

Cache stuck in format-stress-sanitizer-smoke test

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      Running a patch build for WT-7519, I hit a failure as described below.

      After puzzling over it for a bit, I can't see how it could be related to that change, so I'm opening a new issue.

      We hit the failure running an ASAN build of format with the following config:

      ############################################
      #  RUN PARAMETERS: V2
      ############################################
      assert.commit_timestamp=0
      assert.read_timestamp=0
      backup=1
      backup.incremental=block
      backup.incr_granularity=16384
      btree.bitcnt=6
      btree.compression=zlib
      btree.dictionary=0
      btree.huffman_value=0
      btree.internal_key_truncation=1
      btree.internal_page_max=15
      btree.key_gap=2
      btree.key_max=56
      btree.key_min=21
      btree.leaf_page_max=13
      btree.memory_page_max=6
      btree.prefix_compression=1
      btree.prefix_compression_min=1
      btree.repeat_data_pct=7
      btree.reverse=0
      btree.split_pct=70
      btree.value_max=709
      btree.value_min=20
      cache=84
      cache.evict_max=3
      cache.minimum=20
      checkpoint=on
      checkpoint.log_size=119
      checkpoint.wait=54
      disk.checksum=uncompressed
      disk.data_extend=0
      disk.direct_io=0
      disk.encryption=none
      disk.firstfit=0
      disk.mmap=1
      disk.mmap_all=0
      format.abort=0
      format.independent_thread_rng=1
      format.major_timeout=0
      import=0
      logging=1
      logging.archive=0
      logging.compression=none
      logging.file_max=24619
      logging.prealloc=1
      lsm.auto_throttle=1
      lsm.bloom=1
      lsm.bloom_bit_count=33
      lsm.bloom_hash_count=31
      lsm.bloom_oldest=0
      lsm.chunk_size=2
      lsm.merge_max=5
      lsm.worker_threads=3
      ops.alter=0
      ops.compaction=0
      ops.hs_cursor=1
      ops.pct.delete=0
      ops.pct.insert=1
      ops.pct.modify=12
      ops.pct.read=0
      ops.pct.write=87
      ops.prepare=0
      ops.random_cursor=0
      ops.salvage=0
      ops.truncate=1
      ops.verify=1
      quiet=1
      runs=1
      runs.in_memory=0
      runs.ops=0
      runs.rows=100000
      runs.source=table
      runs.threads=6
      runs.timer=4
      runs.type=row-store
      runs.verify_failure_dump=0
      statistics=0
      statistics.server=0
      stress.aggressive_sweep=0
      stress.checkpoint=0
      stress.checkpoint_prepare=0
      stress.hs_checkpoint_delay=0
      stress.hs_search=1
      stress.hs_sweep=0
      stress.split_1=0
      stress.split_2=0
      stress.split_3=0
      stress.split_4=0
      stress.split_5=0
      stress.split_6=0
      stress.split_7=0
      stress.split_8=0
      transaction.frequency=100
      transaction.isolation=snapshot
      transaction.rollback_to_stable=0
      transaction.timestamps=1
      wiredtiger.config=
      wiredtiger.rwlock=1
      wiredtiger.leak_memory=0
      ############################################
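
      For orientation only (this is not format's actual code), here is a minimal sketch of roughly how a few of the connection-level knobs above would map onto a wiredtiger_open() call; the home directory ("RUNDIR.1"), exact option spellings, and units are assumptions:

      /*
       * Hedged sketch, not format's code: format builds its own configuration
       * string internally. Home directory and units are assumptions.
       */
      #include <wiredtiger.h>

      int
      open_like_config(WT_CONNECTION **connp)
      {
          const char *config =
              "create,"
              "cache_size=84MB,"        /* cache=84 */
              "checkpoint=(wait=54),"   /* checkpoint.wait=54 */
              "log=(enabled=true,archive=false,prealloc=true)," /* logging.* */
              "statistics=(none)";      /* statistics=0 */

          return (wiredtiger_open("RUNDIR.1", NULL, config, connp));
      }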
      

      It's a "cache stuck" error, hit after a number of transactions have been rolled back for eviction:

      [2021/05/12 02:08:58.912] format.sh: job in /data/mci/71d478ee5cba875843c23ea5b22ebad8/wiredtiger/test/format/RUNDIR.1 failed
       [2021/05/12 02:08:58.913]     t: process 11942 running
       [2021/05/12 02:08:58.913]     0x7f8c2ac8d7a0:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8d7a0:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8df90:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8df90:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8e780:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.913]     0x7f8c2ac8db98:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     0x7f8c2ac8df90:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     0x7f8c2ac8e388:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     0x7f8c2ac8eb78:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     0x7f8c2ac8d7a0:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     0x7f8c2ac8d7a0:oldest pinned transaction ID rolled back for eviction
       [2021/05/12 02:08:58.914]     [1620785336:267393][11942:0x7f8c1e51b700], eviction-server: __evict_server, 452: Cache stuck for too long, giving up: Connection timed out
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:transaction state dump
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:current ID: 17132
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:last running ID: 17132
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:metadata_pinned ID: 9723
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:oldest ID: 17132
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:durable timestamp: (0, 23606)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:oldest timestamp: (0, 23596)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:pinned timestamp: (0, 23596)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:stable timestamp: (0, 23606)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:has_durable_timestamp: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:has_oldest_timestamp: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:has_pinned_timestamp: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:has_stable_timestamp: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:oldest_is_pinned: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:stable_is_pinned: no
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:checkpoint running: yes
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:checkpoint generation: 3
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:checkpoint pinned ID: 9693
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:checkpoint txn ID: 9723
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:session count: 22
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:Transaction state of active sessions:
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:ID: 0, pinned ID: 17132, metadata pinned ID: 17132, name: WT_CURSOR.next
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:transaction id: 9723, mod count: 0, snap min: 9693, snap max: 9723, snapshot count: 6, commit_timestamp: (0, 0), durable_timestamp: (0, 0), first_commit_timestamp: (0, 0), prepare_timestamp: (0, 0), pinned_durable_timestamp: (0, 0), read_timestamp: (0, 0), checkpoint LSN: [2][2420992], full checkpoint: true, rollback reason: , flags: 0x0000040c, isolation: WT_ISO_READ_UNCOMMITTED
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache dump
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache full: no
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache clean check: no (24.562%)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache dirty check: yes (22.684%)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache updates check: no (9.964%)
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:file:wt.wt(<live>):
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:leaf: 1440 pages, 16MB, 1/1439 clean/dirty pages, 0/16/7 clean/dirty/updates MB, 0MB max page, 0MB max dirty page
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:file:WiredTigerHS.wt(<live>):
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:leaf: 24 pages, 1MB, 15/9 clean/dirty pages, 1/0/0 clean/dirty/updates MB, 0MB max page, 0MB max dirty page
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:file:WiredTiger.wt(<live>):
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:internal: 1 pages, 0MB, 0/1 clean/dirty pages, 0/0 clean/dirty MB, 0MB max page, 0MB max dirty page
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:leaf: 0 pages
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:cache dump: total found: 20MB vs tracked inuse 19MB
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:total dirty bytes: 18MB vs tracked dirty 18MB
       [2021/05/12 02:08:58.914]     0x7f8c2ac8bfd0:total updates bytes: 7MB vs tracked updates 8MB
       [2021/05/12 02:08:58.914]     [1620785336:269292][11942:0x7f8c1e51b700], eviction-server: __wt_evict_thread_run, 324: cache eviction thread error: Connection timed out
       [2021/05/12 02:08:58.914]     [1620785336:269311][11942:0x7f8c1e51b700], eviction-server: __wt_evict_thread_run, 324: the process must exit and restart: WT_PANIC: WiredTiger library panic
       [2021/05/12 02:08:58.914]     [1620785336:269323][11942:0x7f8c1e51b700], eviction-server: __wt_abort, 28: aborting WiredTiger library
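
      The repeated "oldest pinned transaction ID rolled back for eviction" lines are the rollback reason attached when WiredTiger returns WT_ROLLBACK to an operation because eviction forced out the oldest pinned transaction. As a point of reference (a sketch, not format's code; string key/value formats and the function name are assumptions), the usual application-side pattern is to retry on WT_ROLLBACK:

      #include <wiredtiger.h>

      /* Retry one update until it succeeds or fails with a non-rollback error. */
      int
      update_with_retry(WT_SESSION *session, WT_CURSOR *cursor, const char *key, const char *value)
      {
          int ret;

          for (;;) {
              if ((ret = session->begin_transaction(session, NULL)) != 0)
                  return (ret);

              cursor->set_key(cursor, key);
              cursor->set_value(cursor, value);
              if ((ret = cursor->update(cursor)) == 0)
                  /* On error, commit_transaction rolls the transaction back itself. */
                  ret = session->commit_transaction(session, NULL);
              else
                  (void)session->rollback_transaction(session, NULL);

              if (ret != WT_ROLLBACK)
                  return (ret);
              /* Forced rollback, e.g. for eviction: back off and retry. */
          }
      }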
      

      Nothing jumps out at me in the core dump. Besides the aborted eviction thread, there are only two other active threads. One is performing a checkpoint and appears to have been actively running at the time of the failure:

      Thread 6 (Thread 0x7f8c17bcd700 (LWP 11985)):
      #0  0x000000000063f858 in __curfile_next (cursor=0x619000ba1d80) at ../src/cursor/cur_file.c:90
      #1  0x00000000006721e0 in __curhs_file_cursor_next (session=0x7f8c2ac8fb58, cursor=0x619000ba1d80) at ../src/cursor/cur_hs.c:91
      #2  0x0000000000666232 in __curhs_search_near (cursor=0x6140000f4c40, exactp=0x7f8c17bc8140) at ../src/cursor/cur_hs.c:708
      #3  0x000000000065b0c7 in __curhs_search_near_helper (session=0x7f8c2ac8fb58, cursor=0x6140000f4c40, before=false) at ../src/cursor/cur_hs.c:556
      #4  0x000000000065bdff in __wt_curhs_search_near_after (session=0x7f8c2ac8fb58, cursor=0x6140000f4c40) at ../src/cursor/cur_hs.c:540
      #5  0x0000000000d35d15 in __hs_insert_record (session=0x7f8c2ac8fb58, cursor=0x6140000f4c40, btree=0x617000001c80, key=0x604000028290, type=3 '\003', hs_value=0x604000028450, tw=0x7f8c17bc91e0) at ../src/history/hs_rec.c:183
      #6  0x0000000000d3180d in __wt_hs_insert_updates (session=0x7f8c2ac8fb58, page=0x61600011a680, multi=0x619001be2c80, cache_write_hs=0x61d000220b90) at ../src/history/hs_rec.c:638
      #7  0x0000000000827a3e in __rec_hs_wrapup (session=0x7f8c2ac8fb58, r=0x61d000220680) at ../src/reconcile/rec_write.c:2319
      #8  0x000000000080d8bb in __rec_write_wrapup (session=0x7f8c2ac8fb58, r=0x61d000220680, page=0x61600011a680) at ../src/reconcile/rec_write.c:2145
      #9  0x00000000007fc8ab in __reconcile (session=0x7f8c2ac8fb58, ref=0x60e0000fb780, salvage=0x0, flags=36, page_lockedp=0x7f8c17bcae60) at ../src/reconcile/rec_write.c:219
      #10 0x00000000007fab05 in __wt_reconcile (session=0x7f8c2ac8fb58, ref=0x60e0000fb780, salvage=0x0, flags=36) at ../src/reconcile/rec_write.c:99
      #11 0x0000000000bacbb5 in __wt_sync_file (session=0x7f8c2ac8fb58, syncop=WT_SYNC_CHECKPOINT) at ../src/btree/bt_sync.c:642
      #12 0x000000000099f2f7 in __checkpoint_tree (session=0x7f8c2ac8fb58, is_checkpoint=true, cfg=0x7f8c17bccbc0) at ../src/txn/txn_ckpt.c:1728
      #13 0x00000000009a9eeb in __checkpoint_tree_helper (session=0x7f8c2ac8fb58, cfg=0x7f8c17bccbc0) at ../src/txn/txn_ckpt.c:1836
      #14 0x00000000009a9c7d in __checkpoint_apply_to_dhandles (session=0x7f8c2ac8fb58, cfg=0x7f8c17bccbc0, op=0x9a9d50 <__checkpoint_tree_helper>) at ../src/txn/txn_ckpt.c:197
      #15 0x00000000009a21f1 in __txn_checkpoint (session=0x7f8c2ac8fb58, cfg=0x7f8c17bccbc0) at ../src/txn/txn_ckpt.c:880
      #16 0x000000000099d356 in __txn_checkpoint_wrapper (session=0x7f8c2ac8fb58, cfg=0x7f8c17bccbc0) at ../src/txn/txn_ckpt.c:1109
      #17 0x000000000099cc60 in __wt_txn_checkpoint (session=0x7f8c2ac8fb58, cfg=0x7f8c17bccbc0, waiting=true) at ../src/txn/txn_ckpt.c:1163
      #18 0x00000000008a5630 in __session_checkpoint (wt_session=0x7f8c2ac8fb58, config=0x0) at ../src/session/session_api.c:1902
      #19 0x00000000004fdf5e in checkpoint (arg=0x0) at ../../../test/format/checkpoint.c:121
      #20 0x00007f8c2a63b6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      #21 0x00007f8c2976d71f in clone () from /lib/x86_64-linux-gnu/libc.so.6
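
      Frames #18/#19 are test/format's checkpoint thread going through the public API; the minimal shape of that call (a sketch with session setup assumed, config NULL as in the backtrace) is just:

      #include <wiredtiger.h>

      /* Sketch of the call at the bottom of the checkpoint thread's stack. */
      int
      run_checkpoint(WT_CONNECTION *conn)
      {
          WT_SESSION *session;
          int ret;

          if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
              return (ret);

          ret = session->checkpoint(session, NULL); /* __session_checkpoint, config=0x0 */

          (void)session->close(session, NULL);
          return (ret);
      }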
      

      The other thread is opening a backup cursor but is blocked on the checkpoint lock, which makes sense, since the previous thread is in the middle of a checkpoint.

      Thread 7 (Thread 0x7f8c1940a700 (LWP 11982)):
      #0  0x00007f8c2a6451fd in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
      #1  0x00007f8c2a63e025 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0
      #2  0x00000000006226f0 in __wt_spin_lock (session=0x7f8c2ac8ef70, t=0x6230000171e8) at ../src/include/mutex_inline.h:170
      #3  0x0000000000620d6f in __wt_spin_lock_track (session=0x7f8c2ac8ef70, t=0x6230000171e8) at ../src/include/mutex_inline.h:318
      #4  0x0000000000616eae in __wt_curbackup_open (session=0x7f8c2ac8ef70, uri=0xe32f40 <str> "backup:", other=0x0, cfg=0x7f8c19409820, cursorp=0x7f8c194097e0) at ../src/cursor/cur_backup.c:305
      #5  0x0000000000864f79 in __session_open_cursor_int (session=0x7f8c2ac8ef70, uri=0xe32f40 <str> "backup:", owner=0x0, other=0x0, cfg=0x7f8c19409820, hash_value=1599835161024497006, cursorp=0x7f8c194097e0) at ../src/session/session_api.c:495
      #6  0x000000000086997d in __session_open_cursor (wt_session=0x7f8c2ac8ef70, uri=0xe32f40 <str> "backup:", to_dup=0x0, config=0x7f8c19409a20 "incremental=(enabled,granularity=16384K,this_id=1620784951)", cursorp=0x7f8c194099c0) at ../src/session/session_api.c:615
      #7  0x00000000004f432d in backup (arg=0x0) at ../../../test/format/backup.c:591
      #8  0x00007f8c2a63b6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      #9  0x00007f8c2976d71f in clone () from /lib/x86_64-linux-gnu/libc.so.6
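
      Frames #6/#7 show what this thread is trying to do: open an incremental backup cursor with the config string visible in the backtrace. A hedged sketch of that call and of walking the returned file list (error handling trimmed; the this_id and granularity values are just the ones from the trace, and the function name is invented):

      #include <stdio.h>
      #include <wiredtiger.h>

      /* Open an incremental backup cursor and list the files it returns. */
      int
      take_incremental_backup(WT_SESSION *session)
      {
          WT_CURSOR *backup;
          const char *filename;
          int ret;

          /* Per the stack above, this call waits on the checkpoint lock while a checkpoint runs. */
          if ((ret = session->open_cursor(session, "backup:", NULL,
                 "incremental=(enabled,granularity=16384K,this_id=1620784951)", &backup)) != 0)
              return (ret);

          while ((ret = backup->next(backup)) == 0) {
              if ((ret = backup->get_key(backup, &filename)) != 0)
                  break;
              printf("copy for backup: %s\n", filename);
          }

          (void)backup->close(backup);
          return (ret == WT_NOTFOUND ? 0 : ret);
      }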
      

            Assignee: [DO NOT USE] Backlog - Storage Engines Team (backlog-server-storage-engines)
            Reporter: Keith Smith (keith.smith@mongodb.com)