  1. WiredTiger
  2. WT-657

test/format failure: __ovfl_cache_row_visible core dump

    • Type: Task
    • Resolution: Done
    • WT1.6.5
    • Affects Version/s: None
    • Component/s: None

      @michaelcahill, I just saw a core dump on bengal I haven't seen before:

      Program received signal SIGSEGV, Segmentation fault.
      [Switching to Thread 0x7fffecdfa700 (LWP 4434)]
      0x00000000004a28c0 in __ovfl_cache_row_visible (session=0x8ee090, 
          page=0x7fff90007020, rip=0x7fff90007160) at ../src/btree/bt_ovfl.c:170
      170			if (__wt_txn_visible_all(session, upd->txnid))
      

      Function call stack:

      (gdb) where
      #0  0x00000000004a28c0 in __ovfl_cache_row_visible (session=0x8ee090, 
          page=0x7fff90007020, rip=0x7fff90007160) at ../src/btree/bt_ovfl.c:170
      #1  0x00000000004a2bee in __wt_val_ovfl_cache (session=0x8ee090, 
          page=0x7fff90007020, cookie=0x7fff90007160, unpack=0x7fffecdf97d0)
          at ../src/btree/bt_ovfl.c:317
      #2  0x000000000045bdc5 in __rec_row_leaf (session=0x8ee090, r=0x7fffe0007a60, 
          page=0x7fff90007020, salvage=0x0) at ../src/btree/rec_write.c:3367
      #3  0x0000000000455f6a in __wt_rec_write (session=0x8ee090, 
          page=0x7fff90007020, salvage=0x0, flags=0) at ../src/btree/rec_write.c:329
      #4  0x000000000043ee99 in __wt_sync_file (session=0x8ee090, syncop=1)
          at ../src/btree/bt_evict.c:529
      #5  0x000000000044c5c3 in __wt_bt_cache_op (session=0x8ee090, ckptbase=0x0, 
          op=1) at ../src/btree/bt_sync.c:64
      #6  0x000000000043ac19 in __wt_checkpoint_write_leaves (session=0x8ee090, 
          cfg=0x7fffecdf9c80) at ../src/txn/txn_ckpt.c:774
      

      And, in that function, upd has been overwritten with text data:

      (gdb) p upd
      $15 = (WT_UPDATE *) 0x4e4d4c4b4a494847
      (gdb) printf "%s\n", &upd
      GHIJKLMN?
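
To make the "text data" observation concrete: the eight bytes of the bad pointer value, read in little-endian memory order, are all printable ASCII. The helper below is only an illustrative sketch (the name `ptr_bytes_as_text` is hypothetical, not part of WiredTiger); it extracts the bytes lowest-order first, which is how they would sit in memory on a little-endian host such as x86-64.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * ptr_bytes_as_text --
 *	Hypothetical helper: write the bytes of a 64-bit value into a
 *	NUL-terminated buffer, lowest-order byte first (little-endian
 *	memory order), so a corrupted pointer can be read as text.
 */
static void
ptr_bytes_as_text(uint64_t value, char out[sizeof(uint64_t) + 1])
{
	size_t i;

	for (i = 0; i < sizeof(uint64_t); ++i)
		out[i] = (char)((value >> (8 * i)) & 0xff);
	out[sizeof(uint64_t)] = '\0';
}
```

Passing `0x4e4d4c4b4a494847` yields the string `GHIJKLMN`, matching the gdb output above (the trailing `?` in the gdb output is presumably whatever followed `upd` in memory, since `printf "%s"` keeps reading until it finds a NUL byte).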
      

      But the WT_UPDATE chain from first looks OK:

      (gdb) p first
      $10 = (WT_UPDATE *) 0x7fffcc010900
      (gdb) p first->next
      $11 = (WT_UPDATE *) 0x7fffac0075c0
      (gdb) p $11->next
      $12 = (WT_UPDATE *) 0x7fffb800e0a0
      (gdb) p $12->next
      $13 = (WT_UPDATE *) 0x7fffcc009ba0
      (gdb) p $13->next
      $14 = (WT_UPDATE *) 0x0
      

      So, the sequence is:

      • we're writing a row-store leaf page in __rec_row_leaf, and
      • discarding an overflow value from that page, so
      • we call __wt_val_ovfl_cache, which acquires the btree overflow cache lock,
      • and then calls __ovfl_cache_row_visible, which walks the list of WT_UPDATE structures for the page entry to see if there's a globally visible update.

      I'm guessing that we were walking the WT_UPDATE list when one of the structures was freed and repurposed, and then upd = upd->next left us with garbage in upd?

      We're not holding the serial function lock here, but that should be safe: we're not supposed to walk beyond a WT_UPDATE structure that's globally visible?
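
For reference, the walk being described can be sketched roughly as follows. This is a simplified illustration, not the code in bt_ovfl.c: `struct update`, `txn_visible_all` and `first_globally_visible` are hypothetical stand-ins for WT_UPDATE, __wt_txn_visible_all and the loop in __ovfl_cache_row_visible, and the visibility rule (a transaction ID older than the oldest ID any reader could need) is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WT_UPDATE: only the fields the walk uses. */
struct update {
	uint64_t txnid;		/* transaction that created the update */
	struct update *next;	/* next (older) update in the chain */
};

/*
 * txn_visible_all --
 *	Hypothetical stand-in for __wt_txn_visible_all. Assumed rule: an
 *	update is globally visible if its transaction ID is older than the
 *	oldest ID any running transaction might still read.
 */
static bool
txn_visible_all(uint64_t oldest_id, uint64_t txnid)
{
	return (txnid < oldest_id);
}

/*
 * first_globally_visible --
 *	Walk the chain from newest to oldest and stop at the first globally
 *	visible update. Nothing past that entry is dereferenced, which is the
 *	property the walk relies on when it runs without the serial lock.
 */
static const struct update *
first_globally_visible(const struct update *first, uint64_t oldest_id)
{
	const struct update *upd;

	for (upd = first; upd != NULL; upd = upd->next)
		if (txn_visible_all(oldest_id, upd->txnid))
			return (upd);
	return (NULL);
}
```

In this sketch the safety argument depends on every entry up to and including the first globally visible one staying allocated for the duration of the walk; the crash above is consistent with that assumption being violated for one of the earlier entries.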

            Assignee:
            Unassigned
            Reporter:
            keith.bostic@mongodb.com Keith Bostic (Inactive)