Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-17587

Node crash scenario results in uncrecoverable error on subsequent startup under WiredTiger

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Critical - P2 Critical - P2
    • 3.0.2, 3.1.1
    • Affects Version/s: 3.0.1
    • Component/s: WiredTiger
    • Fully Compatible
    • ALL

      Issue Status as of Apr 02, 2015

      ISSUE SUMMARY
      An interaction between the validation operation with the WiredTiger storage engine and the underlying system power cycling leads to a situation where MongoDB refuses to start because of data corruption.

      USER IMPACT
      mongod may refuse to start due to data corruption caused by the power cyle.

      WORKAROUNDS
      Omit use of validate.

      AFFECTED VERSIONS
      3.0.1

      FIX VERSION
      The fix is included in the 3.0.2 production release.

      Original description

      Reproduce as follows. WARNING: this code will crash your machine.

      echo === create db
      rm -rf db; mkdir -p db
      mongod --dbpath db --storageEngine wiredTiger --logpath db.log --fork --syncdelay 0
      sleep 1; sync # work around SERVER-17571
      mongo --quiet --eval 'db.c.insert({x: 0})'
      
      echo === shut down
      killall -w mongod
      
      echo === validate/update loop
      mongod --dbpath db --storageEngine wiredTiger --logpath db.log --fork --syncdelay 0
      mongo --quiet --eval '
          for (var i=0; i<10; i++) {
              db.c.validate()
              db.c.update({}, {$inc: {x: 1}})
          }
      '
      
      # crash
      sudo bash -c 'echo 1 >/proc/sys/kernel/sysrq'
      sudo bash -c 'echo b >/proc/sysrq-trigger'
      

      On subsequent startup recovery fails and mongod terminates:

      2015-03-13T12:24:21.255-0400 E STORAGE  [initandlisten] WiredTiger (0) [1426263861:255456][2012:0x7f8fbc47cb80], file:collection-2-8220618549132206910.wt, session.open_cursor: read checksum error [4096B @ 8192, 3367698403 != 1510178857]
      2015-03-13T12:24:21.255-0400 E STORAGE  [initandlisten] WiredTiger (0) [1426263861:255578][2012:0x7f8fbc47cb80], file:collection-2-8220618549132206910.wt, session.open_cursor: collection-2-8220618549132206910.wt: encountered an illegal file format or internal value
      2015-03-13T12:24:21.255-0400 E STORAGE  [initandlisten] WiredTiger (-31804) [1426263861:255598][2012:0x7f8fbc47cb80], file:collection-2-8220618549132206910.wt, session.open_cursor: the process must exit and restart: WT_PANIC: WiredTiger library panic
      2015-03-13T12:24:21.255-0400 I -        [initandlisten] Fatal Assertion 28558
      2015-03-13T12:24:21.269-0400 I CONTROL  [initandlisten] 
       0xf4fd39 0xef9f81 0xeddb71 0xd78fda 0x13807f0 0x1380ab5 0x1380f54 0x12d9112 0x12f0ade 0x12ee878 0x12ef752 0x131779b 0x137fe68 0x137ff7a 0x1324cf1 0x137e26a 0x1335e33 0x137e0b1 0x137e4c5 0xd73f86 0xd6d4cb 0xd67264 0xd67bd3 0xd67ed9 0xd61d32 0xce21a6 0xce52dc 0xd60ba6 0xa6f8bd 0x7e20c0 0x7e7704 0x7f8fbb07afe0 0x7e02c9
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B4FD39"},{"b":"400000","o":"AF9F81"},{"b":"400000","o":"ADDB71"},{"b":"400000","o":"978FDA"},{"b":"400000","o":"F807F0"},{"b":"400000","o":"F80AB5"},{"b":"400000","o":"F80F54"},{"b":"400000","o":"ED9112"},{"b":"400000","o":"EF0ADE"},{"b":"400000","o":"EEE878"},{"b":"400000","o":"EEF752"},{"b":"400000","o":"F1779B"},{"b":"400000","o":"F7FE68"},{"b":"400000","o":"F7FF7A"},{"b":"400000","o":"F24CF1"},{"b":"400000","o":"F7E26A"},{"b":"400000","o":"F35E33"},{"b":"400000","o":"F7E0B1"},{"b":"400000","o":"F7E4C5"},{"b":"400000","o":"973F86"},{"b":"400000","o":"96D4CB"},{"b":"400000","o":"967264"},{"b":"400000","o":"967BD3"},{"b":"400000","o":"967ED9"},{"b":"400000","o":"961D32"},{"b":"400000","o":"8E21A6"},{"b":"400000","o":"8E52DC"},{"b":"400000","o":"960BA6"},{"b":"400000","o":"66F8BD"},{"b":"400000","o":"3E20C0"},{"b":"400000","o":"3E7704"},{"b":"7F8FBB05B000","o":"1FFE0"},{"b":"400000","o":"3E02C9"}],"processInfo":{ "mongodbVersion" : "3.0.1-rc0", "gitVersion" : "3f999d764cdd307f779b27e3a28230b21bd43e25", "uname" : { "sysname" : "Linux", "release" : "3.17.4-301.fc21.x86_64", "version" : "#1 SMP Thu Nov 27 19:09:10 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF0FBFE000", "elfType" : 3 }, { "b" : "7F8FBC052000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F8FBBE4A000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F8FBBC46000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F8FBB937000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F8FBB62F000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F8FBB418000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F8FBB05B000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F8FBC26E000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf4fd39]
       mongod(_ZN5mongo10logContextEPKc+0xE1) [0xef9f81]
       mongod(_ZN5mongo13fassertFailedEi+0x61) [0xeddb71]
       mongod(+0x978FDA) [0xd78fda]
       mongod(+0xF807F0) [0x13807f0]
       mongod(__wt_err+0x95) [0x1380ab5]
       mongod(__wt_panic+0x24) [0x1380f54]
       mongod(__wt_bm_read+0x72) [0x12d9112]
       mongod(__wt_bt_read+0x7E) [0x12f0ade]
       mongod(__wt_btree_tree_open+0x58) [0x12ee878]
       mongod(__wt_btree_open+0xD72) [0x12ef752]
       mongod(__wt_conn_btree_get+0x19B) [0x131779b]
       mongod(__wt_session_get_btree+0x2D8) [0x137fe68]
       mongod(__wt_session_get_btree_ckpt+0xCA) [0x137ff7a]
       mongod(__wt_curfile_open+0xE1) [0x1324cf1]
       mongod(__wt_open_cursor+0x26A) [0x137e26a]
       mongod(__wt_curtable_open+0x2E3) [0x1335e33]
       mongod(__wt_open_cursor+0xB1) [0x137e0b1]
       mongod(+0xF7E4C5) [0x137e4c5]
       mongod(_ZN5mongo17WiredTigerSession9getCursorERKSsmb+0x156) [0xd73f86]
       mongod(_ZN5mongo16WiredTigerCursorC1ERKSsmbPNS_16OperationContextE+0x4B) [0xd6d4cb]
       mongod(_ZN5mongo21WiredTigerRecordStore8IteratorC2ERKS0_PNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionEb+0x64) [0xd67264]
       mongod(_ZNK5mongo21WiredTigerRecordStore11getIteratorEPNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionE+0x43) [0xd67bd3]
       mongod(_ZN5mongo21WiredTigerRecordStoreC1EPNS_16OperationContextERKNS_10StringDataES5_bllPNS_28CappedDocumentDeleteCallbackEPNS_20WiredTigerSizeStorerE+0x2B9) [0xd67ed9]
       mongod(_ZN5mongo18WiredTigerKVEngine14getRecordStoreEPNS_16OperationContextERKNS_10StringDataES5_RKNS_17CollectionOptionsE+0x132) [0xd61d32]
       mongod(_ZN5mongo22KVDatabaseCatalogEntry14initCollectionEPNS_16OperationContextERKSsb+0x276) [0xce21a6]
       mongod(_ZN5mongo15KVStorageEngineC1EPNS_8KVEngineERKNS_22KVStorageEngineOptionsE+0x69C) [0xce52dc]
       mongod(+0x960BA6) [0xd60ba6]
       mongod(_ZN5mongo23GlobalEnvironmentMongoD22setGlobalStorageEngineERKSs+0x30D) [0xa6f8bd]
       mongod(_ZN5mongo13initAndListenEi+0x2F0) [0x7e20c0]
       mongod(main+0x134) [0x7e7704]
       libc.so.6(__libc_start_main+0xF0) [0x7f8fbb07afe0]
       mongod(+0x3E02C9) [0x7e02c9]
      

      Note that this affects 3.0.1-rc0 and not 3.0.0, so it appears to be a regression. This scenario seems to require the validate(), so I suspect it could be related to the new functionality in SERVER-17481 relating to validate - either in WT or in the integration layer.

            Assignee:
            michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter:
            bruce.lucas@mongodb.com Bruce Lucas (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

              Created:
              Updated:
              Resolved: