Core Server / SERVER-17304

rc8 on CentOS 6.6 crashed under heavy r/w workload

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.0.0-rc8
    • Component/s: Internal Code

      On the box:
      cd /home/buzz/
      sh runt stateEvent --port 37017 -d test5 --numevents 2000000 --posn 100000 --threads 32 --drop --payload 8192

      This is the big IBM X6 box:

      [buzz@IDF-IBM-Test-1 ~]$ uname -a
      Linux IDF-IBM-Test-1.10gen.cc 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
      [buzz@IDF-IBM-Test-1 ~]$ cat /etc/re
      readahead.conf    redhat-lsb/       redhat-release    request-key.conf  request-key.d/    resolv.conf
      [buzz@IDF-IBM-Test-1 ~]$ cat /etc/redhat-release 
      CentOS release 6.6 (Final)
      [buzz@IDF-IBM-Test-1 ~]$ df
      Filesystem           1K-blocks     Used Available Use% Mounted on
      /dev/mapper/vg_idfibmtest1-lv_root
                            51606140 11926496  37058204  25% /
      tmpfs                396812172        0 396812172   0% /dev/shm
      /dev/sda1               495844    96335    373909  21% /boot
      /dev/mapper/vg_idfibmtest1-lv_home
                           231167100   832892 218591592   1% /home
      /dev/md1             768924576 20997468 708868020   3% /data/1
      /dev/md2             768924576 10687204 719178284   2% /data/2
      /dev/md3             768924576   201440 729664048   1% /data/3
      /dev/md4             768924576  3242900 726622588   1% /data/4
      
      top - 06:18:58 up 5 days, 20:39,  3 users,  load average: 0.00, 0.00, 0.07
      Tasks: 1814 total,   1 running, 1813 sleeping,   0 stopped,   0 zombie
      Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
      Mem:  793624344k total, 10494188k used, 783130156k free,   270752k buffers
      Swap:  4194296k total,        0k used,  4194296k free,  4892660k cached
      

      The rc8 server is vanilla WiredTiger with no special options:

      numactl --interleave=all /home/buzz/3.0.0-rc8/bin/mongod --storageEngine=wiredTiger --port 37017 --dbpath /data/4/data0/db0 --logpath /tmp/mongo0.log --fork
      

      The program is a performance test. It generates load for a state-event model like this:
      1. Find an item in the currentPos collection using findAndModify to "logically reserve" it
      2. Copy it to the historicPos collection
      3. Insert an event doc into the events collection
      4. findAndModify the same item in currentPos to write updated data
      There are 32 threads. Each thread has a loop that randomly picks a position key and goes through the steps outlined above.

      Failure occurs about 90% of the time between 3000 and 7000 turns of the crank (1-4 above).
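
      For illustration, here is a minimal sketch of one "turn of the crank" (steps 1-4 above), written against pymongo rather than the actual runt harness. The collection names and the k/r/n/val/ts/x fields come from the findAndModify commands in the log below; the payload generation, key distribution, and threading details are assumptions.

      import os
      import random
      import threading
      from datetime import datetime

      from bson.binary import Binary
      from pymongo import MongoClient, ReturnDocument

      client = MongoClient("localhost", 37017)
      db = client["test5"]

      def one_turn(num_positions=100000, payload_bytes=8192):
          key = "P%d" % random.randrange(num_positions)

          # 1. "Logically reserve" the position: flip r from 0 to 1 via findAndModify.
          cur = db.currentPos.find_one_and_update(
              {"k": key, "r": 0},
              {"$set": {"r": 1, "ts": datetime.utcnow(),
                        "x": Binary(os.urandom(payload_bytes))}},
              upsert=True,
              return_document=ReturnDocument.AFTER,
          )

          # 2. Copy the reserved item to the historicPos collection.
          hist = dict(cur)
          hist.pop("_id", None)
          db.historicPos.insert_one(hist)

          # 3. Insert an event doc into the events collection.
          db.events.insert_one({"k": key, "ts": datetime.utcnow(), "val": cur.get("val")})

          # 4. findAndModify the same item again to write updated data and release it.
          db.currentPos.find_one_and_update(
              {"k": key, "r": 1},
              {"$inc": {"n": 1},
               "$set": {"val": random.randrange(1000), "ts": datetime.utcnow(), "r": 0}},
              upsert=True,
          )

      def worker(turns):
          for _ in range(turns):
              one_turn()

      # 32 threads and 2,000,000 events total, mirroring the runt invocation above.
      threads = [threading.Thread(target=worker, args=(2000000 // 32,)) for _ in range(32)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()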

      This seems to be the start of trouble:

      2015-02-17T05:54:22.306-0500 F REPL     [conn27] Rollback of logOp not currently allowed (SERVER-15192)
      2015-02-17T05:54:22.306-0500 I -        [conn27] Fatal Assertion 18805
      2015-02-17T05:54:22.318-0500 I COMMAND  [conn22] command test5.$cmd command: findAndModify { findandmodify: "currentPos", query: { k: "P5507", r: 1 }, update: { $inc: { n: 1 }, $set: { val: 121, ts: new Date(1424170462154), r: 0 } }, upsert: true } update: { $inc: { n: 1 }, $set: { val: 121, ts: new Date(1424170462154), r: 0 } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:8361 163ms
      2015-02-17T05:54:22.320-0500 I COMMAND  [conn30] command test5.$cmd command: findAndModify { findandmodify: "currentPos", query: { k: "P48449", r: 0 }, update: { $set: { r: 1, ts: new Date(1424170462156), x: BinData(0, 02654E522D2D326778F31B1BA8B6A8EF889742FAFEC2AA305660EADF51F63400EC4832AABD4D264BAFBCF357C98DD4BE590A18CB892BB02F009BCD4C7D1BF08FF6023526BA33...) } }, new: true, upsert: true } update: { $set: { r: 1, ts: new Date(1424170462156), x: BinData(0, 02654E522D2D326778F31B1BA8B6A8EF889742FAFEC2AA305660EADF51F63400EC4832AABD4D264BAFBCF357C98DD4BE590A18CB892BB02F009BCD4C7D1BF08FF6023526BA33...) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:8362 163ms
      2015-02-17T05:54:22.320-0500 I CONTROL  [conn27] 
       0xf3f059 0xee9191 0xecd5b1 0xc247ac 0xd60996 0x7e42f6 0x94218e 0x9429de 0x9b9434 0x9ba373 0x9baf6b 0xb8abe5 0xa9c968 0x7e6cc0 0xefd11b 0x30ba2079d1 0x30b9ee88fd
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B3F059"},{"b":"400000","o":"AE9191"},{"b":"400000","o":"ACD5B1"},{"b":"400000","o":"8247AC"},{"b":"400000","o":"960996"},{"b":"400000","o":"3E42F6"},{"b":"400000","o":"54218E"},{"b":"400000","o":"5429DE"},{"b":"400000","o":"5B9434"},{"b":"400000","o":"5BA373"},{"b":"400000","o":"5BAF6B"},{"b":"400000","o":"78ABE5"},{"b":"400000","o":"69C968"},{"b":"400000","o":"3E6CC0"},{"b":"400000","o":"AFD11B"},{"b":"30BA200000","o":"79D1"},{"b":"30B9E00000","o":"E88FD"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc8", "gitVersion" : "9d0714cdcffadb203ff68730acedbc40875403da", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.el6.x86_64", "version" : "#1 SMP Fri Nov 22 03:15:09 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFEEEFF000", "elfType" : 3 }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf3f059]
       mongod(_ZN5mongo10logContextEPKc+0xE1) [0xee9191]
       mongod(_ZN5mongo13fassertFailedEi+0x61) [0xecd5b1]
       mongod(+0x8247AC) [0xc247ac]
       mongod(_ZN5mongo22WiredTigerRecoveryUnit6_abortEv+0x36) [0xd60996]
       mongod(_ZN5mongo15WriteUnitOfWorkD1Ev+0x16) [0x7e42f6]
       mongod(_ZN5mongo16CmdFindAndModify7runImplEPNS_16OperationContextERKSsS4_RKNS_7BSONObjES7_S7_S7_bbbRNS_14BSONObjBuilderERSs+0x18AE) [0x94218e]
       mongod(_ZN5mongo16CmdFindAndModify3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x2AE) [0x9429de]
       mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9b9434]
       mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9ba373]
       mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9baf6b]
       mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_b+0x755) [0xb8abe5]
       mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xAC8) [0xa9c968]
       mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e6cc0]
       mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x32B) [0xefd11b]
       libpthread.so.0(+0x79D1) [0x30ba2079d1]
       libc.so.6(clone+0x6D) [0x30b9ee88fd]
      -----  END BACKTRACE  -----
      2015-02-17T05:54:22.320-0500 I -        [conn27] 
      
      ***aborting after fassert() failure
      

      Interesting note: the same test program (but with 16 threads instead of 32) ran OK all day yesterday (start, stop, reset, over and over) on an r3.4xlarge AWS instance running ami-d13845e1.

            Assignee: Mathias Stearn (mathias@mongodb.com)
            Reporter: Buzz Moschetti (buzz.moschetti)
            Votes: 0
            Watchers: 5
