WiredTiger / WT-1777

Degenerate case in wtperf run paging in

    • Type: Task
    • Resolution: Done
    • Affects Version/s: None
    • Component/s: None

      While running log-append.wtperf on Jenkins, wtperf effectively hung in __wt_page_in_func. Although it was running the code from WT-1774, I do not believe the log code is involved in this issue.

      The test is running on the Jenkins AWS box, an m2.4xlarge, equivalent to my AWS HDD. I ended up killing it after gathering some information.

      A normal test run completes in about 169 seconds and the test.stat file looks like this:

      1687998 populate inserts (9659745 of 50000000) in 5 secs (35 total secs)
      1998511 populate inserts (11658256 of 50000000) in 5 secs (40 total secs)
      1932159 populate inserts (13590415 of 50000000) in 5 secs (45 total secs)
      1986784 populate inserts (15577199 of 50000000) in 5 secs (50 total secs)
      

      This test run hit some kind of problem very quickly and never recovered. Its test.stat file looks like:

      157894 populate inserts (157894 of 50000000) in 5 secs (5 total secs)
      503 populate inserts (158397 of 50000000) in 5 secs (10 total secs)
      504 populate inserts (158901 of 50000000) in 5 secs (15 total secs)
      506 populate inserts (159407 of 50000000) in 5 secs (20 total secs)
      ...
      500 populate inserts (222988 of 50000000) in 5 secs (655 total secs)
      512 populate inserts (223500 of 50000000) in 5 secs (660 total secs)
      504 populate inserts (224004 of 50000000) in 5 secs (665 total secs)
      

      After more than 10 minutes it had completed only 224k of its 50M operations: we are getting only about 100 operations per second!
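
      For scale, here is a back-of-the-envelope projection (a sketch only, assuming the rate stays at roughly 500 inserts per 5-second interval, as in the test.stat output above):

      /*
       * Rough projection of the degenerate run, assuming the rate holds at
       * ~500 populate inserts per 5-second reporting interval.
       */
      #include <stdio.h>

      int
      main(void)
      {
          const double inserts_per_interval = 500.0;  /* from test.stat above */
          const double interval_secs = 5.0;
          const double total_inserts = 50000000.0;    /* 50M populate inserts */

          double rate = inserts_per_interval / interval_secs; /* ~100/sec */
          double secs = total_inserts / rate;                 /* ~500,000 secs */

          printf("%.0f inserts/sec -> roughly %.0f hours to finish\n",
              rate, secs / 3600.0);
          return (0);
      }

      At that rate the populate phase would take roughly 139 hours instead of the usual ~169 seconds.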

      I ran pmp and it shows the same picture every time: our 8 populate threads are sleeping and nothing else is going on:

            8 select,__wt_sleep,__wt_page_in_func,__wt_row_search,__wt_btcur_insert,__curfile_insert,populate_thread,start_thread,clone
            2 
            1 select,__wt_sleep,__evict_server,start_thread,clone
            1 pthread_cond_wait@@GLIBC_2.3.2,__wt_cond_wait,__ckpt_server,start_thread,clone
            1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__sweep_server,start_thread,clone
            1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__log_wrlsn_server,start_thread,clone
            1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__log_server,start_thread,clone
            1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__log_close_server,start_thread,clone
      
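      The stacks show the populate threads inside the sleep/back-off path of __wt_page_in_func rather than doing I/O. As a point of reference only (a minimal hypothetical sketch, not the actual WiredTiger source), the pattern those stacks imply looks roughly like this: a thread trying to read a page in finds the cache too full, sleeps with an increasing delay and retries, so if eviction never frees space the thread just keeps sleeping:

      /*
       * Hypothetical sketch only (not actual WiredTiger code) of the kind of
       * back-off loop the stacks above suggest.
       */
      #include <stdbool.h>
      #include <stdio.h>
      #include <unistd.h>

      /* Hypothetical stand-in for "cache too full to read another page in". */
      static bool
      cache_full(void)
      {
          static int calls;

          return (++calls < 5);   /* pretend eviction eventually frees space */
      }

      static void
      page_in_backoff(void)
      {
          unsigned int usecs = 100;

          while (cache_full()) {
              usleep(usecs);      /* shows up as select/__wt_sleep in pmp */
              if (usecs < 100000)
                  usecs *= 2;     /* bounded exponential back-off */
          }
          printf("cache has space, page-in can proceed\n");
      }

      int
      main(void)
      {
          page_in_backoff();
          return (0);
      }
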

      The AWS console charts show virtually no I/O during this test's run time, while they show significant I/O during the other runs of this test.

            Assignee: Unassigned
            Reporter: Susan LoVerso (sue.loverso@mongodb.com)