Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-11133

Correctly drop dhandles to avoid use-after-free error

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: DHandles
    • 3
    • StorEng - 2023-06-13, 2023-07-25 Absolute unit

      WT-10994 fixed a correctness issue where the sweep server was ignoring the WT_DHANDLE_DROPPED flag and thereby causing use-after-free errors.

      However, that WT-10994 fix impacted the threshold checks in test_sweep04.py, with the result that the unit-test-extra-long task (and variants thereof) are failing. The WT_DHANDLE_DROPPED change in WT-10994  was subsequently reverted in WT-11143 to get the Evergreen tasks passing again (but restoring the risk of the use-after-free error).

      From https://evergreen.mongodb.com/task/wiredtiger_ubuntu2004_unit_test_extra_long_b8a85f4c8a53a14d9b899c098b6a90968a1f3bd1_23_05_30_02_41_41: 

      [2023/05/30 04:22:11.593] ======================================================================
      [2023/05/30 04:22:11.593] [pid:4428]: FAIL: test_sweep04.test_sweep04.test_big_run
      [2023/05/30 04:22:11.593] ----------------------------------------------------------------------
      [2023/05/30 04:22:11.593] [pid:4428]: testtools.testresult.real._StringException: Traceback (most recent call last):
      [2023/05/30 04:22:11.593]   File "/data/mci/e87b86bb92d8e6f597aa111c744fd30d/wiredtiger/test/suite/wttest.py", line 468, in _callTestMethod
      [2023/05/30 04:22:11.593]     method()
      [2023/05/30 04:22:11.593]   File "/data/mci/e87b86bb92d8e6f597aa111c744fd30d/wiredtiger/test/suite/test_sweep04.py", line 243, in test_big_run
      [2023/05/30 04:22:11.593]     self.assertLess(abs(q2_slope), 20.0)
      [2023/05/30 04:22:11.593] AssertionError: 100.12518607442978 not less than 20.0 

      And from https://evergreen.mongodb.com/task/wiredtiger_rhel80_unit_test_extra_long_b66c52a0deaf7a2f45ea40100d40b352e2a11ab8_23_05_29_03_41_40 

      [2023/05/29 05:18:50.222] ======================================================================
      [2023/05/29 05:18:50.222] [pid:7057]: FAIL: test_sweep04.test_sweep04.test_big_run
      [2023/05/29 05:18:50.222] ----------------------------------------------------------------------
      [2023/05/29 05:18:50.222] [pid:7057]: testtools.testresult.real._StringException: Traceback (most recent call last):
      [2023/05/29 05:18:50.222]   File "/data/mci/432fcce236978afd7e2075c6e01d463c/wiredtiger/test/suite/wttest.py", line 468, in _callTestMethod
      [2023/05/29 05:18:50.222]     method()
      [2023/05/29 05:18:50.222]   File "/data/mci/432fcce236978afd7e2075c6e01d463c/wiredtiger/test/suite/test_sweep04.py", line 242, in test_big_run
      [2023/05/29 05:18:50.222]     self.assertLess(abs(q2_slope), q1_slope)
      [2023/05/29 05:18:50.222] AssertionError: 101.36264105642258 not less than 101.27452581032414
      [2023/05/29 05:18:50.222] ---------------------------------------------------------------------- 

      Here is the tail of the results.txt file from an example failing run of test_sweep04:

          test_sweep04.test_sweep04.test_big_run: 19800/20000  stressing=False
          test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=0, remove=10117, sweep=10101, session_handles=195650, session_sweeps=14366, sweep_tod=220067, sweep_ref=0
          test_sweep04.test_sweep04.test_big_run:   dhandle_count=10013
          test_sweep04.test_sweep04.test_big_run:   file_open=114
          test_sweep04.test_sweep04.test_big_run:   average,slope=[7779.914572864322, 46.345302268920356]
          test_sweep04.test_sweep04.test_big_run: 19900/20000  stressing=False
          test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=0, remove=10117, sweep=10101, session_handles=195650, session_sweeps=14425, sweep_tod=220067, sweep_ref=0
          test_sweep04.test_sweep04.test_big_run:   dhandle_count=10013
          test_sweep04.test_sweep04.test_big_run:   file_open=114
          test_sweep04.test_sweep04.test_big_run:   average,slope=[7791.08, 45.9868776719418]
          test_sweep04.test_sweep04.test_big_run: 1st qtr:  average=3051.78,slope=101.6709243697479
          test_sweep04.test_sweep04.test_big_run: 2nd qtr:  average=8066.38,slope=100.74492196878751
          test_sweep04.test_sweep04.test_big_run: 3rd qtr:  average=10032.28,slope=-1.4642016806722689
          test_sweep04.test_sweep04.test_big_run: 4th qtr:  average=10013.88,slope=-0.04869147659063625
          test_sweep04.test_sweep04.test_big_run: end run: average=10013.2,slope=-0.0481203007518797
          test_sweep04.test_sweep04.test_big_run: finishing
          test_sweep04.test_sweep04.test_big_run: passed=False
          test_sweep04.test_sweep04.test_big_run: skipped=False
          test_sweep04.test_sweep04.test_big_run: preserving directory WT_TEST/test_sweep04.0
          test_sweep04.test_sweep04.test_big_run: FAIL
          test_sweep04.test_sweep04.test_big_run: preserving directory WT_TEST/test_sweep04.0
        test_sweep04.test_sweep04.test_big_run: TEST COMPLETED 

      Here is the tail of the result.txt file from an example passing run of test_sweep04 prior to WT-10994.

          test_sweep04.test_sweep04.test_big_run: 19800/20000  stressing=False
          test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=9900, remove=20006, sweep=10102, session_handles=195479, session_sweeps=12192, sweep_tod=218423, sweep_ref=1
          test_sweep04.test_sweep04.test_big_run:   dhandle_count=113
          test_sweep04.test_sweep04.test_big_run:   file_open=114
          test_sweep04.test_sweep04.test_big_run:   average,slope=[1484.4422110552764, -14.248078270138572]
          test_sweep04.test_sweep04.test_big_run: 19900/20000  stressing=False
          test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=9900, remove=20006, sweep=10102, session_handles=195479, session_sweeps=12211, sweep_tod=218423, sweep_ref=1
          test_sweep04.test_sweep04.test_big_run:   dhandle_count=113
          test_sweep04.test_sweep04.test_big_run:   file_open=114
          test_sweep04.test_sweep04.test_big_run:   average,slope=[1477.585, -14.240113252831321]
          test_sweep04.test_sweep04.test_big_run: 1st qtr:  average=2053.6,slope=36.05887154861945
          test_sweep04.test_sweep04.test_big_run: 2nd qtr:  average=2726.72,slope=4.8072989195678275
          test_sweep04.test_sweep04.test_big_run: 3rd qtr:  average=1013.46,slope=-48.96888355342137
          test_sweep04.test_sweep04.test_big_run: 4th qtr:  average=116.56,slope=-0.31951980792316925
          test_sweep04.test_sweep04.test_big_run: end run: average=113.2,slope=-0.0481203007518797
          test_sweep04.test_sweep04.test_big_run: finishing
          test_sweep04.test_sweep04.test_big_run: passed=True
          test_sweep04.test_sweep04.test_big_run: skipped=False
        test_sweep04.test_sweep04.test_big_run: TEST COMPLETED 

      The passing test shows the number of dhandles dropping significantly during the test, whereas in the failing test they stay much higher thus failing the thresholds in the test.

      Any dhandles that are still open are freed when the connection is closed.

      The goal of this ticket is to do two things simultaneously:

      • close dropped dhandles as soon as practical, and
      • avoid the use-after-free error discussed in WT-10994.

      We know how to do each of those on their own, but doing them together is proving to be a challenge so far.

            Assignee:
            jeremy.thorp@mongodb.com Jeremy Thorp
            Reporter:
            jeremy.thorp@mongodb.com Jeremy Thorp
            Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated:
              Resolved: