Loading...

Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: DHandles
Labels:
- stability

Assigned Teams:

Storage Engines
Sprint:
StorEng - 2023-06-13, 2023-07-25 Absolute unit
Story Points:
3
Evergreen Project:
- wiredtiger
Linked BFG List:
https://buildbaron.corp.mongodb.com/ui/#/bf/WT-11133
Count of Linked BFGs (Last 30 days):
91

~~WT-10994~~ fixed a correctness issue where the sweep server was ignoring the WT_DHANDLE_DROPPED flag and thereby causing use-after-free errors.

However, that ~~WT-10994~~ fix impacted the threshold checks in test_sweep04.py, with the result that the unit-test-extra-long task (and variants thereof) are failing. The WT_DHANDLE_DROPPED change in ~~WT-10994~~ was subsequently reverted in ~~WT-11143~~ to get the Evergreen tasks passing again (but restoring the risk of the use-after-free error).

From https://evergreen.mongodb.com/task/wiredtiger_ubuntu2004_unit_test_extra_long_b8a85f4c8a53a14d9b899c098b6a90968a1f3bd1_23_05_30_02_41_41:

[2023/05/30 04:22:11.593] ======================================================================
[2023/05/30 04:22:11.593] [pid:4428]: FAIL: test_sweep04.test_sweep04.test_big_run
[2023/05/30 04:22:11.593] ----------------------------------------------------------------------
[2023/05/30 04:22:11.593] [pid:4428]: testtools.testresult.real._StringException: Traceback (most recent call last):
[2023/05/30 04:22:11.593]   File "/data/mci/e87b86bb92d8e6f597aa111c744fd30d/wiredtiger/test/suite/wttest.py", line 468, in _callTestMethod
[2023/05/30 04:22:11.593]     method()
[2023/05/30 04:22:11.593]   File "/data/mci/e87b86bb92d8e6f597aa111c744fd30d/wiredtiger/test/suite/test_sweep04.py", line 243, in test_big_run
[2023/05/30 04:22:11.593]     self.assertLess(abs(q2_slope), 20.0)
[2023/05/30 04:22:11.593] AssertionError: 100.12518607442978 not less than 20.0

And from https://evergreen.mongodb.com/task/wiredtiger_rhel80_unit_test_extra_long_b66c52a0deaf7a2f45ea40100d40b352e2a11ab8_23_05_29_03_41_40

[2023/05/29 05:18:50.222] ======================================================================
[2023/05/29 05:18:50.222] [pid:7057]: FAIL: test_sweep04.test_sweep04.test_big_run
[2023/05/29 05:18:50.222] ----------------------------------------------------------------------
[2023/05/29 05:18:50.222] [pid:7057]: testtools.testresult.real._StringException: Traceback (most recent call last):
[2023/05/29 05:18:50.222]   File "/data/mci/432fcce236978afd7e2075c6e01d463c/wiredtiger/test/suite/wttest.py", line 468, in _callTestMethod
[2023/05/29 05:18:50.222]     method()
[2023/05/29 05:18:50.222]   File "/data/mci/432fcce236978afd7e2075c6e01d463c/wiredtiger/test/suite/test_sweep04.py", line 242, in test_big_run
[2023/05/29 05:18:50.222]     self.assertLess(abs(q2_slope), q1_slope)
[2023/05/29 05:18:50.222] AssertionError: 101.36264105642258 not less than 101.27452581032414
[2023/05/29 05:18:50.222] ----------------------------------------------------------------------

Here is the tail of the results.txt file from an example failing run of test_sweep04:

    test_sweep04.test_sweep04.test_big_run: 19800/20000  stressing=False
    test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=0, remove=10117, sweep=10101, session_handles=195650, session_sweeps=14366, sweep_tod=220067, sweep_ref=0
    test_sweep04.test_sweep04.test_big_run:   dhandle_count=10013
    test_sweep04.test_sweep04.test_big_run:   file_open=114
    test_sweep04.test_sweep04.test_big_run:   average,slope=[7779.914572864322, 46.345302268920356]
    test_sweep04.test_sweep04.test_big_run: 19900/20000  stressing=False
    test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=0, remove=10117, sweep=10101, session_handles=195650, session_sweeps=14425, sweep_tod=220067, sweep_ref=0
    test_sweep04.test_sweep04.test_big_run:   dhandle_count=10013
    test_sweep04.test_sweep04.test_big_run:   file_open=114
    test_sweep04.test_sweep04.test_big_run:   average,slope=[7791.08, 45.9868776719418]
    test_sweep04.test_sweep04.test_big_run: 1st qtr:  average=3051.78,slope=101.6709243697479
    test_sweep04.test_sweep04.test_big_run: 2nd qtr:  average=8066.38,slope=100.74492196878751
    test_sweep04.test_sweep04.test_big_run: 3rd qtr:  average=10032.28,slope=-1.4642016806722689
    test_sweep04.test_sweep04.test_big_run: 4th qtr:  average=10013.88,slope=-0.04869147659063625
    test_sweep04.test_sweep04.test_big_run: end run: average=10013.2,slope=-0.0481203007518797
    test_sweep04.test_sweep04.test_big_run: finishing
    test_sweep04.test_sweep04.test_big_run: passed=False
    test_sweep04.test_sweep04.test_big_run: skipped=False
    test_sweep04.test_sweep04.test_big_run: preserving directory WT_TEST/test_sweep04.0
    test_sweep04.test_sweep04.test_big_run: FAIL
    test_sweep04.test_sweep04.test_big_run: preserving directory WT_TEST/test_sweep04.0
  test_sweep04.test_sweep04.test_big_run: TEST COMPLETED

Here is the tail of the result.txt file from an example passing run of test_sweep04 prior to ~~WT-10994~~.

    test_sweep04.test_sweep04.test_big_run: 19800/20000  stressing=False
    test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=9900, remove=20006, sweep=10102, session_handles=195479, session_sweeps=12192, sweep_tod=218423, sweep_ref=1
    test_sweep04.test_sweep04.test_big_run:   dhandle_count=113
    test_sweep04.test_sweep04.test_big_run:   file_open=114
    test_sweep04.test_sweep04.test_big_run:   average,slope=[1484.4422110552764, -14.248078270138572]
    test_sweep04.test_sweep04.test_big_run: 19900/20000  stressing=False
    test_sweep04.test_sweep04.test_big_run: DHANDLE STATS: close=9900, remove=20006, sweep=10102, session_handles=195479, session_sweeps=12211, sweep_tod=218423, sweep_ref=1
    test_sweep04.test_sweep04.test_big_run:   dhandle_count=113
    test_sweep04.test_sweep04.test_big_run:   file_open=114
    test_sweep04.test_sweep04.test_big_run:   average,slope=[1477.585, -14.240113252831321]
    test_sweep04.test_sweep04.test_big_run: 1st qtr:  average=2053.6,slope=36.05887154861945
    test_sweep04.test_sweep04.test_big_run: 2nd qtr:  average=2726.72,slope=4.8072989195678275
    test_sweep04.test_sweep04.test_big_run: 3rd qtr:  average=1013.46,slope=-48.96888355342137
    test_sweep04.test_sweep04.test_big_run: 4th qtr:  average=116.56,slope=-0.31951980792316925
    test_sweep04.test_sweep04.test_big_run: end run: average=113.2,slope=-0.0481203007518797
    test_sweep04.test_sweep04.test_big_run: finishing
    test_sweep04.test_sweep04.test_big_run: passed=True
    test_sweep04.test_sweep04.test_big_run: skipped=False
  test_sweep04.test_sweep04.test_big_run: TEST COMPLETED

The passing test shows the number of dhandles dropping significantly during the test, whereas in the failing test they stay much higher thus failing the thresholds in the test.

Any dhandles that are still open are freed when the connection is closed.

The goal of this ticket is to do two things simultaneously:

close dropped dhandles as soon as practical, and
avoid the use-after-free error discussed in ~~WT-10994~~.

We know how to do each of those on their own, but doing them together is proving to be a challenge so far.

duplicates

WT-11181 heap-use-after-free during dhandle open

Closed

is duplicated by

WT-11132 Make drop table close dhandles

Closed

is related to

WT-11132 Make drop table close dhandles

Closed

WT-10576 Return EBUSY on forced drop if there is an active transaction

Closed

WT-11143 Temporary partial revert to restore Evergreen greenness

Closed

related to

SERVER-78134 disable concurrency_simultaneous_replication_wiredtiger_cursor_sweeps on macos-arm64 and macOS

Closed

(1 related to)

Details

Description

Attachments

Issue Links

Activity

People

Dates