Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-9121

Eviction server thread always reset the traversing point when exiting -- parts of the B-tree may never be traversed by eviction in some simple cases/

    • Storage Engines
    • 8
    • c(3x10^8)-StorEng - 2023-11-14, 2023-11-28 - Anthill Tiger, 2023-12-12 - Heisenbug, 2024-01-09 - I Grew Tired, StorEng - 2024-01-23, 2024-02-06 tapioooooooooooooca, 2024-02-20_A_near-death_puffin, 2024-03-05 - Claronald, 2024-03-19 - PacificOcean

      Summary
      The eviction server thread, when exiting, will clear btree->evict_ref, which records the current point in traversing this btree, by calling __evict_clear_walk() when finishing this eviction run. This means the next background eviction thread would traverse the btree from the beginning, and if it finishes before scanning through the whole tree, the next one would again start from the beginning and parts of the tree would never be scanned for eviction.

       

      This means the current eviction mechanism cannot adapt to certain access patterns in certain (but common) conditions and fail to decide hotness of data correctly and evict hot data from cache. Workloads focusing on former parts of the tree would be much slower than those focusing on latter parts because former parts are always scanned and evicted from cache while latter parts are persistent in cache.

      Reproduction and Consequence

      This issue occur in the following setup:

      1. I write an application that uses WiredTiger directly.
      2. The database is using a single B-tree, sequentially filling 10,000,000 key value pairs of (16 + 72) bytes resulting in a ~1G uncompressed database size. With compression enabled (snappy compression), the on-disk file size is ~500M.
      3. Total available memory for the application is 1G, and the cache size of WiredTiger is set to 820000KB (but actually only ~700M is used by the application; this may be another issue.)  The workload runs 2,000,000 get requests in a single thread, either uniformly random, or focusing (>80% of all requests) on the first half of keys, or focusing on the second half. The total latency differs a lot (see the table below). The reason is described as above, that the first half pages in the tree are always chosen for eviction regardless of the workload.
      4. Most parameters including those related to eviction, except for the cache size, are default.

      I think any read workload with a single large btree and mild cache pressure (so the eviction thread finishes working before scanning through the whole tree) and penalizes cache miss heavily (e.g. by decompression cost) would reproduce similar results.

      I have modified the code and verified that only the first half pages in the tree are frequently evicted from the cache regardless of the access pattern.

      This behavior is checked to be true on WiredTiger version 10.0.0 (the current master branch) and 10.0.2 (the current develop branch)

       

      Workload Focusing on first half Uniform Random Focusing on second half
      Latency 47s 36s 14s

       

        1. 1st.png
          857 kB
          Sulabh Mahajan
        2. 2nd.png
          662 kB
          Sulabh Mahajan
        3. cxxopts.hpp
          44 kB
          Yifan Dai
        4. image-2023-11-30-07-07-52-326.png
          36 kB
          Yury Ershov
        5. random-descent-no-memory-histogram.png
          70 kB
          Yury Ershov
        6. random-descent-no-memory-timeline.png
          37 kB
          Yury Ershov
        7. random-descent-no-memory-timeline-zoom.png
          14 kB
          Yury Ershov
        8. random-descent-timeline.png
          26 kB
          Yury Ershov
        9. random-rescent-histogram.png
          72 kB
          Yury Ershov
        10. screenshot-1.png
          286 kB
          Yury Ershov
        11. screenshot-10.png
          27 kB
          Yury Ershov
        12. screenshot-11.png
          20 kB
          Yury Ershov
        13. screenshot-2.png
          864 kB
          Yury Ershov
        14. screenshot-3.png
          252 kB
          Yury Ershov
        15. screenshot-4.png
          137 kB
          Yury Ershov
        16. screenshot-5.png
          120 kB
          Yury Ershov
        17. screenshot-6.png
          19 kB
          Yury Ershov
        18. screenshot-7.png
          19 kB
          Yury Ershov
        19. screenshot-8.png
          37 kB
          Yury Ershov
        20. screenshot-9.png
          27 kB
          Yury Ershov
        21. screenshot-latency-50.png
          158 kB
          Yury Ershov
        22. screenshot-latency-95.png
          155 kB
          Yury Ershov
        23. screenshot-throughput.png
          150 kB
          Yury Ershov
        24. simple_read.cpp
          6 kB
          Yifan Dai
        25. WiredTigerStat.29.06
          4.87 MB
          Yury Ershov
        26. wt_trace_eviction-8.png
          61 kB
          Yifan Dai

            Assignee:
            y.ershov@mongodb.com Yury Ershov
            Reporter:
            ydai59@wisc.edu Yifan Dai
            Votes:
            0 Vote for this issue
            Watchers:
            22 Start watching this issue

              Created:
              Updated:
              Resolved: