As part of debugging WT-6729, I noticed behaviour where the eviction queue is full but workers aren't dequeuing pages and trying to evict them.
We frequently see this in test/format when the workload finishes. We shut down all threads and execute rollback to stable. With the changes in WT-6729, we wait until eviction quiesces, which it never does because of this bug.
This is caused by multiple eviction workers taking on the role of the eviction server. I believe the offending commit is here:
https://github.com/wiredtiger/wiredtiger/commit/01eabea0778191adee45a14aadf63a88f902345e#diff-fba173dbad7997f847117319e1939715f1969181ab2937508a2c9303b201f017
Our assumption is that since we are the only worker thread in the system, we're free to release the pass lock since there's no thread to acquire it and become the server. However, the current thread count is not an accurate reflection of what is happening in the system.
When we stop a thread in a thread group, we signal that thread to stop by clearing WT_THREAD_ACTIVE. It does not stop immediately; it only stops after the next iteration of whatever it is doing. It could easily be at the beginning of the eviction logic, in which case it will see that the pass lock is free and acquire it to become the server, despite the current thread count being 1.
If I make this change, I don't see the problem anymore. Obviously, this isn't the fix, because we specifically release the lock here to avoid a deadlock, but it supports the theory that releasing the lock is what gets us into trouble.
@@ -743,9 +742,9 @@ __evict_pass(WT_SESSION_IMPL *session)
              * is already another server is running.
              */
             FLD_CLR(session->lock_flags, WT_SESSION_LOCKED_PASS);
-            __wt_spin_unlock(session, &cache->evict_pass_lock);
             ret = __evict_lru_pages(session, true);
-            __wt_spin_lock(session, &cache->evict_pass_lock);
             FLD_SET(session->lock_flags, WT_SESSION_LOCKED_PASS);
             WT_RET(ret);
         }
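To make the suspected interleaving concrete, here is a minimal single-threaded model of it. This is only a sketch of the theory above, not WiredTiger code: every `sim_*` name is hypothetical, the pass lock is reduced to a boolean, and the model assumes that the advertised thread count drops as soon as the stop is requested, while the stopped worker is still mid-iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in WiredTiger. */
enum sim_phase { SIM_CHECK_ACTIVE, SIM_EVICT, SIM_EXITED };

struct sim_worker {
    bool active;          /* models WT_THREAD_ACTIVE */
    enum sim_phase phase; /* where the worker is in its loop */
    bool was_server;      /* did it ever take the pass lock? */
};

struct sim_cache {
    bool pass_lock_held;      /* models the eviction pass lock */
    unsigned current_threads; /* thread count the group advertises */
};

/* Request a stop: clear the flag and drop the count (assumed behaviour). */
static void
sim_stop_worker(struct sim_cache *c, struct sim_worker *w)
{
    assert(c->current_threads > 0);
    w->active = false;
    c->current_threads--;
}

/* Advance one worker by a single step of its loop. */
static void
sim_step(struct sim_cache *c, struct sim_worker *w)
{
    switch (w->phase) {
    case SIM_CHECK_ACTIVE:
        /* The active flag is only consulted at the top of the loop. */
        w->phase = w->active ? SIM_EVICT : SIM_EXITED;
        break;
    case SIM_EVICT:
        /* Whoever finds the pass lock free becomes the server. */
        if (!c->pass_lock_held) {
            c->pass_lock_held = true;
            w->was_server = true;
        }
        w->phase = SIM_CHECK_ACTIVE;
        break;
    case SIM_EXITED:
        break;
    }
}

/*
 * Replay the interleaving. If release_lock is true, the server drops the
 * pass lock when it believes it is the only thread. Returns how many
 * workers acted as the server.
 */
static int
sim_race(bool release_lock)
{
    struct sim_cache c = {false, 2};
    struct sim_worker a = {true, SIM_CHECK_ACTIVE, false};
    struct sim_worker b = {true, SIM_CHECK_ACTIVE, false};

    sim_step(&c, &a);        /* A passes the active check */
    sim_step(&c, &a);        /* A takes the pass lock: it is the server */
    sim_step(&c, &b);        /* B passes the active check */
    sim_stop_worker(&c, &b); /* count is now 1, but B is still in flight */

    /* A sees a thread count of 1 and releases the lock to evict. */
    if (release_lock && c.current_threads == 1)
        c.pass_lock_held = false;

    sim_step(&c, &b);        /* B reaches the eviction logic */
    return (int)a.was_server + (int)b.was_server;
}
```

In this model, `sim_race(true)` ends with two workers having acted as the server, while `sim_race(false)`, which corresponds to keeping the lock held as in the experimental change, ends with one. It says nothing about the deadlock that the unlock was introduced to avoid.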
Reproducer
- Checkout wt-6729-hs-stop-rts.
- Add an assert like so:
     }
-    if (retries == WT_RTS_EVICT_MAX_RETRIES)
+    if (retries == WT_RTS_EVICT_MAX_RETRIES) {
         WT_ERR(__wt_msg(session, "timed out waiting for eviction to quiesce, running rts"));
+        WT_ASSERT(session, false);
+    }
- Schedule an Evergreen patch and run format stress testing.
Scope
Figure out a solution and fix the issue.
Related issues
- is related to WT-6729 Quiesce eviction prior running rollback to stable's active transaction check (Closed)