Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.0.0-rc0
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2024-03-04
Linked BF Score: 34
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

TLDR: With the current ThreadPool implementation, the relationship between `_numIdleThreads` and the size of `_threads` is not tracked correctly. Due to this inconsistency, ThreadPool::waitForIdle() might occasionally hang, or return prematurely without draining the pending tasks, when run concurrently with ThreadPool::join(). I found this issue during code inspection.

For instance, consider the following scenario. Let's say the thread pool size is 2 and the thread pool shuts down with pending tasks remaining. In that case, the worker threads lend a hand in draining those tasks. They go through a special code path that can leave _numIdleThreads at 0 while the _threads size is still 2. Meanwhile, if join() is also invoked, it spawns a cleanup thread to drain the pending tasks. However, this cleanup thread is tracked in _numIdleThreads but not in _threads. If the cleanup thread happens to be the one that finishes the last pending task, a situation may arise where _numIdleThreads is 1 (accounting for the cleanup thread) but the _threads size is still 2 (accounting for the worker threads), so the _poolIsIdle CV is not signaled. If waitForIdle() races with join() (that is, it is called after this point), it could hang.

Note that both the cleanup thread and the worker threads concurrently drain pending tasks upon thread pool shutdown. Consequently, the cleanup thread might incorrectly perceive that all pending tasks are drained, even though the worker threads are still processing them. This could cause the _threads size to be reset. If a worker thread then happens to be the one that finishes the last pending task, we could end up with _numIdleThreads at 2 (accounting for the worker threads) but a _threads size of 0, resulting in waitForIdle() hanging.
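The two hang scenarios above can be modeled with a small C++ sketch. It assumes the idle signal fires only when the queue is empty and the idle count equals the tracked thread count. `PoolCounters`, `wouldSignalIdle`, and the two scenario helpers are hypothetical names introduced for illustration, and the real check in ThreadPool may be shaped differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the pool's bookkeeping. Field names mirror the
// ticket, but this is not the real ThreadPool class.
struct PoolCounters {
    std::size_t threadsSize;     // models _threads.size(): worker threads only
    std::size_t numIdleThreads;  // models _numIdleThreads: may include the cleanup thread
    std::size_t pendingTasks;    // tasks still queued
};

// Assumed idle check: _poolIsIdle is signaled only when the queue is empty
// and the idle count matches the number of tracked threads.
bool wouldSignalIdle(const PoolCounters& c) {
    return c.pendingTasks == 0 && c.numIdleThreads == c.threadsSize;
}

// First scenario: both workers are on the shutdown drain path (not counted
// as idle) and the cleanup thread finishes the last pending task.
PoolCounters cleanupThreadFinishesLast() {
    return PoolCounters{2, 1, 0};
}

// Second scenario: the cleanup thread has already reset _threads, and a
// worker finishes the last pending task afterwards.
PoolCounters workerFinishesLastAfterReset() {
    return PoolCounters{0, 2, 0};
}
```

In both scenarios the queue is empty, yet `wouldSignalIdle` returns false, so under this assumption a thread blocked in waitForIdle() would never be woken.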

Because _numIdleThreads includes both the cleanup thread and the worker threads, whereas _threads only includes the worker threads, waitForIdle() may also return prematurely, before all the pending tasks have been drained.
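A minimal sketch of how the premature return could arise, again assuming an equality-style idle check. `IdleSnapshot`, `assumedIdleCheck`, and `trulyIdle` are hypothetical names, and the `tasksInFlight` field is an illustrative assumption about a worker that has dequeued a task but not yet finished it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical snapshot of pool state at one instant during shutdown.
struct IdleSnapshot {
    std::size_t threadsSize;         // models _threads.size(): worker threads only
    std::size_t idleWorkers;         // workers currently counted as idle
    std::size_t idleCleanupThreads;  // cleanup threads counted as idle (not in _threads)
    std::size_t tasksInFlight;       // dequeued but still executing (assumed)
    std::size_t pendingTasks;        // tasks still queued
};

// Assumed check: the idle count mixes workers and the cleanup thread, and is
// compared against a size that only covers workers.
bool assumedIdleCheck(const IdleSnapshot& s) {
    std::size_t numIdleThreads = s.idleWorkers + s.idleCleanupThreads;
    return s.pendingTasks == 0 && numIdleThreads == s.threadsSize;
}

// What a caller of waitForIdle() actually expects: no work queued or running.
bool trulyIdle(const IdleSnapshot& s) {
    return s.pendingTasks == 0 && s.tasksInFlight == 0;
}
```

With two workers, one of them idle, the cleanup thread idle, and the other worker still running a task, the snapshot `{2, 1, 1, 1, 0}` passes `assumedIdleCheck` while `trulyIdle` is false.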

I noticed that we addressed a similar hanging issue with ~~SERVER-53477~~. However, that fix was specific to that case and didn't address the root cause. It's also incorrect because waitForIdle() could return prematurely without draining the pending tasks.

Implications: The tenant migration hang mentioned in ~~SERVER-53477~~ can still happen, and it affects shard merge as well.

is related to

SERVER-53477 ThreadPool::waitForIdle should be interruptible on thread pool shutdown()

Closed

related to

SERVER-87327 remove unused TaskRunner kKeepOperationContext behavior

Closed

Assignee: Suganthi Mani
Reporter: Suganthi Mani
Participants: Githook User, Suganthi Mani
Votes: 0
Watchers: 3

Created: Feb 27 2024 02:02:52 AM UTC
Updated: Mar 08 2024 01:17:00 AM UTC
Resolved: Mar 01 2024 02:54:39 PM UTC
