-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Replication
-
None
-
Fully Compatible
-
v4.9, v4.4, v4.2, v4.0, v3.6
-
Repl 2021-05-03
-
5
SERVER-54805 describes a case where the replication machinery on replica set secondaries ceases to make progress, with the symptom being that all threads in the replication writer thread pool are idle and the thread driving secondary replication is simultaneously blocked waiting for those writer threads to finish their work.
So far, this behavior has only manifested on systems with glibc versions susceptible to this glibc pthread condition variable bug. While I have not been able to build a minimal reproduction using our ThreadPool type, the scenario proven to exist in this blog post about using TLA+ to model glibc condition variables is perfectly analogous to how replication uses thread pools. In this scenario, a signal that is lost due to the glibc bug leaves incomplete work in the thread pool, and no threads wake up to perform it.
Fortunately, there is a low-risk workaround for this bug as it manifests in the replication system's use of ThreadPool. By setting minThreads to 0 instead of its current value, which is equal to maxThreads, we ensure that every wait performed by a worker thread eventually wakes up when the idle thread timeout expires, even if the signal that should have woken it was lost.
The task in this ticket is to change the value of minThreads in the writer thread pool used by replication to 0. This will not eliminate all possible failures due to the glibc bug, but it will eliminate the only one we've seen in practice until such time as the bug in glibc is corrected.
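To illustrate why a timed wait sidesteps the lost signal, here is a minimal sketch. It is not MongoDB's ThreadPool and does not reproduce the glibc bug itself: `ToyWorker`, `dropSignal`, and `lostSignalRecovered` are hypothetical names, the 50 ms timeout is arbitrary, and the lost signal is simulated by simply not notifying the condition variable. Because the idle wait has a timeout, the worker re-checks the queue and picks up the task anyway.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical single-worker toy pool; a stand-in for illustration only.
class ToyWorker {
public:
    explicit ToyWorker(std::chrono::milliseconds idleTimeout)
        : _idleTimeout(idleTimeout), _thread([this] { run(); }) {}

    ~ToyWorker() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_all();
        _thread.join();
    }

    // Enqueue a task. If dropSignal is true, skip the notify to simulate
    // a condition variable signal that never reaches the waiter.
    void schedule(std::function<void()> task, bool dropSignal) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _tasks.push_back(std::move(task));
        }
        if (!dropSignal) {
            _cv.notify_one();
        }
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_shutdown) {
            if (_tasks.empty()) {
                // Timed wait: returns on a notify or once the timeout
                // expires, after which the loop re-checks the queue.
                _cv.wait_for(lk, _idleTimeout);
                continue;
            }
            auto task = std::move(_tasks.front());
            _tasks.pop_front();
            lk.unlock();
            task();
            lk.lock();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    bool _shutdown = false;
    std::chrono::milliseconds _idleTimeout;
    std::thread _thread;  // declared last so it starts after the other members exist
};

// Returns true if a task scheduled without any notify still runs.
bool lostSignalRecovered() {
    using namespace std::chrono;
    std::atomic<bool> ran{false};
    {
        ToyWorker worker(milliseconds(50));
        std::this_thread::sleep_for(milliseconds(100));  // give the worker time to go idle
        worker.schedule([&ran] { ran = true; }, /*dropSignal=*/true);
        const auto deadline = steady_clock::now() + seconds(2);
        while (!ran.load() && steady_clock::now() < deadline) {
            std::this_thread::sleep_for(milliseconds(10));
        }
    }
    return ran.load();
}
```

If the `wait_for` call were replaced with an untimed `wait`, which is roughly the situation when minThreads equals maxThreads and no worker is ever subject to the idle timeout, the task scheduled with `dropSignal` set would remain queued until some unrelated notify arrived.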
- is depended on by
-
SERVER-54805 Mongo become unresponsive, Spike in Connections and FD
- Closed
- is duplicated by
-
SERVER-54805 Mongo become unresponsive, Spike in Connections and FD
- Closed
-
SERVER-60164 db.serverStatus() hang on SECONDARY server #secondtime
- Closed
-
SERVER-63402 High query response time for find operation in mongo 4.0.27 with mmap storage engine with random intervals (5/7/12/20 hours)
- Closed
- is related to
-
SERVER-56784 The replication thread of secondary hang up
- Closed
-
SERVER-92554 Consider lowering maxIdleThreadAge for oplog applier thread pool
- Open
-
SERVER-92557 Add better diagnostics to identify cases of lost condition variable signal in oplog applier thread pool
- Open