- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
Replication
Right now the replication writer (oplog applier) thread pool uses the default idle-thread timeout of 30s, but our thread pool type supports configuring a custom value.
With the change in SERVER-56054, if we encounter a missed pthread_cond_signal in glibc, the pool's threads can end up waiting up to 30s before they wake. This is a long time and can have negative consequences for a cluster: if the affected node is holding up the majority commit point, flow control can engage and write latency can increase. One example is a 3-node chained replication configuration where the secondary that syncs from the primary hits this bug. Note that sync source selection considers a node within 30s of the primary eligible, so the lag will not prompt the downstream node to switch to syncing from the primary in that situation.
We should consider lowering this value (or allowing a user to configure a lower value). Something in the realm of 10s (maybe slightly less) could be a reasonable choice to try to prevent this situation from triggering flow control.
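To illustrate why the timeout bounds the stall, here is a minimal Python sketch (not MongoDB code; the function name `delay_after_lost_signal` and the task value are hypothetical). A worker waits on a condition variable with a timeout, and the producer enqueues work without signaling, which models the lost pthread_cond_signal. Under these assumptions the task sits in the queue for roughly the timeout, so a 10s timeout would cap the stall at about 10s instead of 30s. The timeouts below are scaled down so the sketch runs quickly.

```python
import threading
import time
from collections import deque


def delay_after_lost_signal(idle_timeout_s: float) -> float:
    """Return how long a queued task waits when its wake-up signal is lost."""
    cond = threading.Condition()
    tasks = deque()
    picked_up_at = []
    waiting = threading.Event()

    def worker():
        with cond:
            waiting.set()
            # Timed wait: even with no notify, the worker re-checks the queue
            # once idle_timeout_s has elapsed.
            while not tasks:
                cond.wait(timeout=idle_timeout_s)
            tasks.popleft()
            picked_up_at.append(time.monotonic())

    thread = threading.Thread(target=worker)
    thread.start()
    waiting.wait()
    # The lock only becomes available once the worker is inside cond.wait().
    with cond:
        tasks.append("batch")
        enqueued_at = time.monotonic()
        # Deliberately no cond.notify() here: this models the lost signal.
    thread.join()
    return picked_up_at[0] - enqueued_at


if __name__ == "__main__":
    # Scaled-down stand-ins for a shorter and a longer idle timeout.
    for timeout_s in (0.1, 0.3):
        delay = delay_after_lost_signal(timeout_s)
        print(f"timeout={timeout_s:.1f}s -> task waited {delay:.3f}s")
```

The sketch only shows the worst-case bound; it does not model how the real pool retires or respawns idle threads.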
- is related to: SERVER-92557 Add better diagnostics to identify cases of lost condition variable signal in oplog applier thread pool (Open)
- related to: SERVER-56054 Change minThreads value for replication writer thread pool to 0 (Closed)