Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- repl-shortlist

Assigned Teams:

Replication
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Right now the replication writer thread pool uses the default value of 30s, but our thread pool type supports configuring a custom value.

The change in ~~SERVER-56054~~ means that if we encounter a missed pthread_cond_signal in glibc, we can end up waiting up to 30s for the threads to wake. This is a long time and can have negative consequences for a cluster, e.g. flow control engaging and increased write latency if the affected node is holding up the majority commit point. One such case is a 3-node chained replication configuration in which the secondary that syncs from the primary hits this bug. Note that sync source selection considers a node within 30s of the primary eligible, so the lag will not prompt us to switch to syncing from the primary in that situation.
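To illustrate why the wait timeout bounds the damage from a lost signal, here is a minimal standalone sketch. It is not the server's thread pool code: `delayWithLostSignal` and all names in it are hypothetical, and it uses `std::condition_variable` with a deliberately omitted notify to stand in for the missed pthread_cond_signal.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical helper: returns how long a queued task sat unnoticed when the
// wake-up signal for it is never delivered to the waiting worker.
Millis delayWithLostSignal(Millis idleTimeout) {
    std::mutex mtx;
    std::condition_variable cv;
    bool taskPending = false;
    bool workerWaiting = false;
    Clock::time_point enqueuedAt, pickedUpAt;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lk(mtx);
        workerWaiting = true;
        // Timed wait: with no notify, only the timeout (or a spurious
        // wakeup) lets the worker re-check for pending work.
        while (!taskPending) {
            cv.wait_for(lk, idleTimeout);
        }
        pickedUpAt = Clock::now();
    });

    // Enqueue once the worker is blocked, but never call cv.notify_one().
    // This simulates the lost signal.
    bool enqueued = false;
    while (!enqueued) {
        {
            std::lock_guard<std::mutex> lk(mtx);
            if (workerWaiting) {
                taskPending = true;
                enqueuedAt = Clock::now();
                enqueued = true;
            }
        }
        if (!enqueued) std::this_thread::yield();
    }

    worker.join();
    return std::chrono::duration_cast<Millis>(pickedUpAt - enqueuedAt);
}
```

Under these assumptions the pickup delay is bounded by roughly `idleTimeout` plus scheduling overhead, so a call such as `delayWithLostSignal(Millis(30000))` can stall for close to 30s while `delayWithLostSignal(Millis(10000))` stalls for at most about 10s.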

We should consider lowering this value (or allowing a user to configure a lower value). Something in the realm of 10s (maybe slightly less) could be a reasonable choice to try to prevent this situation from triggering flow control.
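As a rough sketch of the two options above (a lower built-in value, or a user-configurable one), the selection logic could look like the following. `PoolOptions`, `idleWaitTimeout`, `kProposedTimeout` and `makeWriterPoolOptions` are all hypothetical names for illustration and are not the server's actual types or parameters; the 10s figure is only the ballpark suggested in this ticket.

```cpp
#include <chrono>
#include <optional>

using std::chrono::seconds;

// Hypothetical stand-in for a thread pool's options, not the real type.
struct PoolOptions {
    seconds idleWaitTimeout{30};  // the generic 30s default described here
};

// Hypothetical lowered value for this one pool (assumption: about 10s).
constexpr seconds kProposedTimeout{10};

// Uses the user-supplied value when one is given, otherwise falls back to
// the proposed lower value instead of the generic 30s default.
PoolOptions makeWriterPoolOptions(std::optional<seconds> userValue = std::nullopt) {
    PoolOptions opts;
    opts.idleWaitTimeout = userValue.value_or(kProposedTimeout);
    return opts;
}
```

Keeping the generic default at 30s and overriding it only where this pool is constructed would leave other users of the thread pool type unaffected; whether that is the right trade-off is a design decision for the ticket.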

is related to

SERVER-92557 Add better diagnostics to identify cases of lost condition variable signal in oplog applier thread pool

Open

related to

SERVER-56054 Change minThreads value for replication writer thread pool to 0

Closed

Assignee: Unassigned
Reporter: Kaitlin Mahar
Participants: Kaitlin Mahar, Kelsey Schubert
Votes: 0
Watchers: 13

Created: Jul 17 2024 06:28:01 PM UTC
Updated: Jan 17 2025 08:26:45 PM UTC
