- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Replication
While investigating how the replica set config is written and read for SERVER-96005, we noticed a potential race condition in the config access pattern from higher layers.
Currently, replica set config reads are lock-free (SERVER-89631). When getConfig() is called on the replication coordinator, it checks whether the cached config is stale and, if so, refreshes the cached value. The replication coordinator's field-specific config getters are built on this getter. The config itself is updated under the replication coordinator mutex. This approach ensures that callers of the field getters always see the most recent value at the time of the call. Within replication, there is also a getConfig(WithLock) function that lets callers ensure that fields remain unchanged until the lock is released. If a function relies on a particular config field staying the same, it should use this function and read the desired field directly off the returned config object reference.
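The two access patterns described above can be sketched roughly as follows. This is a simplified illustration, not the server's actual implementation: `Config`, `Coordinator`, `CachedReader`, `generation()` and `getMemberHost()` are hypothetical stand-ins, and the staleness check is modeled with a plain generation counter, which is an assumption about the mechanism.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; not MongoDB's real types.
struct Config {
    int version = 0;
    std::vector<std::string> memberHosts;
};

// Proof-of-lock token: constructible only from a lock the caller already holds.
class WithLock {
public:
    WithLock(const std::unique_lock<std::mutex>&) {}
};

class Coordinator {
public:
    std::unique_lock<std::mutex> lock() {
        return std::unique_lock<std::mutex>(_mutex);
    }

    // Writers replace the config under the mutex, then bump the generation.
    void setConfig(WithLock, Config next) {
        _config = std::move(next);
        _generation.fetch_add(1, std::memory_order_release);
    }

    // Stable view: the referenced config cannot change until the caller
    // releases the lock it proved it holds.
    const Config& getConfig(WithLock) const { return _config; }

    std::uint64_t generation() const {
        return _generation.load(std::memory_order_acquire);
    }

private:
    std::mutex _mutex;
    Config _config;
    std::atomic<std::uint64_t> _generation{0};
};

// Per-thread reader (assumed design): takes the mutex only when its cached
// copy is stale, so the common-case read acquires no lock.
class CachedReader {
public:
    explicit CachedReader(Coordinator& coord) : _coord(coord) {}

    const Config& getConfig() {
        if (!_primed || _coord.generation() != _seenGeneration) {
            auto lk = _coord.lock();
            _cached = _coord.getConfig(lk);
            _seenGeneration = _coord.generation();
            _primed = true;
        }
        return _cached;
    }

    // Field getter: fresh at the time of the call, with no guarantee afterwards.
    std::string getMemberHost(std::size_t i) {
        return getConfig().memberHosts.at(i);
    }

private:
    Coordinator& _coord;
    Config _cached;
    std::uint64_t _seenGeneration = 0;
    bool _primed = false;
};
```

In this sketch, a caller that needs a field to stay fixed across several steps would hold the lock returned by `lock()` and read from `getConfig(WithLock)`. A caller of `getMemberHost()` only gets a value that was current when the call returned.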
However, there are cases where higher layers rely on replica set config fields and use the getter functions to retrieve them. One example is in addShard, which validates the hostname field of the replica set members, but nothing prevents the command from racing with a concurrent reconfig that modifies this field.
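To make the time-of-check to time-of-use window concrete, here is a small deterministic sketch. It does not reproduce the addShard code: `ConfigHolder`, `racyValidateThenUse`, `atomicValidateThenUse` and the `between` callback (which simulates a concurrent reconfig landing inside the window) are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a config with a single hostname field.
class ConfigHolder {
public:
    // Getter-style read: correct at the instant of the call only.
    std::string getHost() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _host;
    }

    void reconfig(std::string host) {
        std::lock_guard<std::mutex> lk(_mutex);
        _host = std::move(host);
    }

    // Runs fn against the field while holding the mutex, so a reconfig
    // cannot interleave between the check and the use inside fn.
    template <typename Fn>
    auto withConfig(Fn&& fn) const {
        std::lock_guard<std::mutex> lk(_mutex);
        return fn(_host);
    }

private:
    mutable std::mutex _mutex;
    std::string _host;
};

// Racy pattern: the check and the use are two independent reads.
bool racyValidateThenUse(ConfigHolder& cfg,
                         const std::string& expected,
                         const std::function<void()>& between,
                         std::string& used) {
    if (cfg.getHost() != expected) {  // time of check
        return false;
    }
    between();             // a concurrent reconfig may land here
    used = cfg.getHost();  // time of use: may no longer match `expected`
    return true;
}

// One possible shape of a fix: check and use observe the same locked view.
bool atomicValidateThenUse(const ConfigHolder& cfg,
                           const std::string& expected,
                           std::string& used) {
    return cfg.withConfig([&](const std::string& host) {
        if (host != expected) {
            return false;
        }
        used = host;
        return true;
    });
}
```

With a `between` callback that calls `reconfig("b:27017")`, the racy version reports that validation of `"a:27017"` succeeded while actually using `"b:27017"`. Whether higher layers should hold the lock at all, or instead receive a narrower snapshot-style API, is the open design question of this ticket.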
This ticket is to think broadly about how we can refactor our config API so that we limit higher layer access to only the most necessary fields, and address this time-of-check to time-of-use bug in the process.
- related to
  - SERVER-96142 SharedReplSetConfig::setConfig() should use WithLock (Backlog)
  - SERVER-96005 Delete unused getConfig* methods (Closed)
  - SERVER-89631 Make reading `ReplSetConfig` mostly lock-free (Closed)