Type: Bug
Resolution: Fixed
Priority: Major - P3
Affects Version/s: 4.5.1, 4.4.0-rc1
Component/s: Replication
Fully Compatible
ALL
v4.4
Repl 2020-05-04
42
After a node has been elected primary and has drained the ops from its buffer, it checks whether it needs to run a reconfig to increment its config term. It performs this check under the replication coordinator mutex, but releases the lock before running the reconfig. If a force reconfig is running concurrently, it may install a new config with term -1 after the check and lock release but before the reconfig runs. In that case, the node tries to run a reconfig that sets the config version to the version installed by the force reconfig and the config term to the node's current term. If the force reconfig installed version 'version' and the node's current term is 'term', the node runs a reconfig to (version, term) while its current config is (version, -1). Since terms are ignored in config comparison when either term is -1, only the versions are compared, and they are equal, so the new config fails the validation check that it must be newer than the current config. The reconfig returns this error and the node then fasserts.
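The comparison rule described above can be illustrated with a minimal sketch. The names `ConfigVersionAndTerm`, `UNINITIALIZED_TERM`, and `is_newer` are hypothetical and chosen for illustration only; this is not the server's actual code, and it assumes that when both terms are set the term is compared before the version.

```python
from dataclasses import dataclass

UNINITIALIZED_TERM = -1  # term installed by a force reconfig


@dataclass(frozen=True)
class ConfigVersionAndTerm:
    version: int
    term: int


def is_newer(new: ConfigVersionAndTerm, current: ConfigVersionAndTerm) -> bool:
    """Return True if `new` is strictly newer than `current`."""
    # If either term is -1, terms are ignored and only versions are compared.
    if UNINITIALIZED_TERM in (new.term, current.term):
        return new.version > current.version
    # Assumption: otherwise compare by term first, then by version.
    return (new.term, new.version) > (current.term, current.version)


# Config installed by the concurrent force reconfig: (version, -1).
current = ConfigVersionAndTerm(version=5, term=UNINITIALIZED_TERM)
# Config the new primary then tries to install: (version, term).
proposed = ConfigVersionAndTerm(version=5, term=3)

# Versions are equal and terms are ignored, so the proposed config is
# not considered newer and validation rejects it.
rejected = not is_newer(proposed, current)
```

Without the interleaved force reconfig, the current config would still carry a real term, so the same proposal with a higher term would compare as newer.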
To address this, we may want to consider preventing force reconfigs from running on a node while it is in drain mode. Non-force reconfigs should already be prevented in this state, since we check canAcceptNonLocalWrites, but force reconfigs bypass these checks because they can run on a secondary.
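The interleaving and the suggested mitigation can be sketched as a small deterministic simulation. Everything here (`Node`, `force_reconfig`, `term_bump_reconfig`, the `reject_while_draining` flag) is a hypothetical model written for illustration, not the replication coordinator's API, and it does not claim to reflect the fix that was ultimately merged.

```python
def is_newer(new, cur):
    """Compare (version, term) pairs; ignore terms if either is -1."""
    (nv, nt), (cv, ct) = new, cur
    if nt == -1 or ct == -1:
        return nv > cv
    return (nt, nv) > (ct, cv)


class Node:
    def __init__(self):
        self.config = (1, 1)   # (version, term) of the installed config
        self.term = 2          # term in which this node was elected
        self.draining = True   # node is still in drain mode

    def needs_term_bump(self):
        # In the real system this check happens under the mutex.
        return self.config[1] != self.term

    def force_reconfig(self, reject_while_draining):
        # Suggested mitigation: refuse force reconfigs during drain mode.
        if reject_while_draining and self.draining:
            return False
        self.config = (self.config[0] + 1, -1)  # force reconfig sets term -1
        return True

    def term_bump_reconfig(self):
        # Keeps the current version and sets the node's current term.
        new = (self.config[0], self.term)
        ok = is_newer(new, self.config)
        if ok:
            self.config = new
        return ok  # False models the validation error that leads to fassert


def run(reject_while_draining):
    node = Node()
    assert node.needs_term_bump()               # check, then lock released
    node.force_reconfig(reject_while_draining)  # concurrent force reconfig
    return node.term_bump_reconfig()            # reconfig runs afterwards


race_outcome = run(reject_while_draining=False)   # validation fails
guarded_outcome = run(reject_while_draining=True) # validation passes
```

In this model, rejecting the force reconfig during drain mode leaves the config's term intact, so the later (version, term) reconfig compares as newer and succeeds.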
related to: SERVER-47142 Check primary before writing replset config and no-op (Closed)