- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.4, v4.2, v4.0
- 16
If a node receives a heartbeat reconfig and can't find itself in the new config due to a network issue, it sets TopologyCoordinator::_selfIndex to -1 and logs a message like:
Cannot find self in new replica set configuration; I must be removed{"error":{"code":74,"codeName":"NodeNotFound","errmsg":"No host described in new configuration with {version: 3, term: 1} for replica set server7781-configRS maps to this node"}}
If TopologyCoordinator::processReplSetRequestVotes then receives a request with the correct config term and version, the request passes the check added in SERVER-46387, and the node goes on to check whether _selfConfig().isArbiter(). The node crashes on an invariant failure in _selfConfig() because _selfIndex is -1.
The root cause is a network problem that prevents the node from finding itself in the config. We've observed mysterious DNS issues in EC2 that sometimes prevent mongod from resolving its own address in repl::isSelf(); the build failure I'm debugging may be an example of that. Regardless, we must prevent any scenario that uses -1 as a member index.
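
One way to avoid the crash is to check for a missing self index before any code path reaches _selfConfig(). The following is a minimal, self-contained sketch of that idea, not the actual fix or the real TopologyCoordinator code. TopologySketch, MemberConfig, VoteResponse, processVoteRequest and the reason strings are all hypothetical stand-ins, and the assert only imitates the invariant described above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the real MongoDB types.
struct MemberConfig {
    std::string host;
    bool arbiter;
};

struct VoteResponse {
    bool voteGranted;
    std::string reason;
};

class TopologySketch {
public:
    // selfIndex == -1 models "could not find self in the config".
    TopologySketch(std::vector<MemberConfig> members, int selfIndex)
        : _members(std::move(members)), _selfIndex(selfIndex) {}

    // Imitates the invariant in _selfConfig(): calling this while
    // _selfIndex is -1 aborts the process (when asserts are enabled).
    const MemberConfig& selfConfig() const {
        assert(_selfIndex >= 0 &&
               _selfIndex < static_cast<int>(_members.size()));
        return _members[_selfIndex];
    }

    // Assumed shape of a vote handler: reject early when this node is
    // not in the config, so selfConfig() is never reached with -1.
    VoteResponse processVoteRequest() const {
        if (_selfIndex == -1) {
            return {false, "node is not a member of its current config"};
        }
        // Safe to dereference now. The real arbiter-specific checks are
        // omitted; this placeholder only records which branch was taken.
        if (selfConfig().arbiter) {
            return {true, "arbiter checks would run here"};
        }
        return {true, ""};
    }

private:
    std::vector<MemberConfig> _members;
    int _selfIndex;
};
```

With this guard, a node whose self index is -1 declines the vote instead of hitting the invariant. Whether a removed node should respond with a "no" vote or with an error is a design decision this sketch does not settle.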