-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.2.8, 3.4.2
-
Component/s: Replication, Stability
-
None
-
ALL
-
We've hit a bug that has made our entire MongoDB cluster (15 bare-metal replica sets, each with 2 members plus an arbiter) unresponsive several times.
Whenever something makes the mongod process hang, the cluster gets stuck too. The replication heartbeat should detect this condition and trigger a primary switch.
In our case, we had IO issues that left the mongod process blocked waiting for IO and unresponsive: no queries could be performed against that member, because they all hung in IO wait.
When that happens, the faulty member is not removed from the replica set and no primary switch takes place, so the faulty member remains primary and the whole replica set becomes unresponsive.
What is worse, if one replica set is stuck in this way, the whole cluster is stuck, and all queries made through mongos hang.
According to the documentation, the heartbeat does a ping. I'm not sure what kind of ping, but it is not enough to detect a bad member: if the problem is IO (one of the main problems in databases), the ping and even a TCP connection still work.
The heartbeat should do something more sophisticated, such as performing a query that reads from disk (importantly, not from cached memory).
Also, when a member is completely stuck and unresponsive, it is probably worth considering removing it from the replica set rather than just transitioning it to secondary, because all the replication threads between the primary and the secondary will hang.
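
To illustrate the kind of disk-aware health check I have in mind, here is a minimal sketch in Python. It is not how mongod implements its heartbeat. The function name `disk_probe`, its parameters, and the timeout value are all hypothetical. Note that it uses a small write plus `fsync` instead of an uncached read, since bypassing the page cache for reads is platform-specific. The important part is that the IO runs in a separate thread with a deadline, so a hung disk is reported as a failure instead of hanging the check itself.

```python
import os
import tempfile
import threading


def disk_probe(path: str, timeout_s: float = 2.0) -> bool:
    """Hypothetical health check: True only if real disk IO under `path`
    completes within `timeout_s` seconds."""
    done = threading.Event()
    result = {"ok": False}

    def _io() -> None:
        try:
            # Create a scratch file in the directory being checked.
            fd, name = tempfile.mkstemp(dir=path)
            try:
                os.write(fd, b"probe")
                # fsync forces the write to the device; this is the call
                # that would block if the disk is stuck.
                os.fsync(fd)
                os.lseek(fd, 0, os.SEEK_SET)
                # The read-back may be served from the page cache; it only
                # verifies that the data round-trips.
                result["ok"] = os.read(fd, 5) == b"probe"
            finally:
                os.close(fd)
                os.unlink(name)
        except OSError:
            result["ok"] = False
        finally:
            done.set()

    # Daemon thread: if the IO never returns, the caller still gets an
    # answer after timeout_s and the process can exit normally.
    threading.Thread(target=_io, daemon=True).start()
    return done.wait(timeout_s) and result["ok"]


if __name__ == "__main__":
    print(disk_probe(tempfile.gettempdir()))
```

A member whose probe fails or times out would answer the heartbeat as unhealthy, which would let the other members elect a new primary even though ping and TCP connections still succeed.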
- duplicates: SERVER-14139 Disk failure on one node can (eventually) block a whole cluster (Closed)