- Type: Bug
- Resolution: Incomplete
- Priority: Major - P3
- Affects Version/s: 4.0.18, 4.0.19
- Component/s: Replication
- Operating System: ALL
We have a Primary-Secondary-Arbiter setup on version 4.0.18 (upgraded to 4.0.19, but the problem persists) that has started exhibiting very strange behavior. The three nodes are called Osiris, Horus and Isis (the Arbiter). Every few minutes (between 2 and 10 minutes, judging by the log) we get the following in the log:
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.734+0300 I REPL [replexec-6] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] can't see a majority of the set, relinquishing primary
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] Stepping down from primary in response to heartbeat
2020-06-15T18:56:29.735+0300 I REPL [replexec-6] transition to SECONDARY from PRIMARY
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 40
2020-06-15T18:56:29.735+0300 I NETWORK [replexec-6] Skip closing connection for connection # 35
2020-06-15T18:56:29.735+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:56:29.735+0300 I REPL [replexec-1] Member Isis.local:27017 is now in state ARBITER
...
2020-06-15T18:56:30.236+0300 I REPL [replexec-1] Member osiris.local:27017 is now in state PRIMARY
...
2020-06-15T18:56:31.225+0300 I REPL [rsBackgroundSync] sync source candidate: osiris.local:27017
2020-06-15T18:56:31.227+0300 I REPL [rsBackgroundSync] Changed sync source from empty to osiris.local:27017
2020-06-15T18:56:31.227+0300 I ASIO [RS] Connecting to osiris.local:27017
...
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] Starting an election, since we've seen no PRIMARY in the past 10000ms
2020-06-15T18:57:29.753+0300 I REPL [replexec-0] conducting a dry run election to see if we could be elected. current term: 1136
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to Isis.local:27017
2020-06-15T18:57:29.754+0300 I ASIO [Replication] Connecting to osiris.local:27017
2020-06-15T18:57:29.754+0300 I REPL [replexec-1] VoteRequester(term 1136 dry run) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1136, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] VoteRequester(term 1136 dry run) received a yes vote from osiris.local:27017; response message: { term: 1136, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.754+0300 I REPL [replexec-0] dry election run succeeded, running for election in term 1137
2020-06-15T18:57:29.755+0300 I REPL [replexec-6] VoteRequester(term 1137) received a no vote from Isis.local:27017 with reason "can see a healthy primary (osiris.local:27017) of equal or greater priority"; response message: { term: 1137, voteGranted: false, reason: "can see a healthy primary (osiris.local:27017) of equal or greater priority", ok: 1.0 }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] VoteRequester(term 1137) received a yes vote from osiris.local:27017; response message: { term: 1137, voteGranted: true, reason: "", ok: 1.0, operationTime: Timestamp(1592236637, 1), $clusterTime: { clusterTime: Timestamp(1592236637, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
2020-06-15T18:57:29.759+0300 I REPL [replexec-1] election succeeded, assuming primary role in term 1137
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] transition to PRIMARY from SECONDARY
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Resetting sync source to empty, which was osiris.local:27017
2020-06-15T18:57:29.760+0300 I REPL [replexec-1] Entering primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Member osiris.local:27017 is now in state SECONDARY
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Caught up to the latest optime known via heartbeats after becoming primary. Target optime: { ts: Timestamp(1592236637, 1), t: 1136 }. My Last Applied: { ts: Timestamp(1592236637, 1), t: 1136 }
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Exited primary catch-up mode.
2020-06-15T18:57:29.761+0300 I REPL [replexec-0] Stopping replication producer
2020-06-15T18:57:29.761+0300 I REPL [ReplBatcher] Oplog buffer has been drained in term 1137
2020-06-15T18:57:29.761+0300 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2020-06-15T18:57:29.761+0300 I CONNPOOL [RS] Ending connection to host osiris.local:27017 due to bad connection status; 1 connections to that host remain open
2020-06-15T18:57:29.767+0300 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
...
2020-06-15T18:58:29.734+0300 I REPL [replexec-7] Member Isis.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Member osiris.local:27017 is now in state RS_DOWN
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] can't see a majority of the set, relinquishing primary
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] Stepping down from primary in response to heartbeat
2020-06-15T18:58:29.735+0300 I REPL [replexec-7] transition to SECONDARY from PRIMARY
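The "no PRIMARY in the past 10000ms" line matches the default election timeout, so it looks as if the heartbeats between the members are timing out roughly once a minute. A minimal sketch of how the timing settings and the last heartbeat times can be checked from the mongo shell (assuming a connection to Horus; the fields shown are the standard rs.conf()/rs.status() ones, nothing taken from our logs):

// Defaults are heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10,
// electionTimeoutMillis: 10000 -- the last one matches the "10000ms" message above.
var cfg = rs.conf();
printjson(cfg.settings);

// When was a heartbeat last sent to / received from each member, and how slow is the link?
rs.status().members.forEach(function (m) {
    print(m.name, m.stateStr,
          "lastHeartbeat:", m.lastHeartbeat,
          "lastHeartbeatRecv:", m.lastHeartbeatRecv,
          "pingMs:", m.pingMs);
});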
And this repeats forever. It looks as if some timer is killing the connections to the other replica set members, which causes the switch between primary and secondary, but we have no way to diagnose this. This setup has been running for about 10 years. The only thing we have changed recently is moving the Horus virtual machine to different, faster server hardware. The problem causes all clients to lose their connections every few minutes and in some cases crashes some services.
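Since the step-downs appear to happen on a regular schedule, a small watch loop can help correlate the state changes with wall-clock time and with whatever periodic job might be killing the connections. This is only a sketch (mongo shell, run against Horus; the 5-second polling interval is an arbitrary choice of ours):

var last = {};
while (true) {
    rs.status().members.forEach(function (m) {
        if (last[m.name] !== m.stateStr) {
            // Print each state transition with a local timestamp so it can be
            // matched against the mongod log and against any cron jobs or timers.
            print(new Date().toISOString(), m.name, (last[m.name] || "?") + " -> " + m.stateStr);
            last[m.name] = m.stateStr;
        }
    });
    sleep(5000); // sleep() is built into the mongo shell; stop the loop with Ctrl-C
}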
Attached are the log files from all servers.