Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
Assignee: Server Triage
Environment: ALL
In a 3-node replica set (no sharding), at present:
p1 - secondary
p2 - primary
dr - secondary
When I connect using mongosh on each of those nodes:
On p1: mongosh <--all-envs> --port 7253
it takes close to 8-10 seconds to get the mongosh prompt
rs.status()
members: [ { _id: 0, name: 'p1:7253', health: 1, state: 2, stateStr: 'SECONDARY', uptime: 367399, optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") }, optimeDate: ISODate("2023-08-28T07:15:07.000Z"), lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"), syncSourceHost: '', syncSourceId: -1, infoMessage: 'Could not find member to sync from', configVersion: 19, configTerm: 47, self: true, lastHeartbeatMessage: '' }, { _id: 1, name: 'p2:7253', health: 0, state: 8, stateStr: '(not reachable/healthy)', uptime: 0, optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") }, optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") }, optimeDate: ISODate("1970-01-01T00:00:00.000Z"), optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"), lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastHeartbeat: ISODate("2023-08-28T07:15:29.974Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:19:31.951Z"), pingMs: Long("0"), lastHeartbeatMessage: "Couldn't get a connection within the time limit", syncSourceHost: '', syncSourceId: -1, infoMessage: '', configVersion: 19, configTerm: 47 }, { _id: 2, name: 'dr:7253', health: 0, state: 8, stateStr: '(not reachable/healthy)', uptime: 0, optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") }, optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") }, optimeDate: ISODate("1970-01-01T00:00:00.000Z"), optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"), lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastHeartbeat: ISODate("2023-08-31T03:50:33.417Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:19:34.226Z"), pingMs: Long("80"), lastHeartbeatMessage: "Couldn't get a connection within the time limit", syncSourceHost: '', syncSourceId: -1, infoMessage: '', configVersion: 19, configTerm: 47 } ]
mongo7253 [direct: secondary] test> rs.printSecondaryReplicationInfo()
source: p1:7253
—
source: p2:7253
—
source: dr:7253
mongo7253 [direct: secondary] test>
mongo7253 [direct: secondary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'
—
configured oplog size
'2048 MB'
—
log length start to end
'45966145 secs (12768.37 hrs)'
—
oplog first event time
'Mon Mar 14 2022 06:52:42 GMT+0000 (Coordinated Universal Time)'
—
oplog last event time
'Mon Aug 28 2023 07:15:07 GMT+0000 (Coordinated Universal Time)'
—
now
'Thu Aug 31 2023 08:22:47 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: secondary] test>
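The "73 hours" gap questioned further below can be reproduced directly from the numbers in p1's rs.printReplicationInfo() output above: p1's oplog last event time is 2023-08-28T07:15:07Z while "now" is 2023-08-31T08:22:47Z, i.e. p1 has applied nothing for roughly three days. A minimal sketch in plain JavaScript (runnable with node; the timestamps are copied verbatim from the output above):

```javascript
// Timestamps copied from rs.printReplicationInfo() as run on p1 (see output above).
const lastOplogEvent = new Date("2023-08-28T07:15:07Z"); // oplog last event time on p1
const now = new Date("2023-08-31T08:22:47Z");            // "now" as printed on p1

// How far behind the wall clock p1's last applied operation is.
const staleHours = (now - lastOplogEvent) / 3600e3;
console.log(staleHours.toFixed(2) + " hours"); // prints: 73.13 hours
```

The same arithmetic explains why lag figures computed against p1 look nonsensical: p1 simply stopped applying oplog entries on Aug 28, the same day p2's electionDate shows a new election.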
On p2: mongosh <--all-envs> --port 7253
immediately gives the mongosh prompt
mongo7253 [direct: primary] test> show dbs
admin 336.00 KiB
config 1.29 MiB
XYZ 9.05 MiB
local 132.86 MiB
mongo7253 [direct: primary] test>
rs.status()
members: [ { _id: 0, name: 'p1:7253', health: 1, state: 2, stateStr: 'SECONDARY', uptime: 263345, optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") }, optimeDurable: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") }, optimeDate: ISODate("2023-08-28T07:15:07.000Z"), optimeDurableDate: ISODate("2023-08-28T07:15:07.000Z"), lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastHeartbeat: ISODate("2023-08-31T08:24:36.250Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:24:36.228Z"), pingMs: Long("1042"), lastHeartbeatMessage: '', syncSourceHost: '', syncSourceId: -1, infoMessage: '', configVersion: 19, configTerm: 47 }, { _id: 1, name: 'p2:7253', health: 1, state: 1, stateStr: 'PRIMARY', uptime: 263355, optime: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") }, optimeDate: ISODate("2023-08-31T08:24:30.000Z"), lastAppliedWallTime: ISODate("2023-08-31T08:24:30.694Z"), lastDurableWallTime: ISODate("2023-08-31T08:24:30.694Z"), syncSourceHost: '', syncSourceId: -1, infoMessage: '', electionTime: Timestamp({ t: 1693206944, i: 1 }), electionDate: ISODate("2023-08-28T07:15:44.000Z"), configVersion: 19, configTerm: 50, self: true, lastHeartbeatMessage: '' }, { _id: 2, name: 'dr:7253', health: 1, state: 2, stateStr: 'SECONDARY', uptime: 18194, optime: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") }, optimeDurable: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") }, optimeDate: ISODate("2023-08-31T08:24:30.000Z"), optimeDurableDate: ISODate("2023-08-31T08:24:30.000Z"), lastAppliedWallTime: ISODate("2023-08-31T08:24:30.694Z"), lastDurableWallTime: ISODate("2023-08-31T08:24:30.694Z"), lastHeartbeat: ISODate("2023-08-31T08:24:34.685Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:24:35.853Z"), pingMs: Long("82"), lastHeartbeatMessage: '', syncSourceHost: 'p2:7253', syncSourceId: 1, infoMessage: '', configVersion: 19, configTerm: 50 } ]
mongo7253 [direct: primary] test> rs.printSecondaryReplicationInfo()
source: p1:7253
—
source: dr:7253
mongo7253 [direct: primary] test>
mongo7253 [direct: primary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'
—
configured oplog size
'2048 MB'
—
log length start to end
'46229587 secs (12841.55 hrs)'
—
oplog first event time
'Mon Mar 14 2022 06:52:53 GMT+0000 (Coordinated Universal Time)'
—
oplog last event time
'Thu Aug 31 2023 08:26:00 GMT+0000 (Coordinated Universal Time)'
—
now
'Thu Aug 31 2023 08:26:03 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: primary] test>
On dr: mongosh <--all-envs> --port 7253
rs.status()
members: [ { _id: 0, name: 'p1:7253', health: 1, state: 2, stateStr: 'SECONDARY', uptime: 18319, optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") }, optimeDurable: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") }, optimeDate: ISODate("2023-08-28T07:15:07.000Z"), optimeDurableDate: ISODate("2023-08-28T07:15:07.000Z"), lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"), lastHeartbeat: ISODate("2023-08-31T08:26:33.995Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:26:37.495Z"), pingMs: Long("959"), lastHeartbeatMessage: '', syncSourceHost: '', syncSourceId: -1, infoMessage: '', configVersion: 19, configTerm: 47 }, { _id: 1, name: 'p2:7253', health: 1, state: 1, stateStr: 'PRIMARY', uptime: 18320, optime: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") }, optimeDurable: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") }, optimeDate: ISODate("2023-08-31T08:26:28.000Z"), optimeDurableDate: ISODate("2023-08-31T08:26:28.000Z"), lastAppliedWallTime: ISODate("2023-08-31T08:26:28.836Z"), lastDurableWallTime: ISODate("2023-08-31T08:26:28.836Z"), lastHeartbeat: ISODate("2023-08-31T08:26:36.658Z"), lastHeartbeatRecv: ISODate("2023-08-31T08:26:37.492Z"), pingMs: Long("82"), lastHeartbeatMessage: '', syncSourceHost: '', syncSourceId: -1, infoMessage: '', electionTime: Timestamp({ t: 1693206944, i: 1 }), electionDate: ISODate("2023-08-28T07:15:44.000Z"), configVersion: 19, configTerm: 50 }, { _id: 2, name: 'dr:7253', health: 1, state: 2, stateStr: 'SECONDARY', uptime: 18334, optime: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") }, optimeDate: ISODate("2023-08-31T08:26:28.000Z"), lastAppliedWallTime: ISODate("2023-08-31T08:26:28.836Z"), lastDurableWallTime: ISODate("2023-08-31T08:26:28.836Z"), syncSourceHost: 'p2:7253', syncSourceId: 1, infoMessage: '', configVersion: 19, configTerm: 50, self: true, lastHeartbeatMessage: '' } ]
mongo7253 [direct: secondary] test> rs.printSecondaryReplicationInfo()
source: p1:7253
—
source: dr:7253
mongo7253 [direct: secondary] test>
mongo7253 [direct: secondary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'
—
configured oplog size
'2048 MB'
—
log length start to end
'46229695 secs (12841.58 hrs)'
—
oplog first event time
'Mon Mar 14 2022 06:53:15 GMT+0000 (Coordinated Universal Time)'
—
oplog last event time
'Thu Aug 31 2023 08:28:10 GMT+0000 (Coordinated Universal Time)'
—
now
'Thu Aug 31 2023 08:28:12 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: secondary] test>
=================
When I tail the log for REPL messages:
On p1: I keep getting these messages:
tail -f /instances/mongo_db7253/logs/mongodb.log | grep REPL
,"s":"I", "c":"REPL", "id":21401, "ctx":"conn26379","msg":"Scheduling heartbeat to fetch a newer config","attr":{"configTerm":50,"configVersion":19,"senderHost":"dr:7253"}}
{"t":{"$date":"2023-08-31T08:29:00.682+00:00"},"s":"I", "c":"REPL", "id":21401, "ctx":"conn2169","msg":"Scheduling heartbeat to fetch a newer config","attr":{"configTerm":50,"configVersion":19,"senderHost":"p2:7253"}}
On p2 & dr: I get nothing from
tail -f /instances/mongo_db7253/logs/mongodb.log | grep REPL
However, I do get the following:
tail -f /instances/mongo_db7253/logs/mongodb.log | grep FlowControlRefresher
,"s":"W", "c":"STORAGE", "id":22225, "ctx":"FlowControlRefresher","msg":"Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."}
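Since mongod 4.4 the log is structured JSON, so the grep-based tailing above can be made more precise by parsing each line and filtering on the "c" (component) and "id" fields instead of matching raw text. A minimal sketch in plain JavaScript (runnable with node; the sample line is copied from the p1 log above, and a tool like jq would work equally well):

```javascript
// Each mongod (4.4+) log line is a JSON document. Filter on component ("c")
// and message id instead of grepping. Sample line copied from the p1 log above.
const line = '{"t":{"$date":"2023-08-31T08:29:00.682+00:00"},"s":"I","c":"REPL","id":21401,"ctx":"conn2169","msg":"Scheduling heartbeat to fetch a newer config","attr":{"configTerm":50,"configVersion":19,"senderHost":"p2:7253"}}';

const entry = JSON.parse(line);
// 21401 = "Scheduling heartbeat to fetch a newer config",
// 22225 = the flow-control warning seen on p2 and dr above.
if (entry.c === "REPL" || entry.id === 22225) {
  console.log(entry.t.$date, entry.s, entry.c, entry.msg, JSON.stringify(entry.attr));
}
```

Filtering by id makes it easy to count how often each node emits the flow-control warning versus the stale-config heartbeat message.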
===================
Questions:
1. Why is my p1 showing a negative lag number (73 hours)?
2. Which node is the primary, p1 or p2?
   - p1 looks like the primary because it is ahead of the other 2 nodes as per rs.printSecondaryReplicationInfo() on p2 and dr.
   - p2 looks like the primary because, when connected locally using mongosh, it shows as primary.
3. Why does p1 say "no primary available at the moment" when rs.printSecondaryReplicationInfo() is run from p1?
4. Why do p2 and dr show state '(not reachable/healthy)' when rs.status() is run on p1, but show p1 and dr as secondary and p2 as primary when rs.status() is run on p2 and dr?
5. Why do p2 and dr show lastHeartbeatMessage: "Couldn't get a connection within the time limit" when rs.status() is run on p1, yet show as healthy on p2 and dr?
6. Is p1 a primary that is trying to replicate to p2 and dr but failing, and p2 a false primary?
7. Is there a bug in flow control that makes it display the message
   "Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."?
8. How do I fix this replication issue?
My MongoDB server version is 5.0.5 on all 3 nodes.
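One clue for questions 4 and 5 sits in p1's own rs.status() output: for each remote member, lastHeartbeat (the last heartbeat p1 successfully sent) is hours to days older than lastHeartbeatRecv (the last heartbeat p1 received). That asymmetry suggests p1 can accept inbound connections but cannot establish outbound ones, which is consistent with "Couldn't get a connection within the time limit". A minimal sketch in plain JavaScript (runnable with node; timestamps copied verbatim from rs.status() as run on p1):

```javascript
// Heartbeat timestamps copied from rs.status() run on p1 (see output above).
// lastHeartbeat     = last heartbeat p1 successfully SENT to that member.
// lastHeartbeatRecv = last heartbeat p1 RECEIVED from that member.
const members = [
  { name: "p2:7253", lastHeartbeat: "2023-08-28T07:15:29.974Z", lastHeartbeatRecv: "2023-08-31T08:19:31.951Z" },
  { name: "dr:7253", lastHeartbeat: "2023-08-31T03:50:33.417Z", lastHeartbeatRecv: "2023-08-31T08:19:34.226Z" },
];

for (const m of members) {
  const gapHours = (new Date(m.lastHeartbeatRecv) - new Date(m.lastHeartbeat)) / 3600e3;
  // Inbound heartbeats are fresh while outbound ones are stale: a one-way
  // connectivity problem from p1, not a down peer.
  console.log(`${m.name}: outbound heartbeat is ${gapHours.toFixed(1)} h staler than inbound`);
}
// prints: p2:7253: outbound heartbeat is 73.1 h staler than inbound
//         dr:7253: outbound heartbeat is 4.5 h staler than inbound
```

If the same check on p2 and dr shows fresh heartbeats in both directions (as their rs.status() output above suggests), the problem is specific to connections originating from p1.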