  1. Core Server
  2. SERVER-92049

Replica set primary node network failure does not trigger secondary node election

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.16
    • Component/s: None
    • Environment:
    • ALL

      I tried to reproduce the issue in a test environment (with the same MongoDB version and structure as the production environment, but on a smaller scale) by interrupting the primary's network (using the `ip link set eth0 down` command), but the replica set quickly performed an election, and the original secondary became the primary.
      I am unable to directly reproduce the issue in the production environment where the problem occurred.


      I have a MongoDB 3.2.16 sharded cluster with 20 shards; each shard has 1 primary, 1 secondary, and 1 arbiter node. Due to a network outage in one of the cloud provider's availability zones, 5 of the shard primaries lost connectivity, but during the 30-minute outage, none of these 5 shards triggered a re-election.
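
For reference, the vote arithmetic behind the expectation of a failover can be sketched as below. This is a minimal illustration, assuming all three members of a shard (primary, secondary, arbiter) are voting members; the function names are hypothetical and not part of any MongoDB API. Holding a majority is necessary but not sufficient: the secondary must also consider the primary unreachable before it stands for election.

```python
def majority(voting_members: int) -> int:
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1


def can_elect(voting_members: int, reachable_voters: int) -> bool:
    """True if the members that can still reach each other form a majority."""
    return reachable_voters >= majority(voting_members)


# One shard as described: primary + secondary + arbiter (all assumed to vote).
VOTING = 3

print(majority(VOTING))      # 2
print(can_elect(VOTING, 2))  # True: secondary + arbiter can reach each other
print(can_elect(VOTING, 1))  # False: an isolated primary cannot keep a majority
```

Under these assumptions, the secondary and arbiter of each affected shard held enough votes to elect a new primary, so the missing election is not explained by vote count alone.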

      I checked the logs of the 3 nodes in each of the affected shards:

      • The primary node logs mainly show connection failures to other mongo nodes, with no other obvious issues.
      • The secondary node logs do not have any election-related logs, and even during the network outage, the logs occasionally show successful connections to the primary.
      • The arbiter node continuously prints heartbeat request timeout failures to the primary, but there are no election-related logs, for example:
      2024-07-02T10:04:59.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host N-Mongo-S20-1:27017 due to bad connection status; 0 connections to that host remain open
      2024-07-02T10:04:59.095+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Operation timed out
      2024-07-02T10:05:04.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to N-Mongo-S20-1:27017
      2024-07-02T10:05:14.096+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
      2024-07-02T10:05:24.096+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to N-Mongo-S20-1:27017 - ExceededTimeLimit: Operation timed out
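
To compare heartbeat failures across the three nodes of a shard, log lines in the format quoted above can be tallied with a small script. This is only a sketch: the regular expressions are written against the five sample lines shown here and are an assumption about the rest of the log, and `summarize_heartbeat_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# <timestamp> <severity> <component> [<context>] <message>, as in the excerpt.
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<sev>[A-Z])\s+(?P<comp>\S+)\s+\[(?P<ctx>[^\]]+)\]\s+(?P<msg>.*)$"
)
HEARTBEAT_ERROR = re.compile(
    r"Error in heartbeat request to (?P<host>[^;\s]+); (?P<code>\w+): (?P<reason>.*)"
)


def summarize_heartbeat_errors(lines):
    """Count heartbeat errors per (host, error code) and collect their timestamps."""
    counts = Counter()
    timestamps = []
    for raw in lines:
        m = LOG_LINE.match(raw.strip())
        if not m or m.group("comp") != "REPL":
            continue  # skip non-replication lines such as ASIO messages
        hb = HEARTBEAT_ERROR.search(m.group("msg"))
        if hb:
            counts[(hb.group("host"), hb.group("code"))] += 1
            timestamps.append(m.group("ts"))
    return counts, timestamps


SAMPLE = [
    "2024-07-02T10:04:59.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host N-Mongo-S20-1:27017 due to bad connection status; 0 connections to that host remain open",
    "2024-07-02T10:04:59.095+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Operation timed out",
    "2024-07-02T10:05:04.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to N-Mongo-S20-1:27017",
    "2024-07-02T10:05:14.096+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Couldn't get a connection within the time limit",
    "2024-07-02T10:05:24.096+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to N-Mongo-S20-1:27017 - ExceededTimeLimit: Operation timed out",
]

counts, timestamps = summarize_heartbeat_errors(SAMPLE)
for (host, code), n in counts.items():
    print(f"{host} {code}: {n} heartbeat errors")
```

Running the same tally over the secondary's log for the outage window would show whether it also recorded heartbeat failures to the primary, or whether its heartbeats kept succeeding.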

       
      Is this a bug related to the MongoDB version, or is there another reason why no election was triggered? What additional information should I provide to help investigate this issue?
      Any advice or guidance would be greatly appreciated. Thank you!

            Assignee:
            chris.kelly@mongodb.com Chris Kelly
            Reporter:
            476420725@qq.com Yutao Huang