Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.2.16
Component/s: None
Labels:
None
Environment:

MongoDB Version:
    db version v3.2.16
    git version: 056bf45128114e44c5358c7a8776fb582363e094
    allocator: tcmalloc
    modules: none
    build environment:
        distarch: x86_64
        target_arch: x86_64
Operating System: Ubuntu 14.04.6 LTS
Linux Kernel Info: Linux N-Mongo-S20-1 4.4.0-93-generic #116~14.04.1-Ubuntu SMP Mon Aug 14 16:07:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


Operating System:
ALL
Steps To Reproduce:

I tried to reproduce the issue in a test environment (with the same MongoDB version and structure as the production environment, but on a smaller scale) by interrupting the primary's network (using the `ip link set eth0 down` command), but the replica set quickly performed an election, and the original secondary became the primary.
I am unable to directly reproduce the issue in the production environment where the problem occurred.


I have a MongoDB 3.2.16 sharded cluster with 20 shards; each shard has 1 primary, 1 secondary, and 1 arbiter node. Due to a network outage in one of the cloud provider's availability zones, 5 of the shard primaries lost connectivity, but during the 30-minute outage, none of these 5 shards triggered a re-election.
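For reference, the election arithmetic for this topology can be sketched as below. This is only an illustration, assuming all three members of each shard are voting members; `majority_needed` and `can_elect` are hypothetical helper names, not MongoDB APIs.

```python
def majority_needed(voting_members: int) -> int:
    """Votes required to win an election: a strict majority of voting members."""
    return voting_members // 2 + 1


def can_elect(voting_members: int, reachable_voters: int) -> bool:
    """True if the members that can still reach each other form a majority."""
    return reachable_voters >= majority_needed(voting_members)


if __name__ == "__main__":
    # Topology from this report: primary + secondary + arbiter = 3 voting members
    # (assumed; the actual replica set config is not included in the report).
    # If only the primary is cut off, the secondary and arbiter (2 voters) remain.
    print(majority_needed(3))  # 2
    print(can_elect(3, 2))     # True
    print(can_elect(3, 1))     # False
```

Under that assumption, the secondary and arbiter together would hold enough votes to elect a new primary, provided both of them consider the primary unreachable.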

I checked the logs of the 3 nodes in each of the affected shards:

* The primary node logs mainly show connection failures to other mongo nodes, with no other obvious issues.
* The secondary node logs do not have any election-related logs, and even during the network outage, the logs occasionally show successful connections to the primary.
* The arbiter node continuously prints heartbeat request timeout failures to the primary, but there are no election-related logs, for example:

2024-07-02T10:04:59.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host N-Mongo-S20-1:27017 due to bad connection status; 0 connections to that host remain open
2024-07-02T10:04:59.095+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Operation timed out
2024-07-02T10:05:04.095+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to N-Mongo-S20-1:27017
2024-07-02T10:05:14.096+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to N-Mongo-S20-1:27017; ExceededTimeLimit: Couldn't get a connection within the time limit
2024-07-02T10:05:24.096+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to N-Mongo-S20-1:27017 - ExceededTimeLimit: Operation timed out

Is this a bug related to the MongoDB version, or is there another reason why no election was triggered? What additional information should I provide to help investigate this issue?
Any advice or guidance would be greatly appreciated. Thank you!

Assignee: Chris Kelly

Reporter: Yutao Huang

Participants: Chris Kelly, Yutao Huang

Votes: 0

Watchers: 4

Created: Jul 02 2024 07:11:44 AM UTC

Updated: Jul 02 2024 10:40:42 PM UTC

Resolved: Jul 02 2024 10:40:42 PM UTC
