Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.6
Component/s: Replication
Labels:
- elections
Environment:
Linux mongo0 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux

Assigned Teams:
Replication
Operating System:
Linux
Steps To Reproduce:

Not tested under controlled circumstances: Set up replica set, store MongoDB data on separate device on Primary; make that device unresponsive (but keep it mounted).


We had a hardware issue with the primary of our MongoDB replica set. The exact cause is still unknown, but it appears that I/O commands to its SSD (which holds all MongoDB data, but not the operating system or the MongoDB installation itself) never returned.

dmesg output (full output is attached):
[2195482.937229] INFO: task mongod:2731 blocked for more than 120 seconds.
[2195482.937416] mongod D ffff88063fc13780 0 2731 1 0x00000000
[2195482.937421] ffff88033147d1e0 0000000000000086 ffff880600000000 ffff880333239590
[2195482.937426] 0000000000013780 ffff8803324adfd8 ffff8803324adfd8 ffff88033147d1e0
[2195482.937432] ffffffff8101360a 00000001810660a1 ffff8803316822f0 ffff88063fc13fd0
[.....]

MongoDB's log file does not show anything out of the ordinary.

Result:
The replica set's heartbeat thought that our primary was fine, but it was not actually doing any work (all it did was wait on a broken disk). Connections therefore piled up and our entire application stalled. As soon as I manually shut down MongoDB on that machine, the failover happened as it should (although the Java driver didn't recover properly after that, but that's a separate issue).

Now, I'm not even sure if this is a valid bug report, but I think there is some room for improvement in the replica set's heartbeat code. I can imagine various situations in which a machine is responding to heartbeats but not actually working, e.g. "swap to death" situations, all sorts of I/O issues (e.g. an NFS/iSCSI/whatever mounted file system with network problems), or hardware issues similar to the ones we had.
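
To illustrate the kind of check this report is asking for, here is a minimal sketch of a storage liveness probe that a health check could consult in addition to a network heartbeat. It is not MongoDB's heartbeat code; the names `storage_is_responsive`, `_write_and_sync` and the `io_probe` parameter are hypothetical and exist only for this example. The idea is to perform a small write plus `fsync` in the data directory on a separate thread and to treat the node as unhealthy if that does not finish within a timeout.

```python
import os
import tempfile
import threading


def _write_and_sync(directory, done):
    """Write a small file under `directory`, fsync it, then signal `done`."""
    fd, path = tempfile.mkstemp(prefix=".io-probe-", dir=directory)
    try:
        os.write(fd, b"ping")
        os.fsync(fd)  # blocks for as long as the device does not complete the write
    finally:
        os.close(fd)
        os.unlink(path)
    done.set()


def storage_is_responsive(directory, timeout_seconds=5.0, io_probe=_write_and_sync):
    """Return True if `io_probe` completes within `timeout_seconds`.

    The probe runs on a daemon thread because a thread stuck in
    uninterruptible I/O (state "D", as in the dmesg output above) cannot be
    cancelled; the caller just stops waiting for it. A probe that raises
    (for example on an I/O error) never sets `done`, so it is also reported
    as unresponsive once the timeout expires.
    """
    done = threading.Event()
    worker = threading.Thread(target=io_probe, args=(directory, done), daemon=True)
    worker.start()
    return done.wait(timeout_seconds)
```

A node whose probe returns False could, for example, stop answering heartbeats or step down. Which reaction is appropriate, and what timeout avoids false positives on a merely busy disk, is a design decision this sketch does not settle.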


Attachment: dmesg.txt (88 kB, Oct 01 2013 09:53:17 AM UTC)

Is related to: SERVER-6028 Too many open connections kills primary but doesn't trigger failover (Closed)

Assignee: [DO NOT USE] Backlog - Replication Team

Reporter: David Gubler

Participants: [DO NOT USE] Backlog - Replication Team, Daniel Pasette, David Gubler, Judah Schvimer

Votes: 0

Watchers: 9

Created: Oct 01 2013 09:53:17 AM UTC

Updated: Dec 06 2022 05:16:50 AM UTC

Resolved: Jan 03 2020 07:37:38 PM UTC

