Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.6
Component/s: Replication
Environment: Linux mongo0 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1+deb7u1 x86_64 GNU/Linux
We had a hardware issue with our Mongo replica set primary. The exact reason is still unknown, but it appears that I/O commands to its SSD (which holds all MongoDB data but not the operating system or the MongoDB installation itself) did not return.
dmesg output (full output is attached):
[2195482.937229] INFO: task mongod:2731 blocked for more than 120 seconds.
[2195482.937416] mongod D ffff88063fc13780 0 2731 1 0x00000000
[2195482.937421] ffff88033147d1e0 0000000000000086 ffff880600000000 ffff880333239590
[2195482.937426] 0000000000013780 ffff8803324adfd8 ffff8803324adfd8 ffff88033147d1e0
[2195482.937432] ffffffff8101360a 00000001810660a1 ffff8803316822f0 ffff88063fc13fd0
[.....]
MongoDB's log file does not show anything out of the ordinary.
Result:
The replica set's heartbeat thought that our primary was fine, but it was not actually doing any work (all it did was wait for a broken disk). Thus connections piled up and our entire application stalled. As soon as I manually shut down MongoDB on that machine, the failover happened as it should (although the Java driver didn't recover properly after that, but that's a separate issue).
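For reference, the shutdown does not have to be a full process kill: a step-down can also be requested via the replSetStepDown admin command. Below is a minimal sketch using pymongo, with a hypothetical host name; whether such a command even gets through when the node is wedged on I/O is exactly the open question of this report.

# Hedged sketch (not part of the original report): ask the primary to step
# down instead of killing the mongod process. Host and port are hypothetical.
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient("mongo0.example.com", 27017, socketTimeoutMS=5000)
try:
    # Relinquish the primary role for 60 seconds so another member
    # can win the election.
    client.admin.command("replSetStepDown", 60)
except AutoReconnect:
    # A stepping-down primary closes all its connections, so this
    # exception is expected and usually means the command took effect.
    pass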
Now, I'm not even sure whether this is a valid bug report, but I think there is some room for improvement in the replica set's heartbeat code. I can imagine various situations in which a machine responds to heartbeats but does not actually do any work: "swap to death" situations, all sorts of I/O issues (e.g. an NFS-, iSCSI- or otherwise network-mounted file system with connectivity problems), or hardware issues similar to the one we had.
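As an operations-side stopgap until the heartbeat verifies more than network liveness, an external watchdog can periodically perform a small timed write against the primary and alert (or trigger a step-down) when the write stalls even though the node still answers pings. A minimal sketch with pymongo follows; the host name, canary namespace, timeouts, and alerting hook are all hypothetical assumptions, not anything from this ticket.

# Hedged sketch: external canary-write watchdog. All names (host, database,
# collection, thresholds) are hypothetical.
import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongo0.example.com", 27017,
                     serverSelectionTimeoutMS=5000,
                     connectTimeoutMS=5000,
                     socketTimeoutMS=5000)  # treat a stalled reply as failure

while True:
    try:
        # ping only proves the node answers on the network, which is
        # roughly what the replica set heartbeat already checks...
        client.admin.command("ping")
        # ...so additionally touch the data files with a tiny write.
        client.canary.heartbeat.replace_one(
            {"_id": "canary"}, {"_id": "canary", "ts": time.time()},
            upsert=True)
    except PyMongoError:
        # The write stalled or failed while the node may still look "up":
        # alert here, or force a failover (e.g. via replSetStepDown).
        print("canary write failed - primary may be wedged on I/O")
    time.sleep(10)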
is related to: SERVER-6028 "Too many open connections kills primary but doesn't trigger failover" (Closed)