Core Server / SERVER-28422

Cluster stuck because replication heartbeat does not detect hanging members

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.8, 3.4.2
    • Component/s: Replication, Stability

      I couldn't reproduce the same IO issue we experienced because the bare-metal setup is complex and it would probably require the same hardware, disks, RAID controller, etc.

      But I managed to reproduce the exact same symptoms using NFS:

      (setup for Ubuntu)

      1- Set up a simple NFS server exporting an empty directory: https://help.ubuntu.com/community/SettingUpNFSHowTo

      2- Install nfs-common and mount an NFS directory:

      mount -t nfs -o noatime,bg,nolock,proto=tcp,port=2049 <nfs-server-ip>:/ /mongodata1
      

      3- Create a replica set of 2 members and an arbiter. One of the members, the PRIMARY, will have its storage.dbPath pointing to the NFS directory: /mongodata1

      4- Write data to the replica set and check that everything works as expected, test primary switches, etc.

      5- When the member whose dbPath points to the NFS directory is the PRIMARY, stop the NFS daemon on the NFS server:

      service nfs-kernel-server stop
      

      6- Keep writing to the replica set. Writes will succeed for some time (probably because they are still being absorbed by the file system cache), but if you run 'show dbs' it hangs, and after a few seconds the writes hang too.
      The hanging member (PRIMARY) is never transitioned to SECONDARY, so the whole replica set becomes unresponsive.
      The hanging mongod process cannot be stopped with SIGTERM; you need SIGKILL.
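      The steps above can be sketched end to end on a single host. Everything here is a hypothetical layout (ports, hostnames, log paths, the second data directory and the arbiter directory), not taken from the original environment; only /mongodata1 and the mount options come from the steps above:

      ```shell
      # Hypothetical single-host layout; ports and paths are assumptions.
      mkdir -p /mongodata1 /mongodata2 /mongoarb
      mount -t nfs -o noatime,bg,nolock,proto=tcp,port=2049 <nfs-server-ip>:/ /mongodata1

      # Member 1 on the NFS-backed dbPath; member 2 and the arbiter on local disk.
      mongod --replSet rs0 --port 27017 --dbpath /mongodata1 --fork --logpath /var/log/mongo1.log
      mongod --replSet rs0 --port 27018 --dbpath /mongodata2 --fork --logpath /var/log/mongo2.log
      mongod --replSet rs0 --port 27019 --dbpath /mongoarb  --fork --logpath /var/log/mongoarb.log

      # Initiate the replica set: two data members plus an arbiter. Giving the
      # NFS-backed member a higher priority makes it the PRIMARY for the test.
      mongo --port 27017 <<'EOF'
      rs.initiate({
        _id: "rs0",
        members: [
          { _id: 0, host: "localhost:27017", priority: 2 },
          { _id: 1, host: "localhost:27018" },
          { _id: 2, host: "localhost:27019", arbiterOnly: true }
        ]
      })
      EOF
      ```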

      This can be extended to a sharded cluster. Create another replica set, add the shard through mongos, shard a collection, and write data to that collection from mongos.
      Then do the same as before: stop the NFS daemon on the NFS server.
      You will see the same issue from mongos: after a few seconds the queries hang, or if you don't want to wait, just run 'show dbs'.

      At this point, the whole cluster is unresponsive.
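      The sharded variant in outline. This is a hypothetical sketch: the mongos port (27020), the second shard's port, and the test.coll namespace are assumptions; rs0 is taken to be the NFS-backed replica set, and a config server setup is assumed to already exist:

      ```shell
      mongo --port 27020 <<'EOF'
      // Add both replica sets as shards; rs0 holds the NFS-backed PRIMARY.
      sh.addShard("rs0/localhost:27017")
      sh.addShard("rs1/localhost:27027")
      sh.enableSharding("test")
      sh.shardCollection("test.coll", { _id: "hashed" })
      // Keep writing through mongos; after the NFS daemon is stopped,
      // these inserts (and any 'show dbs') eventually hang.
      for (var i = 0; i < 100000; i++) { db.getSiblingDB("test").coll.insert({ i: i }) }
      EOF
      ```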

      Another important thing to note: with the IO locking issue we had, it once happened on a secondary member. That also left the whole replica set stuck, and therefore the whole cluster too. I couldn't manage to reproduce that case with NFS.

      I've reproduced this in 3.2.8 and in 3.4.2


      We've hit a bug that has made our entire MongoDB cluster (15 bare-metal replica sets of 2 members + an arbiter each) unresponsive several times.

      Whenever an issue occurs that makes the mongod process hang, the cluster gets stuck too; this condition should be detected by the replication heartbeat and trigger a primary switch.

      In our case, we had IO issues that left the mongod process blocked waiting on IO, making it unresponsive: no queries could be performed against that member, they all hung on the IO wait.
      When that happens, the faulty member is not removed from the replica set and no primary switch occurs, so the faulty member remains the primary, making the whole replica set unresponsive.
      What is worse, if a replica set is stuck in this way, the whole cluster is stuck, and all queries done through mongos hang.

      The heartbeat, according to the documentation, is doing a ping. I'm not sure what kind of ping, but it is not enough to detect a bad member: if the problem is IO (one of the main failure modes in databases), the ping, and even a TCP connection, still work.
      This heartbeat should do something more sophisticated, like performing a query that reads from disk (importantly, not from the cached memory).
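      To illustrate the kind of check being proposed (this is a hypothetical external watchdog, not an existing mongod feature): read a small block from a file on the volume under test after asking the kernel to drop its cached pages, and bound the read with a timeout, so a hung storage layer produces a detectable failure instead of a successful ping.

      ```shell
      #!/bin/sh
      # Hypothetical storage probe: $1 is any file on the volume under test,
      # e.g. a file inside dbPath. Prints "healthy" or "hung".
      probe() {
        # iflag=nocache asks the kernel to drop the file's cached pages first,
        # so the read has to touch the underlying storage rather than memory.
        if timeout 5 dd if="$1" of=/dev/null bs=4096 count=1 iflag=nocache status=none
        then echo healthy
        else echo hung
        fi
      }

      # Demo against a throwaway file; on a working volume this prints "healthy".
      tmpfile=$(mktemp)
      echo "some data" > "$tmpfile"
      probe "$tmpfile"
      rm -f "$tmpfile"
      ```

      On an NFS mount whose server has been stopped, the dd blocks and the timeout converts the hang into a "hung" result within 5 seconds.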

      And when a member is completely stuck and unresponsive, it is probably worth considering removing it from the replica set rather than just transitioning it to secondary, because all the replication threads between the primary and that member will hang.

            Assignee:
            ramon.fernandez@mongodb.com Ramon Fernandez Marina
            Reporter:
            victorgp VictorGP
            Votes:
            0
            Watchers:
            4

              Created:
              Updated:
              Resolved: