Node.js Driver / NODE-893

Replica Set: Unavailability when a primary fails and results in timeout errors

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.2.13, 2.2.14
    • Affects Version/s: None
    • Component/s: None

      I hit this issue while working with MongoDB Atlas. Twice so far in the last 10 days, the primary went down inexplicably for about 30 minutes. Just for reference, the server-side issue is being tracked in MMSSUPPORT-12572.

      As a client, I'm using Node.js 6.5.0 with Mongoose 4.6.8, which bundles the mongodb native driver 2.2.11; I was also able to reproduce the issue with Mongoose 4.7.2-pre, which bundles driver 2.2.12. I have several processes connecting to the database. I'm also using the recommended connection string format: enumerating all members and providing the replica set name.
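      For reference, this is the connection string format I mean, as a minimal sketch (hostnames, database, and set name below are placeholders, not my actual cluster):

```javascript
// Build the recommended replica-set connection string:
// enumerate every member and pass the replica set name.
// All hostnames/ports and the set name here are placeholders.
function buildReplSetUri(members, replicaSet, db) {
  return 'mongodb://' + members.join(',') + '/' + db +
         '?replicaSet=' + replicaSet;
}

const uri = buildReplSetUri(
  ['host0.example.com:27017', 'host1.example.com:27017', 'host2.example.com:27017'],
  'rs0',
  'mydb'
);
console.log(uri);
// mongodb://host0.example.com:27017,host1.example.com:27017,host2.example.com:27017/mydb?replicaSet=rs0
```

      Listing all members (rather than only the primary) is what lets the driver fail over when one host is unreachable, which is exactly the scenario below.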

      When the primary goes down, the remaining members of the replica set time out when attempting to connect to it, run an election, and correctly choose a new primary. However, they keep trying to contact the failed primary, resulting in timeout errors. That by itself would be fine, except that connecting clients also start timing out after some time, even when connecting to the working members (see the description below). I'm assuming this is the reason, or part of it, because I never see this behavior when all members of the replica set are up.

      The client stops working when the primary goes down, and to be honest I'm not sure why. Restarting the client seems to help. What happens is that the client also tries to connect to the failed primary, resulting in a timeout error. After that error, the client appears to connect through the remaining members of the replica set, but after a short time it gets disconnected from the database, leaving it unusable for any operation that requires database access.

      When trying to replicate the issue, the key is to simulate a timeout from the failed primary. It is not the same as just killing the primary, because that results in a much faster “Connection refused” error; what happens instead is a timeout error.

      Here's some insight from my research:

      I was able to replicate the issue on a local replica set, deployed as described in https://docs.mongodb.com/v3.2/tutorial/deploy-replica-set-for-testing/. I put the primary behind a TCP proxy (https://github.com/Shopify/toxiproxy) and then simulated a network outage like the one that happens on MongoDB Atlas when a primary goes down. The simulation works by adding a timeout toxic to both the upstream and downstream directions.

      Steps to set up the environment:

      toxiproxy-server  &
      toxiproxy-cli create mongod_primary -l localhost:12345 -u localhost:27017
      

      When adding the first member to the replica set, use port 12345, which is the proxy port.
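      Concretely, member 0 of the replica set config ends up pointing at the proxy instead of at mongod directly. A sketch of what gets passed to rs.initiate() in the mongo shell (the set name and the other members' ports are assumed from the linked tutorial, not copied from my setup):

```javascript
// Replica set config for the test deployment: member 0 goes through
// the toxiproxy listen port (12345) instead of mongod's real port
// (27017); the other two members use the tutorial's default ports.
const rsConfig = {
  _id: 'rs0',
  members: [
    { _id: 0, host: 'localhost:12345' }, // primary, via toxiproxy
    { _id: 1, host: 'localhost:27018' },
    { _id: 2, host: 'localhost:27019' }
  ]
};
// In the mongo shell you would then run: rs.initiate(rsConfig)
console.log(rsConfig.members[0].host); // localhost:12345
```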

      Then, connect a client to the replica set (make sure the connection string lists the primary with port 12345 as well), and execute the following:

      toxiproxy-cli toxic add mongod_primary -t timeout -a timeout=0 --downstream
      toxiproxy-cli toxic add mongod_primary -t timeout -a timeout=0 --upstream
      

      You’ll see that the other members elect a new primary, but the mongo client becomes unusable: all attempts to use the database simply never return to the caller.
      Even after restarting the client, it often gets disconnected because of a timeout error (against the available members, which is weird, because they are working). The same happens if you remove the failing member from the connection string. It’s almost as if, when the remaining members try to connect to the failed primary and those attempts hang until they time out, the availability of those members gets reduced somehow, causing connection drops / timeouts for the connecting clients.

      PS: I found that toxiproxy crashes on Mac OS X Sierra after a short time of use. The issue went away after building the project from source with the latest Go version.

        1. app_server.log
          68 kB
          Gian Franco Zabarino
        2. replicas.log
          68 kB
          Gian Franco Zabarino

            Assignee:
            christkv Christian Amor Kvalheim
            Reporter:
            gfzabarino Gian Franco Zabarino
            Votes:
            0
            Watchers:
            4
