  1. Python Driver
  2. PYTHON-917

MongoClient keeps trying auth on recovering member after primary hangup

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.7
    • Affects Version/s: 2.6.3
    • Component/s: None
    • Labels: None

      UPDATE: Although the presentation of this issue was new to us, it's a known bug in PyMongo before 2.7. I didn't realize I'd fixed it, along with a large class of similar bugs related to replica set reconnection, when I rewrote MongoClient for PYTHON-487.

      PyMongo 2's rapidly obsolescing MongoClient can get stuck trying to authenticate to a recovering member, even if a primary is available. There are a few ways this can happen, all intricate. The particular case in which this was reported was:

      1. Replica set with a primary "A" and a resyncing member "B"
      2. MongoClient started with connection string "A,B" and no "replicaSet" keyword (also note, not PyMongo 2's MongoReplicaSetClient)
      3. MongoClient.database.authenticate("user", "password") succeeds against the primary
      4. An operation ("find_one" or whatever) fails against the primary with network error
      5. On the next operation, MongoClient attempts rediscovery by calling "ismaster" on A and B again. Because it has cached the "user" / "password" credentials, it also attempts authentication against each node as it connects.
      6. When it tries to reach host "B", B is resyncing and doesn't have the user's record yet, so auth fails.
      7. MongoClient throws OperationFailure("auth fails"), and continues to do so even after the primary becomes available again.
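
      The sequence above can be modeled without a server. The following is a toy sketch, not PyMongo's code: ToyMember, ToyClient, NetworkError, and the local OperationFailure are hypothetical stand-ins, and the assumption that the client pins itself to a member before replaying cached credentials is a simplification of the behavior described in steps 5 through 7.

```python
class OperationFailure(Exception):
    """Stand-in for pymongo.errors.OperationFailure."""


class NetworkError(Exception):
    """Stand-in for a socket-level failure."""


class ToyMember(object):
    """Hypothetical replica set member, not a real server."""

    def __init__(self, name, is_primary, has_user):
        self.name = name
        self.is_primary = is_primary
        self.has_user = has_user  # False while the member is resyncing
        self.up = True

    def connect(self, credentials):
        if not self.up:
            raise NetworkError('%s: connection closed' % self.name)
        if credentials is not None and not self.has_user:
            raise OperationFailure('auth fails')


class ToyClient(object):
    """Simplified model of the buggy reconnect logic."""

    def __init__(self, seeds):
        self.seeds = seeds
        self.pinned = None       # member the client currently targets
        self.credentials = None  # cached by authenticate()
        self._rediscover()

    def authenticate(self, user, password):
        self.credentials = (user, password)
        self.pinned.connect(self.credentials)

    def _rediscover(self):
        for member in self.seeds:
            self.pinned = member  # pinned before the member is vetted
            try:
                # Cached credentials are replayed on every member (step 5).
                member.connect(self.credentials)
            except NetworkError:
                continue  # network errors move on to the next seed
            if member.is_primary:
                return
        self.pinned = None
        raise NetworkError('no primary available')

    def find_one(self):
        if self.pinned is None:
            # An OperationFailure raised here leaves self.pinned set (step 7).
            self._rediscover()
        try:
            self.pinned.connect(self.credentials)
        except NetworkError:
            self.pinned = None  # step 4: drop the member, rediscover next time
            raise
        return 'document from %s' % self.pinned.name


a = ToyMember('A', is_primary=True, has_user=True)
b = ToyMember('B', is_primary=False, has_user=False)
client = ToyClient([a, b])
client.authenticate('user', 'password')

outcomes = []
a.up = False  # primary hangs up
for attempt in range(4):
    if attempt == 2:
        a.up = True  # primary is reachable again
    try:
        outcomes.append(client.find_one())
    except (NetworkError, OperationFailure) as exc:
        outcomes.append(type(exc).__name__)
print(outcomes)
```

      In this model the client stays pinned to B after the failed auth, so every later find_one raises OperationFailure even once A is reachable again.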

      This can be reproduced with MockupDB. First "pip install git+git://github.com/ajdavis/mongo-mockup-db.git". MockupDB requires PyMongo 3, so run it in a separate virtualenv. Start a mock replica set with two members:

      from time import sleep
      
      from mockupdb import MockupDB, OpQuery
      
      
      primary, recovering = servers = [MockupDB(port) for port in (2000, 2001)]
      for server in servers:
          server.verbose = True
          server.run()
      
      hosts = [server.address_string for server in servers]
      primary.autoresponds(
          'ismaster',
          ismaster=True, setName='rs', hosts=hosts)
      
      recovering.autoresponds(
          'ismaster',
          ismaster=False, secondary=False, setName='rs', hosts=hosts)
      
      # Recovering member hasn't replicated user records yet: nonce ok, auth fails.
      recovering.autoresponds('getnonce', nonce='abcd')
      recovering.autoresponds('authenticate', ok=0, code=18, errmsg='auth fails')
      
      # Initial auth succeeds on primary, next op fails with network error.
      primary.receives('getnonce', timeout=100).ok(nonce='abcd')
      primary.receives('authenticate').ok()
      primary.receives().hangup()
      
      # Primary returns but MongoClient won't use it.
      primary.autoresponds(OpQuery, {})
      primary.autoresponds('getnonce', nonce='abcd')
      primary.autoresponds('authenticate')
      
      # Wait for Ctrl-C.
      sleep(1000)
      

      Then connect a client with PyMongo 2 (this was tested with 2.6.3, but all recent PyMongo 2 versions will act the same):

      import traceback
      from time import sleep
      
      from pymongo import MongoClient
      
      client = MongoClient('localhost:2000,localhost:2001')
      client.db.authenticate('user', 'password')
      
      for _ in range(15):
          try:
              print client.db.collection.find_one()
          except Exception:
              traceback.print_exc()
              sleep(0.5)
      

      The client's initial auth succeeds, then each find_one fails with an OperationFailure and characteristic traceback:

      OperationFailure: command SON([('authenticate', 1), ('user', u'user'), ('nonce', u'abcd'), ('key', u'3cb54e6d2ddc126d9fb2445b068020ab')]) failed: auth fails
      Traceback (most recent call last):
        File "CS-20300-client.py", line 11, in <module>
          print client.db.collection.find_one()
        File "pymongo/collection.py", line 604, in find_one
          for result in self.find(spec_or_id, *args, **kwargs).limit(-1):
        File "pymongo/cursor.py", line 904, in next
          if len(self.__data) or self._refresh():
        File "pymongo/cursor.py", line 848, in _refresh
          self.__uuid_subtype))
        File "pymongo/cursor.py", line 782, in __send_message
          res = client._send_message_with_response(message, **kwargs)
        File "pymongo/mongo_client.py", line 1038, in _send_message_with_response
          sock_info = self.__socket()
        File "pymongo/mongo_client.py", line 777, in __socket
          self.__check_auth(sock_info)
        File "pymongo/mongo_client.py", line 469, in __check_auth
          sock_info, self.__simple_command)
        File "pymongo/auth.py", line 214, in authenticate
          auth_func(credentials[1:], sock_info, cmd_func)
        File "pymongo/auth.py", line 194, in _authenticate_mongo_cr
          cmd_func(sock_info, source, query)
        File "pymongo/mongo_client.py", line 607, in __simple_command
          helpers._check_command_response(response, None, msg)
        File "pymongo/helpers.py", line 147, in _check_command_response
          raise OperationFailure(msg % errmsg, code)
      

      PyMongo 3's MongoClient, on the other hand, behaves as designed: it throws AutoReconnect once, then successfully reconnects to the primary.

      This shows a couple of bugs in PyMongo 2's MongoClient. First, it shouldn't attempt auth against a member in an unknown state during reconnection. PyMongo 3's MongoClient does not. Second, if there is any failure while rediscovering the state of a member, MongoClient shouldn't stay pinned to that member's host and port afterward. Again, this is fixed in PyMongo 3.
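
      Because PyMongo 3 raises AutoReconnect once and then recovers, application code can absorb the error with a single retry. A minimal sketch, assuming a hypothetical helper named retry_on_reconnect and a local AutoReconnect class standing in for pymongo.errors.AutoReconnect so the example runs without a server:

```python
class AutoReconnect(Exception):
    """Stand-in for pymongo.errors.AutoReconnect."""


def retry_on_reconnect(operation, attempts=2):
    """Call operation(), retrying while it raises AutoReconnect.

    Hypothetical helper, not part of PyMongo. The last attempt's
    AutoReconnect is re-raised to the caller.
    """
    for remaining in range(attempts - 1, -1, -1):
        try:
            return operation()
        except AutoReconnect:
            if remaining == 0:
                raise


calls = []


def flaky_find_one():
    # Simulates the designed behavior: fail once, then succeed.
    calls.append(1)
    if len(calls) == 1:
        raise AutoReconnect('connection closed')
    return {'ok': 1}


result = retry_on_reconnect(flaky_find_one)
print(result, len(calls))
```

      With a real client, the operation passed in would be something like a lambda wrapping client.db.collection.find_one.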

      This bug has existed in PyMongo 2 for as long as PyMongo has supported authentication.

      A possible solution is to update this code in PyMongo 2's MongoClient.__find_node:

              for candidate in candidates:
                  try:
                      node, ismaster, isdbgrid, res_time = self.__try_node(candidate)
                      # ... snip ...
                      return node
                  except OperationFailure:
                      # The server is available but something failed, probably auth.
                      raise
                  except Exception, why:
                      errors.append(str(why))
      

      This code was written assuming that an auth failure is a permanent and global condition that should be raised at once, rather than transient and particular to the RS member being tried, the way a network error might be. MongoClient might instead treat OperationFailure as it does other exceptions: keep trying more nodes.

      I've briefly tested this and it fixes this bug, at the cost of backward compatibility: if we make the change, auth errors from MongoClient's constructor will raise AutoReconnect instead of the expected OperationFailure.
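
      The shape of that change, and its compatibility cost, can be sketched in isolation. This is not the PyMongo source: find_node and try_node are hypothetical stand-ins for MongoClient.__find_node and MongoClient.__try_node, and the exception classes are local substitutes for those in pymongo.errors.

```python
class OperationFailure(Exception):
    """Stand-in for pymongo.errors.OperationFailure."""


class AutoReconnect(Exception):
    """Stand-in for pymongo.errors.AutoReconnect."""


def find_node(candidates, try_node):
    """Return the first candidate that try_node accepts.

    try_node is a hypothetical callable that returns the node on
    success and raises on any failure.
    """
    errors = []
    for candidate in candidates:
        try:
            return try_node(candidate)
        except Exception as why:
            # OperationFailure is no longer re-raised at once: an auth
            # failure is treated as particular to this member, like a
            # network error, so the loop keeps trying other nodes.
            errors.append(str(why))
    raise AutoReconnect(', '.join(errors))


def try_node(node):
    # Assumed scenario: the member on port 2001 is resyncing.
    if node == 'localhost:2001':
        raise OperationFailure('auth fails')
    return node


chosen = find_node(['localhost:2001', 'localhost:2000'], try_node)
print(chosen)

# The compatibility cost: when every node rejects auth, the caller now
# sees AutoReconnect rather than OperationFailure.
try:
    find_node(['localhost:2001'], try_node)
except AutoReconnect as exc:
    all_failed_error = exc
```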

      In my opinion it is best to leave PyMongo 2 as-is and encourage users to upgrade.

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jesse@mongodb.com A. Jesse Jiryu Davis