
MONGOID-1421: Retrying to connect when the replica set is reconfigured

    • Type: Task
    • Resolution: Done
    • Fix Version/s: 2.4.0
    • Affects Version/s: None
    • Component/s: None

      When a replica set is reconfigured (e.g. when forcing a member to become primary) the Mongo driver may raise Mongo::OperationFailure with the message "10054: not master". This happens because the primary has changed but the Mongo connection still points to the previous one. Reconnecting after this error seems to work, but only once the new primary has been elected (which can take some time).
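      For illustration, handling this at the application level today means catching the error and retrying by hand. A minimal sketch, assuming the 1.x Ruby driver; the collection handle and the update shown are hypothetical:

      retries = 0
      begin
        collection.update({ :_id => id }, { '$set' => { :active => true } }, :safe => true)
      rescue Mongo::OperationFailure => ex
        # only retry the "not master" case, and give up after a few attempts
        raise if ex.message !~ /not master/ || (retries += 1) > 5
        Kernel.sleep(0.5) # give the replica set time to elect a new primary
        retry
      end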

      While this could be handled by the application, it would make sense for Mongoid to handle this error and attempt to reconnect. In fact, Mongoid already does this in the Mongoid::Collections::Retry module, but it only rescues Mongo::ConnectionFailure. The complication is that Mongo::OperationFailure can be raised with other error messages, meaning different kinds of errors, especially when using safe mode (you can check for it here).
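      For instance, with safe mode enabled a unique index violation also surfaces as Mongo::OperationFailure, and retrying that would never succeed. An illustrative snippet (the collection and _id are hypothetical):

      # raises Mongo::OperationFailure with an "E11000 duplicate key error"
      # message when _id already exists -- this must not be retried
      collection.insert({ :_id => existing_id }, :safe => true)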

      My first attempt to solve this would be to add another rescue like this:

      def retry_on_connection_failure
        retries = 0
        begin
          yield
        rescue Mongo::ConnectionFailure => ex
          retries += 1
          raise ex if retries > Mongoid.max_retries_on_connection_failure
          Kernel.sleep(0.5)
          retry
        rescue Mongo::OperationFailure => ex
          if ex.message =~ /not master/
            # master has changed, retrying to connect
            retries += 1
            raise ex if retries > Mongoid.max_retries_on_connection_failure
            Kernel.sleep(0.5)
            retry
          else
            # some other Mongo::OperationFailure error, re-raising it
            raise ex
          end
        end
      end
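      Callers would use it the same way the existing Retry module does, wrapping each driver call. A usage sketch (the insert shown is just an example):

      retry_on_connection_failure do
        collection.insert(document, :safe => true)
      end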

      Any suggestions on this topic?

            Assignee: Unassigned
            Reporter: Vicente Mundim (vicentemundim)
            Votes: 0
            Watchers: 0
