  1. Core Server
  2. SERVER-66116

Aborted Read with MongoNotPrimaryException

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.4.9
    • Component/s: None
    • None
    • Replication
    • ALL
    • Steps to Reproduce:

      Grab a Jepsen environment with five nodes and https://github.com/jepsen-io/mongodb at da4a3fcef9298b4658db435991a402afe7497f00, then run (e.g.):

      lein run test --nodes-file ~/nodes -w list-append -r 1000 --concurrency 3n --max-writes-per-key 16 --read-concern majority --write-concern majority --txn-read-concern snapshot --txn-write-concern majority --time-limit 300 --nemesis partition --test-count 5

    • Repl 2022-05-16, Repl 2022-05-30, Repl 2022-06-13, Repl 2022-06-27, Repl 2022-07-11, Repl 2022-08-08, Repl 2022-08-22, Repl 2022-09-05, Repl 2022-09-19, Repl 2022-07-25, Repl 2022-10-03

      It looks like MongoNotPrimaryException (or whatever the protocol response is that triggers this error in the Java driver) might actually be an indefinite error, rather than a definite failure. Consider this pair of operations from a Jepsen list-append test:

       

      {:type :fail, :f :txn, :value [[:append 855 3]], :time 36272337272, :process 36, :error :not-primary, :index 56335}
      {:type :ok, :f :txn, :value [[:r 855 [3]]], :time 38283284542, :process 42, :index 57897}
      

      In this case both "transactions" are actually single-document operations. The first operation performs a single findAndModify to $push the number 3 onto a list in document 855; that write threw a MongoNotPrimaryException. The second is a read of document 855, which observed that write of 3.

      The documentation for MongoNotPrimaryException says that the server "refused to execute... a write operation", which seems fairly plain: the write of 3 must not have happened. Since we go on to read 3, this looks like an aborted read.
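To make the anomaly concrete, here is a minimal Python sketch of the check described above: find every element that an :ok read observed but that was appended only by a :fail operation. The two operations are transcribed by hand from the EDN history into Python dicts. The function name `aborted_reads` is hypothetical and is not part of Jepsen or Elle. The sketch assumes each appended element is unique per key, as in a list-append workload.

```python
# The two operations from the report, transcribed from EDN into Python dicts.
history = [
    {"type": "fail", "f": "txn", "value": [["append", 855, 3]],
     "time": 36272337272, "process": 36, "error": "not-primary", "index": 56335},
    {"type": "ok", "f": "txn", "value": [["r", 855, [3]]],
     "time": 38283284542, "process": 42, "index": 57897},
]


def aborted_reads(history):
    """Hypothetical helper: return (failed_index, read_index, key, element)
    for each element observed by an :ok read but appended by a :fail op.

    Assumes every appended element is unique per key.
    """
    # Map (key, element) -> index of the failed operation that tried to append it.
    failed_appends = {}
    for op in history:
        if op["type"] == "fail":
            for f, key, value in op["value"]:
                if f == "append":
                    failed_appends[(key, value)] = op["index"]

    # Scan successful reads for elements that only a failed op could have written.
    found = []
    for op in history:
        if op["type"] == "ok":
            for f, key, value in op["value"]:
                if f == "r":
                    for element in value or []:
                        if (key, element) in failed_appends:
                            found.append(
                                (failed_appends[(key, element)], op["index"], key, element)
                            )
    return found


print(aborted_reads(history))
```

If :fail really means the write did not happen, this list should always be empty; a non-empty result means either an aborted read or an error that was wrongly classified as definite.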

      This problem occurs with MongoDB 4.4.9 and Java driver 4.6.0, write concern majority, read concern snapshot/majority, and is reproducible using network partitions.

      MongoWriteConcernWithResponseException with a message containing "InterruptedDueToReplStateChange" may do the same thing, though I'm less sure whether that error should be interpreted as definite.
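The definite/indefinite distinction matters because of how a test client records the completion of each operation. In a Jepsen history, :fail asserts that the operation did not take place, while :info marks an operation whose outcome is unknown. The sketch below shows one conservative way a client could classify errors until the semantics are confirmed. The error labels and the `completion_type` function are illustrative assumptions, not the actual jepsen-io/mongodb client code, and treating both errors as indefinite is a guess rather than a confirmed server behavior.

```python
# Hypothetical error labels. Both are treated as indefinite here as a
# conservative assumption, pending confirmation of the server semantics.
INDEFINITE_ERRORS = {
    "not-primary",
    "interrupted-due-to-repl-state-change",
}


def completion_type(error):
    """Return the history completion type for an operation.

    None means the operation succeeded. An indefinite error maps to "info"
    (outcome unknown); any other error maps to "fail" (definitely did not
    happen).
    """
    if error is None:
        return "ok"
    if error in INDEFINITE_ERRORS:
        return "info"
    return "fail"


print(completion_type("not-primary"))
```

Under this classification the append of 3 would be recorded as :info, and the later read of [3] would no longer count as an aborted read.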

            Assignee:
            matthew.russotto@mongodb.com Matthew Russotto
            Reporter:
            aphyr@jepsen.io Kyle Kingsbury
