  Core Server / SERVER-46237

Extended majority downtime can cause consequences with majority read concern

    • Type: Improvement
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Repl 2020-03-09

      Sorry if the title is confusing.  Let me provide some background.

      If a majority of a replica set is down for an extended period of time, it's possible for writes to be acknowledged without ever making it to any other member, and for majority read concern (RCM) to saturate the cache overflow table, since the majority commit point cannot advance. For example, this is one of the well-known problems with PSA (Primary-Secondary-Arbiter) deployments.

      This also happens in a PSS deployment where both Secondaries are up but are in initial sync, oplog recovery, or lagging significantly, while the Primary has been up and taking writes for an extended period. In this case the Primary will push all writes into cache overflow, which can eventually lead to stalls or an OOM kill.

      As a note: this applies primarily to systems where the following hold (a sketch of this pattern follows the list):

      • w:1 writes are done
      • secondaries are "up" but not available (i.e. lagging significantly or in recovery/startup2)
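
      A minimal sketch of that pattern, using pymongo (the connection string and the "app.events" namespace are illustrative, not from this ticket): w:1 writes are acknowledged by the Primary alone, while majority read concern forces the Primary to retain history back to the majority commit point.

      from pymongo import MongoClient
      from pymongo.read_concern import ReadConcern
      from pymongo.write_concern import WriteConcern

      client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

      # w:1 writes succeed once the Primary alone has applied them, even while
      # no Secondary is able to advance the majority commit point.
      events_w1 = client.app.get_collection("events", write_concern=WriteConcern(w=1))
      events_w1.insert_one({"msg": "acknowledged by the Primary alone"})

      # Majority read concern requires the Primary to keep history back to the
      # majority commit point; while that point is pinned, the retained history
      # is what fills the cache overflow table.
      events_rcm = client.app.get_collection("events", read_concern=ReadConcern("majority"))
      print(events_rcm.find_one())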

      There are some consequences of this

      Recovery for days

      If, for some reason, the Primary is restarted, it will need to apply the oplog from the common point in time (where both Secondaries originally became unavailable for majority commit), and it will have a lot of cache overflow data to apply.

      This means that it could take DAYS for the recovery to complete and the Primary to become available.  Further, due to SERVER-36495, if the Primary fails while recovering, it will have to go through the entire process again because we don't move the commit point forward.  We can also encounter this as a consequence of SERVER-34938.

      Data Loss

      If a Secondary manages to come online (days behind at this point) and assumes the Primary role, it may begin accepting writes. When the former Primary comes back up it will be forced into ROLLBACK, and the customer is left with three very undesirable options:
      1. Initial sync the former Primary and lose all the data that never made it to the Secondary,
      2. Let the former Primary complete recovery (days of waiting) and force an election, effectively losing the "new" data, or
      3. Force the Secondary-that-became-Primary into rollback and hope for some semblance of data continuity.

      At the moment, the most effective "workaround" for this is to set votes:0 for the "down" Secondaries.  However, in a scenario where both Secondaries are down and the Primary is in oplog apply mode, this isn't possible.
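
      While the Primary is still up and acting as Primary, that workaround can be sketched with pymongo as a forced reconfig (the host name and member indexes are assumptions about the deployment):

      from pymongo import MongoClient

      # Connect directly to the Primary; a normal replica-set connection may not
      # work while the other members are unreachable.
      client = MongoClient("mongodb://primary-host:27017/?directConnection=true")

      cfg = client.admin.command("replSetGetConfig")["config"]
      for idx in (1, 2):  # hypothetical indexes of the "down" Secondaries
          cfg["members"][idx]["votes"] = 0
          cfg["members"][idx]["priority"] = 0  # votes:0 members must also have priority:0
      cfg["version"] += 1

      # force=True because a voting majority is not available for a normal reconfig.
      client.admin.command("replSetReconfig", cfg, force=True)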

      The above breaks a basic expectation that customers have of replica sets: namely, "if it's accepting writes, then I won't lose data."  In reality, there could be significant data loss.

      We should have a mechanism where, after some pre-determined amount of time, if the Primary is still writing to cache overflow (because the Secondaries are unavailable), we either:

      • automatically set votes:0 and allow the majority commit point to move forward,
      • error out and stop all writes, or
      • add logic that alerts the user to the compromised state of their deployment and the actions they can take.
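
      Until such a mechanism exists, the alerting option can be approximated from outside the server. A minimal monitoring sketch (the threshold, poll interval, and connection string are assumptions) that compares the Primary's applied optime with the majority-committed optime reported by replSetGetStatus:

      import time

      from pymongo import MongoClient

      LAG_THRESHOLD_SECS = 600   # hypothetical alerting threshold
      POLL_INTERVAL_SECS = 30

      client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

      while True:
          optimes = client.admin.command("replSetGetStatus")["optimes"]
          applied = optimes["appliedOpTime"]["ts"].time
          majority_committed = optimes["lastCommittedOpTime"]["ts"].time
          gap = applied - majority_committed
          if gap > LAG_THRESHOLD_SECS:
              print(f"WARNING: majority commit point is {gap}s behind the applied optime; "
                    "majority reads and checkpointing may be stalled")
          time.sleep(POLL_INTERVAL_SECS)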

            Assignee: Evin Roesle
            Reporter: Shakir Sadikali
            Votes: 0
            Watchers: 19
