Core Server / SERVER-31924

OplogStones can capture record ids to truncate out of order

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Storage
    • Storage Execution
    • Fully Compatible
    • ALL
    • 0

      Capped collections in WiredTiger normally trigger deletes as inserts are performed. For performance reasons, the oplog truncates old documents in batches. This is done via a data structure known as OplogStones.

      A background thread periodically checks the oplog size and may then choose to call reclaimOplog.

      reclaimOplog calls truncate once for each OplogStone popped from the beginning of _oplogStones. Each truncate deletes records ranging from the previous stone's lastRecord (saved in _oplogStones->firstRecord) to the current stone's lastRecord.
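
      The reclaim loop described above can be sketched roughly as follows. This is a simplified model, not the server's code: RecordId as a plain integer, the Stone struct, OplogStonesModel, and reclaim() are all hypothetical stand-ins, and the model only records the (start, stop) range each truncate would receive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the server's types.
using RecordId = std::int64_t;

struct Stone {
    RecordId lastRecord;  // highest record id covered by this stone
};

struct OplogStonesModel {
    RecordId firstRecord = 0;  // where the previous truncate stopped
    std::deque<Stone> stones;  // oldest stone at the front

    // Pops up to `count` stones and returns the (start, stop) range that
    // each corresponding truncate call would cover.
    std::vector<std::pair<RecordId, RecordId>> reclaim(std::size_t count) {
        std::vector<std::pair<RecordId, RecordId>> ranges;
        for (std::size_t i = 0; i < count && !stones.empty(); ++i) {
            Stone stone = stones.front();
            stones.pop_front();
            // A truncate only makes sense when start <= stop.
            assert(firstRecord <= stone.lastRecord);
            ranges.emplace_back(firstRecord, stone.lastRecord);
            firstRecord = stone.lastRecord;  // the next truncate starts here
        }
        return ranges;
    }
};
```

      In this model, starting from firstRecord = 0 with stones ending at 10, 20 and 30, reclaim(2) yields the ranges (0, 10) and (10, 20) and leaves firstRecord at 20.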

      The invariant for this to work is that the lastRecord in consecutive stones must be increasing. As inserts to the oplog commit, their recordId will increase the lastRecord "if applicable".

      Why "if applicable"? With document-level locking storage engines, transactions can commit out of timestamp order. Since these RecordIds are the timestamp values in disguise, the OplogStones data structure has to deal with record ids arriving out of order.
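
      One way to picture the "if applicable" rule is a monotonic update: a commit only moves lastRecord forward and never backward. The sketch below assumes that reading; Stone and recordCommit() are hypothetical names, not the server's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for the server's types.
using RecordId = std::int64_t;

struct Stone {
    RecordId lastRecord;  // highest record id covered by this stone
};

// Applies a committed insert to the newest stone. A record id that arrives
// late (smaller than the current lastRecord) leaves the stone unchanged.
inline void recordCommit(Stone& stone, RecordId committed) {
    stone.lastRecord = std::max(stone.lastRecord, committed);
    // Postcondition: the stone always covers the id that just committed.
    assert(stone.lastRecord >= committed);
}
```

      For example, if ids 5, 8 and 6 commit in that order, lastRecord goes from 5 to 8 and then stays at 8.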

      The one flaw in that logic is that if !_stones.empty() evaluates to false (i.e., all the existing stones have been purged), we unconditionally create a new stone. This stone will have the recordId of the insert that just committed. Because there are no stones to compare against, the code never validates that the new stone would be consumed in a valid call to truncate.

      The corollary logic (for demonstration, not the required solution) would be to then check against firstRecord:

      if (_stones.empty() && lastRecord < firstRecord) {
        return;
      }
      

      This would protect the code from attempting a truncation where "start > stop".
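
      To make the failure mode concrete, here is a small model of the empty-stones path with and without the demonstration check. It is a sketch under assumptions: StonesModel, createStoneIfEmpty() and nextTruncateIsInverted() are hypothetical names, and the guard flag exists only so both behaviours can be compared side by side.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical, simplified stand-ins for the server's types.
using RecordId = std::int64_t;

struct Stone {
    RecordId lastRecord;  // highest record id covered by this stone
};

struct StonesModel {
    RecordId firstRecord = 0;  // where the previous truncate stopped
    std::deque<Stone> stones;  // oldest stone at the front

    // Models the path taken when all existing stones have been purged.
    // Returns true if a new stone was created for `lastRecord`.
    bool createStoneIfEmpty(RecordId lastRecord, bool guard) {
        if (!stones.empty()) {
            return false;  // not the empty-stones path
        }
        if (guard && lastRecord < firstRecord) {
            return false;  // late commit below firstRecord: skip it
        }
        // With the guard on, a new stone cannot begin an inverted range.
        assert(!guard || lastRecord >= firstRecord);
        stones.push_back(Stone{lastRecord});
        return true;
    }

    // True if truncating the front stone would have start > stop.
    bool nextTruncateIsInverted() const {
        return !stones.empty() && firstRecord > stones.front().lastRecord;
    }
};
```

      With firstRecord at 100 and an out-of-order commit of id 90, the unguarded call creates a stone whose truncate range would be inverted, while the guarded call creates no stone at all.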

      One idea is to never pass in the start cursor to WiredTiger. Always let WiredTiger handle positioning the cursor to the beginning of the oplog (oldest record) to start truncating from.
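
      The effect of that idea can be illustrated with an ordered map standing in for the oplog table. This is only a model: OplogModel and truncateFromOldest() are hypothetical, and the std::map is not how WiredTiger stores data. The point is that with no caller-supplied start, a stale stop position degrades into a no-op rather than an inverted range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical stand-in: an ordered map keyed by record id models the oplog.
using RecordId = std::int64_t;
using OplogModel = std::map<RecordId, std::string>;

// Removes every record from the oldest one up to and including `stop`.
// The caller supplies no start position, so "start > stop" cannot occur.
// Returns the number of records removed.
inline std::size_t truncateFromOldest(OplogModel& oplog, RecordId stop) {
    auto end = oplog.upper_bound(stop);  // first record with id > stop
    auto removed = static_cast<std::size_t>(std::distance(oplog.begin(), end));
    oplog.erase(oplog.begin(), end);
    // Postcondition: nothing at or below `stop` remains.
    assert(oplog.empty() || oplog.begin()->first > stop);
    return removed;
}
```

      In this model, truncating ids {1, 2, 3, 5} up to 3 removes three records, and a later call with the stale stop value 2 removes nothing.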

            Assignee:
            backlog-server-execution [DO NOT USE] Backlog - Storage Execution Team
            Reporter:
            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)
            Votes:
            0
            Watchers:
            13
