Core Server / SERVER-90120

RecordIds can be reused

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • ALL

      Steady state replication case
      Suppose the following sequence occurs on the primary, which assigns recordIds as inserts come in:

      • Insert {_id: 1}. Oplog entry: {op: "i", _id: 1, rid: 1}
      • Delete {_id: 1}. Oplog entry: {op: "d", _id: 1, rid: 1}
      • Kill and restart the primary
      • Insert {_id: 2}. The primary checks on disk for the highest recordId; since no documents exist, the highest is 0, so it reuses recordId(1), creating a new oplog entry: {op: "i", _id: 2, rid: 1}. (See the sketch after this list.)
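
      To make that re-seeding concrete, here is a minimal C++ sketch of the behaviour described above. The RecordStore struct, its onDisk map, and its constructor are illustrative stand-ins, not actual server code; the only property they model is that the in-memory recordId counter is rebuilt at startup from the highest recordId currently on disk.

      // Hypothetical model only: not MongoDB's RecordStore implementation.
      #include <cstdint>
      #include <iostream>
      #include <map>
      #include <string>
      #include <utility>

      struct RecordStore {
          std::map<int64_t, std::string> onDisk;  // recordId -> document (survives restart)
          int64_t nextRid;                        // in-memory counter (lost on restart)

          // On startup, the counter is seeded from the highest recordId on disk.
          explicit RecordStore(std::map<int64_t, std::string> disk = {})
              : onDisk(std::move(disk)),
                nextRid(onDisk.empty() ? 1 : onDisk.rbegin()->first + 1) {}

          int64_t insert(const std::string& doc) {
              int64_t rid = nextRid++;
              onDisk[rid] = doc;
              return rid;
          }

          void remove(int64_t rid) { onDisk.erase(rid); }
      };

      int main() {
          RecordStore primary;
          int64_t ridA = primary.insert("{_id: 1}");    // assigned recordId(1)
          primary.remove(ridA);                         // collection is now empty

          // Kill and restart: only the on-disk table survives, and it is empty,
          // so the counter is re-seeded as if recordId(1) had never been used.
          RecordStore restarted(primary.onDisk);
          int64_t ridB = restarted.insert("{_id: 2}");  // assigned recordId(1) again

          std::cout << "{_id: 1} used rid " << ridA
                    << ", {_id: 2} after restart used rid " << ridB << "\n";
      }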

      Now, if the secondary tries to apply these entries as a part of a batch, it may see in a batch:

      [
        {op: "i", _id: 1, rid: 1},
        {op: "d", _id: 1, rid 1},
        {op: "i", _id: 2, rid: 1},
      ]
      

      The secondary assigns oplog entries to the applier threads based on the hash of the _id. Therefore it's possible that:
      Applier thread 1 gets:

      [
        {op: "i", _id: 1, rid: 1},
        {op: "d", _id: 1, rid 1}
      ]
      

      Applier thread 2 gets:

      [
        {op: "i", _id: 2, rid: 1}
      ]
      

      As a result, it's possible for the threads to interleave in a way that leaves us with data corruption (for example, if applier thread 1 deletes the document at recordId(1) after applier thread 2 has inserted {_id: 2} there).
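
      A minimal C++ sketch of this failure mode follows. The OplogEntry struct, the two-thread split, and the explicit interleaving at the end are illustrative assumptions rather than the server's actual oplog applier; they only demonstrate that hash-by-_id partitioning imposes no ordering between the two inserts and the delete that all target recordId(1).

      // Hypothetical model only: not the server's oplog applier.
      #include <cstdint>
      #include <functional>
      #include <iostream>
      #include <map>
      #include <vector>

      struct OplogEntry {
          char op;      // 'i' = insert, 'd' = delete
          int id;       // the document's _id
          int64_t rid;  // the recordId the operation targets
      };

      int main() {
          std::vector<OplogEntry> batch = {
              {'i', 1, 1},  // {op: "i", _id: 1, rid: 1}
              {'d', 1, 1},  // {op: "d", _id: 1, rid: 1}
              {'i', 2, 1},  // {op: "i", _id: 2, rid: 1}  <- reused recordId
          };

          // Entries are distributed to applier threads by hashing the _id, which
          // keeps all ops for one _id in order but gives no ordering guarantee
          // between ops for different _ids that happen to share a recordId.
          const std::size_t kThreads = 2;
          std::vector<std::vector<OplogEntry>> perThread(kThreads);
          for (const OplogEntry& e : batch)
              perThread[std::hash<int>{}(e.id) % kThreads].push_back(e);
          for (std::size_t t = 0; t < kThreads; ++t)
              std::cout << "thread " << t << " got " << perThread[t].size() << " op(s)\n";

          // One interleaving the threads are free to produce: the thread holding
          // {_id: 2} runs its insert between the other thread's insert and delete.
          std::map<int64_t, int> recordStore;  // recordId -> _id stored there
          auto apply = [&](const OplogEntry& e) {
              if (e.op == 'i')
                  recordStore[e.rid] = e.id;
              else
                  recordStore.erase(e.rid);    // deletes whatever now lives at rid
          };
          apply({'i', 1, 1});  // thread 1: insert {_id: 1} at recordId(1)
          apply({'i', 2, 1});  // thread 2: insert {_id: 2}, overwriting recordId(1)
          apply({'d', 1, 1});  // thread 1: delete recordId(1), removing {_id: 2}

          std::cout << "documents left after the batch: " << recordStore.size() << "\n";
          // Prints 0, but the correct end state contains {_id: 2}: data corruption.
      }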

      Initial sync case
      The reuse of recordIds due to restart is problematic even when the writes don't appear in the same batch.
      Let's say we have a primary -> secondary -> initial syncing node chain.
      The primary generates oplog entries:

      [
        ts: 1  -> {op: "i", _id: 1, rid: 1},
        ts: 2  -> {op: "d", _id: 1, rid 1},
        ... <arbitrary number of oplog entries>
        // recordId reuse due to restart of primary:
        ts: 10 -> {op: "i", _id: 2, rid: 1},
      ]
      

      Initial sync starts at ts: 1. However, by the time collection cloning actually starts, the collection only contains the insert from ts: 10, i.e. the document {_id: 2} with recordId(1).

      After the collection cloning phase of initial sync has completed, we replay oplog entries. But the oplog entry at ts: 1 also writes to recordId(1), although for a different document! Then, on encountering the delete at ts: 2, we delete the {_id: 1} document at recordId(1). Note that because the insert at ts: 1 overwrote the entry in the "recordId -> document" B-tree, the index entries for the document {_id: 2} still exist; we never took any extra steps to delete them. So we are left with a dangling index entry that points to recordId(1), which no longer exists since it was deleted at ts: 2.

      When we finally get to ts: 10, we try to insert the document {_id: 2} again. Unfortunately, since that entry already exists in the _id index, we hit a duplicate key error and carry on without doing any more work; in other words, we never insert the document into the collection.
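
      The whole replay can be condensed into a toy C++ model. The Collection struct and its applyInsert/applyDelete helpers are hypothetical simplifications of the record store and the _id index, but they reproduce the end state described above: an empty collection plus a dangling _id index entry for {_id: 2}.

      // Hypothetical model only: not the server's initial sync code.
      #include <cstdint>
      #include <iostream>
      #include <map>
      #include <string>

      struct Collection {
          std::map<int64_t, std::string> records;  // recordId -> document
          std::map<int, int64_t> idIndex;          // _id      -> recordId

          // Oplog replay writes the document at the rid from the oplog entry,
          // but skips the whole op on a duplicate key in the _id index.
          void applyInsert(int id, int64_t rid, const std::string& doc) {
              if (idIndex.count(id)) {
                  std::cout << "insert of {_id: " << id << "} hit DuplicateKey, skipped\n";
                  return;              // the document never reaches the records table
              }
              records[rid] = doc;      // overwrites whatever was stored at rid
              idIndex[id] = rid;
          }

          void applyDelete(int id, int64_t rid) {
              records.erase(rid);      // removes the record at rid...
              idIndex.erase(id);       // ...and only this _id's index entry
          }
      };

      int main() {
          // State after collection cloning: only {_id: 2} at recordId(1) exists.
          Collection coll;
          coll.records[1] = "{_id: 2}";
          coll.idIndex[2] = 1;

          // Replay the oplog from ts: 1.
          coll.applyInsert(1, 1, "{_id: 1}");  // ts 1: clobbers recordId(1)
          coll.applyDelete(1, 1);              // ts 2: recordId(1) gone; the index
                                               //       entry {_id: 2} -> rid 1 dangles
          coll.applyInsert(2, 1, "{_id: 2}");  // ts 10: DuplicateKey, skipped

          std::cout << "records: " << coll.records.size()
                    << ", _id index entries: " << coll.idIndex.size() << "\n";
          // Prints "records: 0, _id index entries: 1": the collection has lost
          // {_id: 2} and keeps a dangling index entry pointing at recordId(1).
      }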

      Solutions
      See comments

            Assignee: Unassigned
            Reporter: Vishnu Kaushik (vishnu.kaushik@mongodb.com)
            Votes: 0
            Watchers: 16
