- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Assigned Teams: Storage Execution
- Operating System: ALL
Steady state replication case
Suppose the following sequence occurs on the primary, which assigns recordIds as inserts come in:
- Insert {_id: 1}. Oplog entry: {op: "i", _id: 1, rid: 1}
- Delete {_id: 1}. Oplog entry: {op: "d", _id: 1, rid: 1}
- Kill and restart the primary
- Insert {_id: 2}. The primary checks on disk for the highest recordId, and that is currently 0, as no documents exist. Therefore it uses recordId(1), creating a new oplog entry: {op: "i", _id: 2, rid: 1}.
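To make the reuse concrete, here is a minimal Python sketch of the sequence above (illustrative only; the RecordStore and Primary classes are hypothetical stand-ins, not actual server code) showing a primary that seeds its next recordId from the highest recordId found on disk at startup:
{code:python}
# Hypothetical simulation, not server code: a primary whose recordId counter
# is seeded from the highest recordId currently on disk at startup.

class RecordStore:
    def __init__(self):
        self.records = {}  # recordId -> document

    def highest_record_id(self):
        return max(self.records, default=0)


class Primary:
    def __init__(self, store):
        self.store = store
        # In-memory counter; lost on restart and re-seeded from disk.
        self.next_record_id = store.highest_record_id() + 1

    def insert(self, doc):
        rid = self.next_record_id
        self.next_record_id += 1
        self.store.records[rid] = doc
        return {"op": "i", "_id": doc["_id"], "rid": rid}

    def delete(self, _id):
        rid = next(r for r, d in self.store.records.items() if d["_id"] == _id)
        del self.store.records[rid]
        return {"op": "d", "_id": _id, "rid": rid}


store = RecordStore()
primary = Primary(store)
print(primary.insert({"_id": 1}))  # {'op': 'i', '_id': 1, 'rid': 1}
print(primary.delete(1))           # {'op': 'd', '_id': 1, 'rid': 1}

# Kill and restart: the counter is re-seeded from disk, where no documents
# remain, so the highest recordId is 0 and recordId(1) gets handed out again.
primary = Primary(store)
print(primary.insert({"_id": 2}))  # {'op': 'i', '_id': 2, 'rid': 1}  <- reuse
{code}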
Now, if the secondary applies these entries as part of a single batch, it may see:
[
  {op: "i", _id: 1, rid: 1},
  {op: "d", _id: 1, rid: 1},
  {op: "i", _id: 2, rid: 1},
]
The secondary assigns oplog entries to the applier threads based on the hash of the _id. Therefore it's possible that:
Applier thread 1 gets:
[
  {op: "i", _id: 1, rid: 1},
  {op: "d", _id: 1, rid: 1},
]
Applier thread 2 gets:
[
  {op: "i", _id: 2, rid: 1},
]
As a result, it's possible for the threads to interleave in a way that leaves us with data corruption: if applier thread 1 deletes the document at recordId(1) after applier thread 2 has inserted {_id: 2} there, the secondary loses a document it should have.
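Below is a minimal sketch of this hash-based distribution, assuming a simple hash-mod-N partitioning over two applier threads (the real server's hashing and batching logic is more involved):
{code:python}
# Hypothetical simulation, not server code: distribute a batch of oplog
# entries to applier threads by hashing each entry's _id.

batch = [
    {"op": "i", "_id": 1, "rid": 1},
    {"op": "d", "_id": 1, "rid": 1},
    {"op": "i", "_id": 2, "rid": 1},
]

NUM_APPLIER_THREADS = 2
buckets = {i: [] for i in range(NUM_APPLIER_THREADS)}
for entry in batch:
    # Entries for the same _id always land on the same thread, but entries
    # for different _ids that share a recordId can land on different threads.
    buckets[hash(entry["_id"]) % NUM_APPLIER_THREADS].append(entry)

for thread_id, ops in buckets.items():
    print(f"applier thread {thread_id}: {ops}")

# One possible interleaving once the threads run concurrently:
#   thread A: insert {_id: 1} at recordId(1)
#   thread B: insert {_id: 2} at recordId(1)   <- overwrites {_id: 1}
#   thread A: delete recordId(1)               <- also destroys {_id: 2}
# The batch is "complete", yet {_id: 2} is gone from the collection.
{code}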
Initial sync case
The reuse of recordIds due to restart is problematic even when the writes don't appear in the same batch.
Let's say we have a primary -> secondary -> initial syncing node chain.
The primary generates oplog entries:
[
  ts: 1 -> {op: "i", _id: 1, rid: 1},
  ts: 2 -> {op: "d", _id: 1, rid: 1},
  ... <arbitrary number of oplog entries>
  // recordId reuse due to restart of primary:
  ts: 10 -> {op: "i", _id: 2, rid: 1},
]
Initial sync starts at ts: 1. However, by the time collection cloning actually starts, the collection only contains the insert from ts: 10, i.e. the document {_id: 2} with recordId(1).
After the collection cloning phase of initial sync has completed, we replay oplog entries. But the oplog entry at ts: 1 also writes to recordId(1), although for a different document! And then later, on encountering the delete at ts: 2, we delete the {_id: 1} document at recordId(1). Note that because the insert at ts: 1 overwrote the entry in the "recordId -> document" B-tree, the index entries for the document {_id: 2} still exist; we never took any extra steps to delete them. So we are left with a dangling index entry that points to a non-existent recordId(1), as that record was deleted at ts: 2.
When we finally get to ts: 10, we try to insert the document {_id: 2} again. Unfortunately, since that entry already exists in the _id index, we see a duplicate key error and carry on without doing any more work. In other words, we never insert the document into the collection.
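The following sketch simulates this replay with two plain dictionaries standing in for the "recordId -> document" B-tree and the _id index; the duplicate-key-is-benign behavior described above is modeled by the early return in the hypothetical apply_insert helper (a simplification of the real code path):
{code:python}
# Hypothetical simulation, not server code: replaying the oplog after
# collection cloning, with dicts standing in for the on-disk structures.

records = {}   # the "recordId -> document" B-tree
id_index = {}  # the _id index: _id -> recordId

def apply_insert(doc, rid):
    if doc["_id"] in id_index:
        # Duplicate key error during oplog application: treated as benign,
        # so we return without inserting anything.
        print(f"duplicate key on _id={doc['_id']}; skipping insert")
        return
    records[rid] = doc           # silently overwrites any existing recordId
    id_index[doc["_id"]] = rid

def apply_delete(_id, rid):
    records.pop(rid, None)
    id_index.pop(_id, None)      # only this _id's index entry is removed

# State after collection cloning: only the ts: 10 document was seen.
apply_insert({"_id": 2}, rid=1)

# Oplog replay starting from ts: 1.
apply_insert({"_id": 1}, rid=1)  # ts: 1  -- overwrites recordId(1)
apply_delete(1, rid=1)           # ts: 2  -- deletes recordId(1)
apply_insert({"_id": 2}, rid=1)  # ts: 10 -- duplicate key, skipped

print(records)   # {}      -- no documents in the collection...
print(id_index)  # {2: 1}  -- ...but a dangling entry points at recordId(1)
{code}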
Solutions
See comments
- related to SERVER-88309: Prevent user from inserting doc via applyOps with recordId that already exists (Open)