Core Server / SERVER-90120

RecordIds can be reused

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • ALL

      Steady state replication case
      Suppose the following sequence occurs on the primary, which assigns recordIds as inserts come in:

      • Insert {_id: 1}. Oplog entry: {op: "i", _id: 1, rid: 1}
      • Delete {_id: 1}. Oplog entry: {op: "d", _id: 1, rid: 1}
      • Kill and restart the primary
      • Insert {_id: 2}. The primary checks on disk for the highest recordId; since no documents exist, that is 0. Therefore it uses recordId(1), creating a new oplog entry: {op: "i", _id: 2, rid: 1}.
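
      To make the mechanism concrete, here is a minimal Python sketch of the sequence above (a simplified model, not the actual server code; the class and method names are invented for illustration). The key point is that the primary seeds its recordId counter from the highest recordId currently on disk, so a delete followed by a restart lets an old recordId be handed out again:

      class PrimaryRecordStore:
          def __init__(self):
              self.docs = {}      # recordId -> document
              self.next_rid = 1   # in-memory counter, lost on restart

          def insert(self, doc):
              rid = self.next_rid
              self.next_rid += 1
              self.docs[rid] = doc
              return {"op": "i", "_id": doc["_id"], "rid": rid}

          def delete(self, _id):
              rid = next(r for r, d in self.docs.items() if d["_id"] == _id)
              del self.docs[rid]
              return {"op": "d", "_id": _id, "rid": rid}

          def restart(self):
              # Re-seed from the highest recordId on disk; with no documents
              # left, the next insert reuses recordId(1).
              self.next_rid = max(self.docs, default=0) + 1

      primary = PrimaryRecordStore()
      oplog = [primary.insert({"_id": 1}),      # {op: "i", _id: 1, rid: 1}
               primary.delete(1)]               # {op: "d", _id: 1, rid: 1}
      primary.restart()
      oplog.append(primary.insert({"_id": 2}))  # {op: "i", _id: 2, rid: 1} -- reused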

      Now, if the secondary applies these entries as part of a single batch, it may see:

      [
        {op: "i", _id: 1, rid: 1},
        {op: "d", _id: 1, rid 1},
        {op: "i", _id: 2, rid: 1},
      ]
      

      The secondary assigns oplog entries to the applier threads based on the hash of the _id. Therefore it's possible that:
      Applier thread 1 gets:

      [
        {op: "i", _id: 1, rid: 1},
        {op: "d", _id: 1, rid 1}
      ]
      

      Applier thread 2 gets:

      [
        {op: "i", _id: 2, rid: 1}
      ]
      

      As a result, the threads can interleave in a way that leaves us with corrupted data: if applier thread 1 deletes the document at recordId(1) after applier thread 2 has inserted {_id: 2} there, the newly inserted document is lost.
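
      A tiny Python sketch of this failure mode, assuming two applier threads and a plain dict standing in for the secondary's "recordId -> document" store (Python's hash() is only a stand-in for the real _id hashing; this is an illustration, not the server's apply path):

      batch = [
          {"op": "i", "_id": 1, "rid": 1},
          {"op": "d", "_id": 1, "rid": 1},
          {"op": "i", "_id": 2, "rid": 1},
      ]

      # Partition the batch across applier threads by hashing the _id; ops on
      # different _ids can land on different threads even when they target the
      # same recordId.
      NUM_THREADS = 2
      partitions = {t: [] for t in range(NUM_THREADS)}
      for entry in batch:
          partitions[hash(entry["_id"]) % NUM_THREADS].append(entry)

      # One possible interleaving on the secondary's record store:
      secondary = {}                 # recordId -> document
      secondary[1] = {"_id": 1}      # thread 1 applies the insert of {_id: 1}
      secondary[1] = {"_id": 2}      # thread 2 applies the insert of {_id: 2}
      del secondary[1]               # thread 1 applies its delete last
      print(secondary)               # {} -- the {_id: 2} document has been lost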

      Initial sync case
      The reuse of recordIds due to restart is problematic even when the writes don't appear in the same batch.
      Let's say we have a primary -> secondary -> initial syncing node chain.
      The primary generates oplog entries:

      [
        ts: 1  -> {op: "i", _id: 1, rid: 1},
        ts: 2  -> {op: "d", _id: 1, rid: 1},
        ... <arbitrary number of oplog entries>
        // recordId reuse due to restart of primary:
        ts: 10 -> {op: "i", _id: 2, rid: 1},
      ]
      

      Initial sync starts at ts: 1. However, by the time collection cloning actually starts, the collection only contains the insert from ts: 10, i.e. the document {_id: 2} with recordId(1).

      After the collection cloning phase of initial sync has completed, we replay oplog entries. But the oplog entry at ts: 1 also writes to recordId(1), although for a different document! Then, on encountering the delete at ts: 2, we delete the {_id: 1} document with recordId(1). Note that because we overwrote the entry in the "recordId -> document" B-tree, the index entries for the document {_id: 2} still exist; we never took any extra steps to delete them. So we are left with a dangling index entry pointing at the non-existent recordId(1), which was deleted at ts: 2.

      When we finally get to ts: 10, we try to insert the document {_id: 2} again. Unfortunately, since its entry already exists in the _id index, we hit a duplicate key error and carry on without doing any further work; in other words, we never insert the document into the collection.
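
      The whole sequence can be sketched in a few lines of Python, with one dict standing in for the record store and another for the _id index (again a simplified model, not the server's actual apply path):

      record_store = {1: {"_id": 2}}   # state after cloning: {_id: 2} at recordId(1)
      id_index = {2: 1}                # _id -> recordId entries built during cloning

      def apply(entry):
          if entry["op"] == "i":
              if entry["_id"] in id_index:
                  return "duplicate key error; op skipped"
              record_store[entry["rid"]] = {"_id": entry["_id"]}
              id_index[entry["_id"]] = entry["rid"]
          elif entry["op"] == "d":
              record_store.pop(entry["rid"], None)
              id_index.pop(entry["_id"], None)
          return "applied"

      apply({"op": "i", "_id": 1, "rid": 1})   # ts: 1  -- overwrites {_id: 2} at recordId(1)
      apply({"op": "d", "_id": 1, "rid": 1})   # ts: 2  -- deletes recordId(1) entirely
      apply({"op": "i", "_id": 2, "rid": 1})   # ts: 10 -- duplicate key error, never applied

      print(record_store)   # {} -- the collection has no documents
      print(id_index)       # {2: 1} -- dangling entry pointing at a deleted recordId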

      Solutions
      See comments

            Assignee: Unassigned
            Reporter: vishnu.kaushik@mongodb.com (Vishnu Kaushik)
            Votes: 0
            Watchers: 15