- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
- Replication
- Minor Change
- ALL
- v8.0
- Repl 2024-07-08
If a node that was repaired via --repair rejoins a replica set it was previously a part of (as in this test), it will go through initial sync. We run repair in cases of potential data loss or corruption, so we don't want the repaired node to become primary and potentially spread data loss across the replica set.
When the repaired node goes through initial sync, its stable timestamp is already set to a non-zero value left over from before the repair, which initial sync does not expect. Initial sync never sets the stable timestamp itself, but it does advance the oldest timestamp as it applies oplog entries during its oplog application phase. The oldest timestamp is therefore likely to advance past the stable timestamp, and we can then take a stable checkpoint in that state when initial sync completes, which leads to unsafe and undefined behavior in the storage engine (see SERVER-84706 for more details).
EDIT: In practice this causes the server to hit an invariant when it attempts to set the oldest timestamp past the stable timestamp, since WiredTiger does not allow that.
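A minimal standalone model of that ordering rule (not the actual server or WiredTiger code; TimestampModel, setStable, and setOldest are illustrative names) showing how a stale non-zero stable timestamp plus an advancing oldest timestamp trips the invariant:

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Models the WiredTiger rule that the oldest timestamp may never be
// set ahead of the stable timestamp. All names here are illustrative.
struct TimestampModel {
    uint64_t stable = 0;  // 0 == null/unset, as on a freshly wiped node
    uint64_t oldest = 0;

    void setStable(uint64_t ts) { stable = ts; }

    void setOldest(uint64_t ts) {
        // WiredTiger rejects oldest > stable; the server turns that
        // rejection into a fatal invariant. This assert stands in for it.
        assert((stable == 0 || ts <= stable) &&
               "oldest timestamp may not pass the stable timestamp");
        oldest = ts;
    }
};

int main() {
    TimestampModel node;

    // A repaired node rejoins with a stale, non-zero stable timestamp
    // left over from before the repair.
    node.setStable(100);

    // Initial sync never touches the stable timestamp, but oplog
    // application keeps advancing the oldest timestamp...
    node.setOldest(90);   // fine: 90 <= 100
    node.setOldest(150);  // aborts: oldest would pass stable

    std::cout << "unreachable in the failing scenario\n";
}
```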
Since the repaired node is going to go through initial sync anyway, we should require that the node be wiped beforehand. One option is to require that the stable timestamp be null for all nodes before starting initial sync.
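A hypothetical sketch of that precondition (not the actual patch; getStableTimestamp and assertSafeToStartInitialSync are made-up names standing in for a storage-engine query and an initial-sync entry point):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for a storage-engine query; returns 0 when the
// stable timestamp is null (e.g. on a freshly wiped node).
static uint64_t getStableTimestamp() {
    return 0;  // a wiped node carries no stable timestamp
}

// Hypothetical precondition per the proposal above: reject a
// repaired-but-unwiped node before initial sync can take a stable
// checkpoint in an inconsistent state.
static void assertSafeToStartInitialSync() {
    assert(getStableTimestamp() == 0 &&
           "stable timestamp must be null before initial sync; "
           "wipe the dbpath first");
}

int main() {
    assertSafeToStartInitialSync();  // passes on a wiped node
    return 0;
}
```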
is related to:
- SERVER-84706 Investigate if setting the oldest timestamp greater than the stable timestamp can be avoided (Closed)
- SERVER-35731 Prevent a repaired node from re-joining a replica set (Closed)
- SERVER-85722 Investigate assumption that mongo layer always tells storage engine when to take a checkpoint (Closed)