Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- repl-shortlist

Assigned Teams: Replication
Operating System: ALL
Case:
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

Today, per our documentation, the settings.replicaSetId field is supposed to always be present in a replica set, but it is defined in ReplSetConfigBase as a boost::optional<OID>. This is because users do not specify this field themselves; instead, we generate it when replSetInitiate is called and persist it in the replica set config document at that time. From that point on, we never allow the replicaSetId to be changed via the replSetReconfig command.
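The lifecycle described above can be modeled with a small Python sketch. This is not server code: `ReplSetConfig`, `initiate`, and `reconfig` are hypothetical stand-ins, a random hex string stands in for an OID, and the reject-on-mismatch behavior is an assumption about how "never allow the replicaSetId to be changed" might be enforced.

```python
from dataclasses import dataclass, replace
from typing import Optional
import uuid


@dataclass(frozen=True)
class ReplSetConfig:
    """Toy stand-in for the replica set config document (hypothetical type)."""
    version: int
    replica_set_id: Optional[str] = None  # models boost::optional<OID>


def initiate(user_config: ReplSetConfig) -> ReplSetConfig:
    # The user never supplies the ID; it is generated once at initiate time.
    # A random hex string is used here in place of a real OID.
    return replace(user_config, replica_set_id=uuid.uuid4().hex)


def reconfig(current: ReplSetConfig, proposed: ReplSetConfig) -> ReplSetConfig:
    # Assumed enforcement: reject any proposed config whose ID differs.
    # A config that has already lost its ID passes this check and stays without one.
    if proposed.replica_set_id != current.replica_set_id:
        raise ValueError("replicaSetId cannot be changed by reconfig")
    return replace(proposed, version=current.version + 1)
```

Because the field is optional in this model, nothing stops a config with a missing ID from passing through `reconfig` unchanged, which mirrors how a set that lost the field would keep lacking it.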

It turns out that a replica set can lack this field after certain procedures, e.g. backup + PIT restore.

Initially, my thinking was that backup/PIT restore should be amended to preserve the replicaSetId. However, matthew.russotto@mongodb.com pointed out that conceptually it doesn't necessarily make sense to preserve the replicaSetId across a backup/restore, since the restored cluster is a new cluster and the replicaSetId is supposed to be unique per cluster. That said, reusing the replicaSetId in this case may not be too problematic if we have to, since we don't think the scenario that replicaSetId is meant to protect against (SERVER-22287) is likely to occur in Atlas.

In doing this ticket we need to answer the following questions:

- What is the best way to prevent this issue going forward? Is it a server change to repopulate the replicaSetId after a PIT restore has happened? An MMS change to preserve the existing replicaSetId or generate a new one when doing PIT restore? Something else?
- What is the best way to resolve this issue in existing affected clusters? For example, whenever a reconfig is performed, could we check at runtime whether the cluster lacks a replicaSetId and add it back then? Note that to allow a graceful transition period, this might require e.g. relaxing the checks we have today, which require both nodes to have the same replicaSetId (or both nodes to lack a replicaSetId) for messages between them to be accepted.
- Should we add any new validation around the presence of replicaSetId?
- Can we add warning logs about this situation?
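To make the second question concrete, here is a minimal Python sketch contrasting the check as described above with one possible relaxed form, plus a reconfig-time backfill. All function names are hypothetical, IDs are plain strings rather than OIDs, and the relaxed rule is only one option under discussion, not a decided design.

```python
from typing import Callable, Optional


def ids_compatible_strict(local: Optional[str], remote: Optional[str]) -> bool:
    # Rule as described in this ticket: both nodes hold the same ID,
    # or both nodes lack one.
    return local == remote


def ids_compatible_relaxed(local: Optional[str], remote: Optional[str]) -> bool:
    # Hypothetical transition rule: a missing ID on either side is tolerated,
    # but two different IDs still conflict.
    return local is None or remote is None or local == remote


def backfill_on_reconfig(current_id: Optional[str],
                         new_id_factory: Callable[[], str]) -> str:
    # Hypothetical repair step: keep an existing ID, generate one only if absent.
    return current_id if current_id is not None else new_id_factory()
```

Under the strict rule, a node that has had its ID backfilled would stop accepting messages from a node that has not yet received the new config; the relaxed rule is what would let the two coexist during the transition.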

Related to: SERVER-22287 Merging replica sets with replication protocol version 1 may result in two primaries (Closed)

Assignee: Unassigned
Reporter: Kaitlin Mahar
Participants: Kaitlin Mahar
Votes: 0
Watchers: 12

Created: Aug 21 2024 08:28:25 PM UTC
Updated: Jan 16 2025 10:31:45 PM UTC
