Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- car-investigation

Assigned Teams:

Catalog and Routing
Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
CAR Team 2024-03-04, CAR Team 2024-03-18

Suppose a shard is attempting to commit a DDL operation. Before committing, it may refresh its data from the config shard to check whether a previous primary already performed the operation on the config shard and then failed.

This behavior is problematic if we rely on the gossiped Vector Clock: the refresh may return stale data, so the check above can mistakenly conclude that the operation was never committed, and the same operation is then performed twice.

This can occur in the following scenario:

Shard S1 has three nodes.
Config Shard CS has three nodes.
S1's Primary commits the DDL operation on CS with majority writeConcern and performs a stepdown before it persists the new vector clock.
S1's newly elected primary still has the previous Vector Clock.
S1's new primary refreshes its catalog metadata by contacting a stale CS node that is still observing the old Vector Clock and is at a stale majority timestamp. This can happen because we do not have a PrimaryOnly readPreference for this read.
S1's new primary fails the check, since from its perspective we're still in the old pre-commit world.
S1's new primary then re-commits the DDL operation.
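
The scenario above can be sketched as a small Python model. This is an illustrative simplification, not server code: `ConfigNode`, `ShardPrimary`, `commit_ddl`, and `run` are hypothetical names, timestamps are plain integers, and a node that is behind the requested cluster time is modeled as raising an error (a real node would typically wait until it catches up). The sketch assumes the read on the config shard is gated only by the reader's own Vector Clock.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigNode:
    """Hypothetical config shard node with its own majority-committed view."""
    majority_ts: int
    committed_ops: set = field(default_factory=set)

    def read_committed_ops(self, after_cluster_time: int) -> set:
        # A causally consistent read cannot be served from an older view.
        # Simplification: raise instead of waiting for the node to catch up.
        if self.majority_ts < after_cluster_time:
            raise TimeoutError("node is behind the requested cluster time")
        return set(self.committed_ops)


@dataclass
class ShardPrimary:
    """Hypothetical shard primary; vector_clock is the config time it knows."""
    vector_clock: int

    def commit_ddl(self, op_id, read_node, cs_primary, commit_log):
        # Check step: refresh from the config shard, gated by our own clock.
        if op_id in read_node.read_committed_ops(self.vector_clock):
            return  # a previous primary already committed this operation
        # Commit step on the config shard primary.
        cs_primary.majority_ts += 1
        cs_primary.committed_ops.add(op_id)
        commit_log.append(op_id)
        self.vector_clock = cs_primary.majority_ts


def run(new_primary_clock: int) -> list:
    cs_primary = ConfigNode(majority_ts=10)
    cs_stale = ConfigNode(majority_ts=10)  # never observes the commit
    commit_log = []

    # The old S1 primary commits on CS, then steps down.
    ShardPrimary(vector_clock=10).commit_ddl("ddl-1", cs_primary, cs_primary, commit_log)

    # The new S1 primary retries, first reading from the stale CS node.
    new_primary = ShardPrimary(vector_clock=new_primary_clock)
    try:
        new_primary.commit_ddl("ddl-1", cs_stale, cs_primary, commit_log)
    except TimeoutError:
        # The stale node cannot serve the read, so fall back to an up-to-date node.
        new_primary.commit_ddl("ddl-1", cs_primary, cs_primary, commit_log)
    return commit_log


print(run(new_primary_clock=10))  # old clock: the operation is committed twice
print(run(new_primary_clock=11))  # clock persisted before stepdown: committed once
```

In this model, the double commit disappears once the new primary knows the post-commit config time, because the stale node can no longer satisfy the read. Forcing the read to the CS primary, or adding explicit replay protection to the commit itself, would be other ways to close the gap.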

is related to

SERVER-87977 Add the explicit replay protection to the commit phase of the sharding ConvertToCappedCoordinator

Closed

related to

SERVER-85534 Checkpoint the vector clock after committing shard collection

Closed

Assignee: Paolo Polato

Reporter: Jordi Olivares Provencio

Participants: Jordi Olivares Provencio, Paolo Polato

Votes: 0

Watchers: 9

Created: Feb 13 2024 04:02:24 PM UTC

Updated: Mar 14 2024 03:05:24 PM UTC

Resolved: Mar 14 2024 03:05:23 PM UTC
