- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- ALL
When fsyncLock is invoked on a router, it contacts the primary of every shard and makes sure there are no ongoing DDL operations, to avoid inconsistencies during backups. This protocol is currently not resilient to elections.
Example of a breaking scenario
Consider a shard with 3 nodes: n0, n1 and n2. The primary was n0 but has just switched to n1.
- The router still believes n0 is primary and asks it to acquire the fsync lock
- Since the command is allowed on secondaries, n0 acquires the lock and returns successfully
- A DDL starts on n1, since the coordinator document can still be majority committed by replicating to n2 alone
- The backup starts from n1 or n2, neither of which holds the fsync lock, so it can run concurrently with the DDL
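
The scenario above can be sketched as a small, self-contained Python model. This is only an illustration and not server code: the `Node` class, `can_majority_commit` and `run_scenario` are hypothetical names, and the model assumes (as a simplification) that a node holding the fsync lock cannot acknowledge new writes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified model of a replica set member."""
    name: str
    is_primary: bool = False
    fsync_locked: bool = False

    def fsync_lock(self) -> bool:
        # As described in the report, the command is allowed on secondaries,
        # so the node takes the lock without checking its own role.
        self.fsync_locked = True
        return True


def can_majority_commit(nodes) -> bool:
    # Assumption: a write can be majority committed if the primary is not
    # locked and the unlocked nodes form a strict majority of the set.
    primary = next(n for n in nodes if n.is_primary)
    if primary.fsync_locked:
        return False
    unlocked = [n for n in nodes if not n.fsync_locked]
    return len(unlocked) > len(nodes) // 2


def run_scenario():
    # n0 was primary, but an election has just moved the role to n1.
    n0, n1, n2 = Node("n0"), Node("n1", is_primary=True), Node("n2")
    shard = [n0, n1, n2]

    # The router's view is stale: it still targets n0.
    router_view_of_primary = n0
    lock_ok = router_view_of_primary.fsync_lock()

    # The DDL coordinator document can still be committed via n1 and n2.
    ddl_can_start = can_majority_commit(shard)

    # Nodes a backup could read from without seeing the fsync lock.
    backup_sources = [n.name for n in shard if not n.fsync_locked]
    return lock_ok, ddl_can_start, backup_sources


if __name__ == "__main__":
    print(run_scenario())
```

In this model the router receives a successful lock response even though neither the real primary nor the remaining secondary is locked, which is the gap the report describes.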