- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- None
- Catalog and Routing
- ALL
While a shard is being added, it may begin serving some requests, such as incoming migrations, before it officially becomes part of the cluster. Testing has shown that this can lead to inconsistencies and race conditions that are hard to detect and handle. The most common scenario is a node in the cluster targeting the shard being added while incorrectly assuming it is targeting a shard that had been removed. This can only happen if the new shard has the same host:port as a previously removed shard.
The purpose of this ticket is to:
- Identify which commands must not be served before addShard completes
- Figure out a way for a shard to reject such commands while it's not yet part of the cluster
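One possible shape for such a rejection mechanism is sketched below. It is only an illustration, not the server's implementation: `Membership`, `ShardNode`, `ShardNotInClusterError`, and the `receive_chunk` command name are all hypothetical, and the contents of `GATED_COMMANDS` are a placeholder for the list this ticket asks to identify.

```python
from enum import Enum, auto


class Membership(Enum):
    NOT_IN_CLUSTER = auto()
    ADDING = auto()
    IN_CLUSTER = auto()


class ShardNotInClusterError(Exception):
    """Raised when a gated command reaches a shard that is not yet a member."""


# Placeholder: which commands belong here is what the ticket needs to determine.
GATED_COMMANDS = frozenset({"receive_chunk"})


class ShardNode:
    """Hypothetical shard-side command dispatcher."""

    def __init__(self):
        self.membership = Membership.NOT_IN_CLUSTER

    def handle(self, command):
        # Gated commands are refused until addShard has fully completed,
        # including while the shard is in the intermediate ADDING state.
        if command in GATED_COMMANDS and self.membership is not Membership.IN_CLUSTER:
            raise ShardNotInClusterError(
                f"{command} rejected: shard is {self.membership.name}"
            )
        return "ok"
```

In this sketch the check lives on the shard itself, so it does not depend on the sender having an up-to-date view of cluster membership.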
Example occurrence of the issue, observed while testing add/remove shard with balancing running in the background:
- Balancer plans to schedule a migration for the sessions collection towards Shard A with host and port foo:123
- Shard A with host and port foo:123 gets removed
- Shard A starts being added with host and port foo:123
- Balancer issues migration
- Migration starts (note that shard A is not part of the cluster yet)
- Sessions collection is dropped as part of addShard (deleting data that were already cloned and implicitly recreating the collection with a local UUID inconsistent with the sharding catalog)
- Shard A actually becomes part of the cluster
Note that in this example the shard was re-added with the same name, which allowed the migration to proceed. In general, however, we cannot rule out that other issues arise when a shard is added under a brand new name but reuses the host and port of a previously removed shard.
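The interleaving in the example can be replayed with a toy in-memory model, which may help when reasoning about the ordering. Everything here is an assumption made for illustration: `ToyShard`, `replay`, and the `gate_until_member` flag are hypothetical, and the model only mimics the drop and implicit-recreate behaviour described in the steps.

```python
import uuid


class ToyShard:
    """Hypothetical in-memory stand-in for a shard; not MongoDB code."""

    def __init__(self, host):
        self.host = host
        self.in_cluster = False
        self.collections = {}  # name -> {"uuid": UUID, "docs": list}

    def start_migration(self, coll, catalog_uuid, gate_until_member):
        # With the gate on, a shard that is not yet a member refuses the migration.
        if gate_until_member and not self.in_cluster:
            return False
        self.collections[coll] = {"uuid": catalog_uuid, "docs": []}
        return True

    def clone_batch(self, coll, docs):
        # Inserting into a missing collection implicitly recreates it with a
        # fresh local UUID, which is the inconsistency described in the ticket.
        entry = self.collections.setdefault(coll, {"uuid": uuid.uuid4(), "docs": []})
        entry["docs"].extend(docs)

    def drop_collection(self, coll):
        self.collections.pop(coll, None)


def replay(gate_until_member):
    catalog_uuid = uuid.uuid4()
    shard = ToyShard("foo:123")  # shard A is being re-added, not a member yet

    started = shard.start_migration("sessions", catalog_uuid, gate_until_member)
    if started:
        shard.clone_batch("sessions", [1, 2])
    shard.drop_collection("sessions")       # addShard drops the sessions collection
    if started:
        shard.clone_batch("sessions", [3])  # migration keeps cloning after the drop
    shard.in_cluster = True                 # shard A actually joins the cluster

    entry = shard.collections.get("sessions")
    return {
        "migration_started": started,
        "docs": entry["docs"] if entry else None,
        "uuid_matches_catalog": entry["uuid"] == catalog_uuid if entry else None,
    }
```

Calling `replay(gate_until_member=False)` reproduces the outcome from the example (the first cloned batch is lost and the collection UUID no longer matches the catalog), while `replay(gate_until_member=True)` leaves the shard without a sessions collection because the migration never starts.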
- is duplicated by: SERVER-91706 During testing, make sure that we don't re-add the exact same shard that was previously removed (Closed)