- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 3.6.3, 3.7.3
- Component/s: Sharding
- Fully Compatible
- v4.0, v3.6
- Sharding 2018-04-09, Sharding 2018-04-23, Sharding 2018-05-07, Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18
- 73
The CatalogCache refresh methods contain logic that simply 'joins' an in-progress refresh when another thread is already refreshing the same collection. As a result, a refresh requested at time Tn can miss the changes that happened at Tn if it joins an in-progress refresh that started at Tn-1.
Here's a concrete example based on the build failure:
1. moveChunk command is called.
2. At the end of the migration, source shardA sends setShardVersion command to recipient shardB asynchronously.
3. setShardVersion command triggers a CatalogCache refresh at shardB.
4. Since step 3 is happening asynchronously, the test can proceed to dropping the collection.
5. The final step of the drop is to send setShardVersion (0, 0) to the shards.
6. When this setShardVersion arrives at shardB while the refresh from step 3 is still in progress, it simply joins that refresh and gets stale information in which the collection has not been dropped yet.
7. setShardVersion on drop fails because version (0, 0) does not match the version found by the refresh.
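The sequence above can be reproduced with a small, deterministic model. This is only an illustrative sketch in Python, not the server's C++ implementation: `JoiningCatalogCache`, `refresh_async`, `complete`, and `simulate` are hypothetical names, the `config` dict stands in for the config server, and the assumption is that a refresh reads the metadata when it starts and delivers that snapshot to every caller that joined it.

```python
from concurrent.futures import Future

DROPPED = "0|0"  # stand-in for the (0, 0) version sent on drop


class JoiningCatalogCache:
    """Toy cache (hypothetical): a refresh for a namespace that already has
    a refresh in flight joins it instead of starting a new one."""

    def __init__(self, config):
        self._config = config   # namespace -> version; models the config server
        self._in_flight = {}    # namespace -> (Future, snapshot taken at start)

    def refresh_async(self, ns):
        if ns in self._in_flight:
            # Join: the caller will receive whatever the earlier refresh read.
            return self._in_flight[ns][0]
        fut = Future()
        snapshot = self._config.get(ns, DROPPED)  # read happens at Tn-1
        self._in_flight[ns] = (fut, snapshot)
        return fut

    def complete(self, ns):
        # Deliver the snapshot to the starter and to every joiner.
        fut, snapshot = self._in_flight.pop(ns)
        fut.set_result(snapshot)


def simulate():
    config = {"test.foo": "3|0"}
    cache = JoiningCatalogCache(config)
    first = cache.refresh_async("test.foo")    # step 3: refresh after moveChunk
    del config["test.foo"]                     # steps 4-5: collection is dropped
    second = cache.refresh_async("test.foo")   # step 6: joins the stale refresh
    cache.complete("test.foo")
    third = cache.refresh_async("test.foo")    # a refresh started after the drop
    cache.complete("test.foo")
    return first.result(), second.result(), third.result()
```

In this model the second caller receives the pre-drop version even though it asked after the drop, while a refresh that starts after the drop sees (0, 0). One way to avoid the stale result in such a model would be for a late caller to wait for the in-flight refresh and then start a new one; whether the actual fix takes that approach is not stated here.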
Original description:
After `test.foo` is dropped, the drop is not reflected in the catalogCache. After refreshing, the shard logs that it refreshed from the old version:
Refresh for collection test.foo took 92 ms and found version 3|0||5aac36f988c1f30185b7b1df
and the test fails because when trying to setShardVersion during the drop, the correct version is sent (version 0|0||000000000000000000000000), but the incorrect version is found.
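The mismatch between the two versions quoted above can be made explicit with a short sketch. It assumes the logged strings follow a `major|minor||epoch` layout; `ChunkVersion`, `parse_version`, and `is_dropped` are hypothetical helpers written for illustration, not server APIs.

```python
from typing import NamedTuple


class ChunkVersion(NamedTuple):
    major: int
    minor: int
    epoch: str


def parse_version(text):
    """Parse a string assumed to look like 'major|minor||epoch'."""
    numbers, epoch = text.split("||")
    major, minor = numbers.split("|")
    return ChunkVersion(int(major), int(minor), epoch)


def is_dropped(version):
    """True for the all-zero version that is sent when a collection is dropped."""
    return (version.major == 0 and version.minor == 0
            and set(version.epoch) == {"0"})


sent = parse_version("0|0||000000000000000000000000")   # sent during the drop
found = parse_version("3|0||5aac36f988c1f30185b7b1df")  # found by the stale refresh
mismatch = sent != found
```

Here `is_dropped(sent)` holds and `is_dropped(found)` does not, so the two sides disagree about whether the collection still exists, which matches the failure described above.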
Attached is the test to reproduce the error as well as the logs.
- is duplicated by
  - SERVER-31659 Investigate causal consistency violation when getting errors from config server (Closed)
- related to
  - SERVER-43520 Complete TODO listed in SERVER-33954 (Closed)