- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc0, 8.0.0-rc13
- Component/s: Query Execution
- None
- Query Execution
- Fully Compatible
- ALL
- v8.0
- QE 2024-07-22, QE 2024-08-05
- 0
SPM-3229 introduced the usage of the routing role API in queries. For example, SERVER-83751 changed how the $unionWith stage uses the mongoProcessInterface so that it goes through the routing role API, which ensures query correctness with regard to data placement by honoring the shard versioning protocol during query execution. However, the explain code does not currently use the routing role API, which makes the following scenario possible:
Suppose a sharded cluster with 3 shards: shard0, shard1 and shard2
- We create a database db, and its primary shard is shard0
- We create two collections: coll1 and coll2
- We move coll1 to shard2 with moveCollection
- We issue a query to shard2 that refreshes its caches for both coll1 and coll2, for example an aggregation with a $unionWith stage
- We move coll2 to shard1, making shard2's cache of coll2 stale
- We issue an explain of an aggregation with the $unionWith stage, with coll1 as outer collection, and coll2 as inner collection
- The explain command will be sent first to shard2, and then to shard0, because shard2's cache for coll2 is stale
- Considering the code pointed at above, the cache of coll2 will not be refreshed in shard2, and the exception will be thrown back to the router
- The router will exhaust its retries and the query will fail
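The scenario above can be sketched with a toy model. This is not MongoDB code: the `Catalog`, `Shard`, and `StaleConfig` names, the integer placement versions, and the retry count of 3 are all assumptions made for illustration. It only shows why retrying without refreshing shard2's cache can never succeed, while a refresh-then-retry loop (roughly what the routing role API is described as providing for the stage path) finds coll2 on shard1.

```python
from dataclasses import dataclass, field


class StaleConfig(Exception):
    """Hypothetical error: the sender's cached placement is out of date."""


@dataclass
class Catalog:
    # Authoritative placement: namespace -> (owning shard, placement version).
    placement: dict = field(default_factory=dict)

    def move(self, ns, to_shard):
        _, version = self.placement.get(ns, (None, 0))
        self.placement[ns] = (to_shard, version + 1)


@dataclass
class Shard:
    name: str
    catalog: Catalog
    cache: dict = field(default_factory=dict)  # this shard's routing cache

    def refresh(self, ns):
        self.cache[ns] = self.catalog.placement[ns]

    def check_version(self, ns, sent_version):
        owner, version = self.catalog.placement[ns]
        if owner != self.name or sent_version != version:
            raise StaleConfig(ns)


def run_inner(shards, from_shard, ns):
    """Target the inner collection using from_shard's cached placement."""
    owner, version = from_shard.cache[ns]
    shards[owner].check_version(ns, version)
    return owner


def with_routing_role(shards, from_shard, ns, attempts=3):
    """Stage-like path (assumed): refresh the local cache on error, then retry."""
    for _ in range(attempts):
        try:
            return run_inner(shards, from_shard, ns)
        except StaleConfig:
            from_shard.refresh(ns)
    raise StaleConfig(ns)


def router_explain(shards, from_shard, ns, retries=3):
    """Explain-like path (assumed): the router retries, the shard cache stays stale."""
    for _ in range(retries):
        try:
            return run_inner(shards, from_shard, ns)
        except StaleConfig:
            pass  # nothing refreshes from_shard's cache between retries
    raise StaleConfig(ns)


def build_scenario():
    catalog = Catalog()
    shards = {n: Shard(n, catalog) for n in ("shard0", "shard1", "shard2")}
    catalog.move("db.coll1", "shard0")  # both collections start on the primary shard
    catalog.move("db.coll2", "shard0")
    catalog.move("db.coll1", "shard2")  # moveCollection coll1 -> shard2
    shards["shard2"].refresh("db.coll1")  # query that warms both caches on shard2
    shards["shard2"].refresh("db.coll2")
    catalog.move("db.coll2", "shard1")  # shard2's cache of coll2 is now stale
    return shards
```

In this model, calling `router_explain(shards, shards["shard2"], "db.coll2")` on the result of `build_scenario()` raises `StaleConfig` after all retries, whereas `with_routing_role` with the same arguments refreshes once and then targets shard1.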
You can find a reproducer attached to the ticket. We should ensure the shard role API is being used not only in the stage code, but also in the explain path.