A shard role nested inside a router role does not handle the StaleConfig exception correctly, which breaks the shard versioning protocol.
In particular, the StaleConfig exception is caught and handled by the RouterRole, which invalidates and refreshes the catalog cache and then retries the operation without updating the Database/Collection Sharding State (DSS/CSS).
Manifestation
In most cases, this simply adds latency to the execution of the query/command, because the router role retries 10 times before bubbling up the error. Only then does the ServiceEntryPoint on the shard get the chance to update the DSS/CSS.
When executing inside a transaction the situation is worse: the transaction may never succeed, even if the driver keeps retrying it.
Because it is executing a transaction, the shard must acquire locks on the collection before entering the router role. As a result, after 10 retries the router role bubbles up ShardCannotRefreshDueToLocksHeld instead of StaleConfig. When this error reaches the service entry point, only the catalog cache is refreshed, not the DSS/CSS.
This is one example of where we use a shard role nested inside a router role, so transactions over views are definitely affected by this problem.
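The two manifestations described above can be illustrated with a small toy model. This is only a sketch, not server code: `ToyShard`, `router_role`, `service_entry_point`, `run_with_caller_retries`, and the integer versions are all hypothetical names and simplifications, and the model assumes the shard's installed DSS/CSS version starts out behind the version the request carries.

```python
from dataclasses import dataclass

MAX_ROUTER_RETRIES = 10  # retry budget of the router role, as described above


class StaleConfig(Exception):
    """Received version does not match the shard's installed DSS/CSS version."""


class ShardCannotRefreshDueToLocksHeld(Exception):
    """Bubbled up instead of StaleConfig when collection locks are already held."""


@dataclass
class ToyShard:
    # All fields are hypothetical stand-ins for real sharding state.
    latest_version: int = 2         # authoritative version
    catalog_cache_version: int = 2  # routing info cached on this node
    installed_version: int = 1      # DSS/CSS, what the shard role checks against
    shard_role_attempts: int = 0

    def shard_role(self, received_version):
        self.shard_role_attempts += 1
        if received_version != self.installed_version:
            raise StaleConfig()
        return "ok"


def router_role(shard, in_transaction):
    """Router role retry loop wrapping a nested shard role operation."""
    for _ in range(MAX_ROUTER_RETRIES):
        try:
            return shard.shard_role(shard.catalog_cache_version)
        except StaleConfig:
            # Only the catalog cache is refreshed; DSS/CSS is never touched here.
            shard.catalog_cache_version = shard.latest_version
    if in_transaction:
        raise ShardCannotRefreshDueToLocksHeld()
    raise StaleConfig()


def service_entry_point(shard, in_transaction):
    """Return True if the command succeeded, False if it failed back to the caller."""
    try:
        router_role(shard, in_transaction)
        return True
    except StaleConfig:
        shard.installed_version = shard.latest_version  # DSS/CSS recovered
        return False
    except ShardCannotRefreshDueToLocksHeld:
        shard.catalog_cache_version = shard.latest_version  # DSS/CSS still stale
        return False


def run_with_caller_retries(shard, in_transaction, caller_attempts):
    """Return the attempt number that succeeded, or None if none did."""
    for attempt in range(1, caller_attempts + 1):
        if service_entry_point(shard, in_transaction):
            return attempt
    return None
```

Under these assumptions, `run_with_caller_retries(ToyShard(), False, 5)` succeeds on the second attempt, after 10 wasted shard role attempts, while the same call with `in_transaction=True` returns `None` no matter how many attempts the caller is given, because nothing in that path ever updates `installed_version`.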
- is caused by
  - SERVER-81233 Prevent kickback to router when reading from views on unsplittable collections located on the db-primary (Closed)
- related to
  - SERVER-77402 Create ShardRole retry loop utility (Backlog)