Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
ALL
-
I have a sharded cluster with 4000 sharded collections. After running a stepdown on one of the shards, many transactions fail with errors such as:
(StaleConfig) Transaction 6ba3e857-e289-4ab1-a63c-c038a18bfc6c:614 was aborted on statement 1 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: epoch mismatch detected for xx.xx, the collection may have been dropped and recreated
The config server's log shows:
[PeriodicShardedIndexConsistencyChecker] Attempt 0 to check index consistency for millionGroup.g_m_version1 received StaleShardVersion error :: caused by :: StaleConfig{ ns: "millionGroup.g_m_version1", vReceived: Timestamp(1, 3), vReceivedEpoch: ObjectId('6189f321bbcd3f66776bbe8a'), vWanted: Timestamp(0, 0), vWantedEpoch: ObjectId('000000000000000000000000') }: epoch mismatch detected for millionGroup.g_m_version1, the collection may have been dropped and recreated
Similarly, after adding a shard to the sharded cluster, many transactions fail with errors such as:
(StaleConfig) Transaction 324be44f-a3d4-4ee5-9fc4-9bbff6d53ffe:25 was aborted on statement 0 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: version mismatch detected for xx.xx
For the latter case, `jstests/sharding/transactions_stale_shard_version_errors.js` explains that transaction failure is expected behavior after a chunk migration.
I can work around both problems by running findOne (with read preference primary) on each collection after the stepdown or chunk migration and before executing the transaction.
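For reference, the workaround can be sketched roughly as follows. This is only a minimal illustration, not the exact script I ran: `warm_up_collections` is a hypothetical helper name, `client` is assumed to be a PyMongo-style client connected to mongos with the default (primary) read preference, and `namespaces` is assumed to be a list of (database, collection) pairs that the application already knows.

```python
from typing import Any, Iterable, List, Tuple


def warm_up_collections(
    client: Any, namespaces: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Run one non-transactional findOne per collection.

    Hypothetical helper: `client` is assumed to behave like a PyMongo
    MongoClient using the default (primary) read preference.
    Returns (database, collection, error message) for every read that failed.
    """
    failures = []
    for db_name, coll_name in namespaces:
        try:
            # Plain read outside any transaction, issued once per collection
            # after the stepdown or chunk migration.
            client[db_name][coll_name].find_one({})
        except Exception as exc:
            # Record the failure and keep going so that one bad collection
            # does not block the warm-up of the remaining ones.
            failures.append((db_name, coll_name, str(exc)))
    return failures
```

It would be called once after the stepdown or migration, for example `warm_up_collections(client, [("millionGroup", "g_m_version1")])`, before transactions are resumed. With 4000 collections this means 4000 extra reads, which is part of why I am asking whether it should be necessary.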
So my questions and suggestions are:
1. Is it expected behavior that the first transaction executed on each collection is aborted after the stepdown completes?
2. Why can't the catalog cache (or something else) on the shard be refreshed in time to ensure that the transaction is not aborted because of an epoch/version mismatch?