- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: 4.2.5, 4.0.17
- Component/s: Sharding
- Fully Compatible
- ALL
- Sharding 2019-12-30, Sharding 2020-04-20
- (copied to CRM)
ISSUE DESCRIPTION AND IMPACT
A bug in shard version checking causes a race condition between parallel chunk migrations and auto-split activity.
If the race condition occurs, an affected shard becomes unable to update its sharding metadata, and operations that require data from that shard will fail.
While it is possible for the issue to clear on its own, it is likely to persist until action is taken.
DIAGNOSIS AND AFFECTED VERSIONS
Sharded clusters with two or more shards running MongoDB 4.2.5 or earlier (including 4.0.17) are impacted. The bug is much more likely to be triggered on 4.0.17 than on other versions.
If the bug is triggered, client operations will begin failing with "version mismatch detected" (StaleConfig) errors, and the corresponding mongos logs will include "requested shard version differs from config shard version" error messages.
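As a rough triage aid, the two signatures above can be matched against error or log text. The following is a minimal sketch; the function name classifyLine is hypothetical, and only the two quoted substrings come from this ticket. Real log lines carry timestamps and other fields, so this is a substring check rather than a parser.

```javascript
// Substrings quoted in this ticket; everything else here is illustrative.
const CLIENT_SIGNATURE = "version mismatch detected";
const MONGOS_SIGNATURE = "requested shard version differs from config shard version";

// Returns "mongos", "client", or null depending on which signature the line contains.
function classifyLine(line) {
  if (line.includes(MONGOS_SIGNATURE)) return "mongos";
  if (line.includes(CLIENT_SIGNATURE)) return "client";
  return null;
}
```

A match is only a hint that the cluster may be affected, since StaleConfig errors can also occur transiently during normal operation.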
REMEDIATION AND WORKAROUNDS
If running MongoDB version 4.0.17, downgrade to 4.0.16 or upgrade to 4.0.18 when it becomes available.
If running MongoDB version 4.2.5, upgrade to version 4.2.6 when it becomes available.
If a version change is not possible, this issue can be partially mitigated by:
- Disabling the balancer.
- Waiting for the balancer to stop running.
- Running the following command on the primary replica set member of each shard, where ns is the full namespace ("<database>.<collection>") of the sharded collection to refresh:
db.adminCommand({_flushRoutingTableCacheUpdates: ns, syncFromConfig: true})
If you re-enable the balancer, the bug can be triggered again.
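The last step of the workaround can be sketched as a small helper for the case where several namespaces need refreshing. This is an assumption-laden sketch, not an official procedure: the helper names buildFlushCommand and flushRoutingTables are hypothetical, and db is assumed to be a shell-style handle connected to a shard primary that exposes adminCommand().

```javascript
// Builds the command document from this ticket for one namespace ("<database>.<collection>").
function buildFlushCommand(ns) {
  return { _flushRoutingTableCacheUpdates: ns, syncFromConfig: true };
}

// Runs the flush once per namespace through the given `db` handle (assumed to be
// connected to a shard primary) and reports whether each call returned ok: 1.
function flushRoutingTables(db, namespaces) {
  return namespaces.map((ns) => {
    const res = db.adminCommand(buildFlushCommand(ns));
    return { ns: ns, ok: Boolean(res && res.ok === 1) };
  });
}
```

In the mongo shell on each shard primary, this could be called as flushRoutingTables(db, ["mydb.mycoll"]), where "mydb.mycoll" is a placeholder namespace.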
Note: If the sharded cluster is running with authentication enabled, running the _flushRoutingTableCacheUpdates command requires the internal action on the cluster resource.
One way to grant it is to create a new role with the internal privilege on the cluster resource and grant that role to the admin user, as below. Replace ADMIN_USER with the admin's username.
use admin
db.createRole({
  role: "flush_routing_table_cache_updates",
  privileges: [
    { resource: { cluster: true }, actions: [ "internal" ] }
  ],
  roles: []
})
db.grantRolesToUser("ADMIN_USER", ["flush_routing_table_cache_updates"])
FIX VERSIONS
4.2.6 and 4.0.18
original description
This should call getShardVersion() instead of getCollVersion(). Its only usage is here. Fortunately, the check here is still valid even though we were returning the collection version. Basically, if a shard knows about collection version X and shard version Y, then it's not possible for the actual shard version to be between X and Y, because otherwise the shard would know about it.
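The difference between the two versions can be illustrated with a simplified model. This sketch assumes the usual definitions: the collection version is the highest chunk version in the collection, and a shard's version is the highest chunk version among the chunks that shard owns. The helper names and sample chunks are hypothetical and are not the server's code; real chunk versions also carry an epoch, which is omitted here.

```javascript
// Simplified model: a chunk version is {major, minor}; the epoch is ignored.
function compareVersions(a, b) {
  return a.major !== b.major ? a.major - b.major : a.minor - b.minor;
}

// Highest version among the given chunks; {major: 0, minor: 0} is a placeholder for "no chunks".
function maxVersion(chunks) {
  return chunks.reduce(
    (best, c) => (compareVersions(c.version, best) > 0 ? c.version : best),
    { major: 0, minor: 0 }
  );
}

// Collection version: highest chunk version across all shards.
function collectionVersion(chunks) {
  return maxVersion(chunks);
}

// Shard version: highest chunk version among the chunks owned by one shard.
function shardVersion(chunks, shard) {
  return maxVersion(chunks.filter((c) => c.shard === shard));
}

// Hypothetical sample data: shardA owns two chunks, shardB owns one.
const sampleChunks = [
  { shard: "shardA", version: { major: 3, minor: 0 } },
  { shard: "shardA", version: { major: 2, minor: 1 } },
  { shard: "shardB", version: { major: 4, minor: 2 } },
];
```

In this model a shard's version never exceeds the collection version, so a check that uses one in place of the other can give a different answer whenever the newest chunk lives on another shard.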
- is duplicated by: SERVER-47432 Mongo Server error (MongoQueryException): Query failed with error code 13388 and error message 'Failed to run query after 10 retries :: caused by :: version mismatch detected (Closed)
- is related to: SERVER-53338 The best method resolving BUG SERVER-45119 of mongodb 4.2.3 for rhel7 x86_64 (Closed)