- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 4.0.26
- Component/s: Sharding
- None
- Fully Compatible
- ALL
- v4.2
- Sharding EMEA 2021-10-04, Sharding EMEA 2021-10-18, Sharding EMEA 2021-11-01, Sharding EMEA 2021-11-15, Sharding EMEA 2021-11-29, Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27
We run several sharded clusters on version 4.0.26. Occasionally an update operation becomes extremely slow, taking from tens of seconds to a few minutes.
After in-depth analysis, I believe this is a bug.
First, when a moveChunk moves a chunk from shard A to shard B, B first cleans up that chunk's range and waits for the cleanup to finish, so that the newly migrated data cannot be deleted by an older cleanup task that is still queued. As a result, a moveChunk can take a very long time, up to 15 minutes (this matches the default orphanCleanupDelaySecs of 900 seconds; rangeDeleterBatchDelayMS only sets the pause between delete batches).
// Wait for any other, overlapping queued deletions to drain
auto status = CollectionShardingRuntime::waitForClean(opCtx, _nss, _epoch, footprint);
Second, according to https://jira.mongodb.org/browse/SERVER-56779, starting with 4.0.26 MongoDB no longer uses the collection distributed lock for chunk merges and uses the ActiveMigrationsRegistry instead. This introduces a new scenario:
* - Move || Move (same chunk): The second move will join the first
* - Move || Move (different chunks or collections): The second move will result in a
* ConflictingOperationInProgress error
* - Move || Split/Merge (same collection): The second operation will block behind the first
* - Move/Split/Merge || Split/Merge (for different collections): Can proceed concurrently
In other words, a split on a collection is blocked by a moveChunk on the same collection until that moveChunk finishes.
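The rules quoted above can be read as a small decision table. The following is only a sketch of that reading, not the ActiveMigrationsRegistry implementation: the names `OpKind`, `Outcome`, `Op`, and `resolve` are hypothetical, the first operation in each rule is assumed to be the one already running, and any combination the quoted comment does not mention is reported as `NotCovered` rather than guessed.

```cpp
#include <string>

// Hypothetical model of the operations named in the quoted comment.
enum class OpKind { Move, Split, Merge };

// Possible results for the second (incoming) operation.
enum class Outcome { Join, Conflict, Block, Proceed, NotCovered };

struct Op {
    OpKind kind;
    std::string collection;  // namespace, e.g. "db.coll"
    std::string chunk;       // illustrative chunk identifier
};

// Applies the four quoted rules, treating `active` as the first operation
// and `incoming` as the second one.
Outcome resolve(const Op& active, const Op& incoming) {
    const bool sameCollection = active.collection == incoming.collection;

    if (incoming.kind == OpKind::Move) {
        if (active.kind != OpKind::Move) {
            return Outcome::NotCovered;  // the quoted comment does not say
        }
        // Move || Move: join on the same chunk, otherwise conflict.
        const bool sameChunk = sameCollection && active.chunk == incoming.chunk;
        return sameChunk ? Outcome::Join : Outcome::Conflict;
    }

    // From here on the incoming operation is a Split or a Merge.
    if (!sameCollection) {
        return Outcome::Proceed;  // different collections run concurrently
    }
    if (active.kind == OpKind::Move) {
        return Outcome::Block;  // split/merge waits behind the running move
    }
    return Outcome::NotCovered;  // Split/Merge || Split/Merge, same collection
}
```

Under this reading, `resolve({OpKind::Move, "db.c", "k1"}, {OpKind::Split, "db.c", "k2"})` yields `Outcome::Block`, which is the case that matters for this ticket.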
Finally, in 4.0.26 the auto-split is still triggered by mongos and runs as part of the update operation.
So the following sequence can occur: a chunk is moved from shard A to shard B, and is then moved back from shard B to shard A. The second moveChunk is blocked waiting for cleanup, for up to 15 minutes. An update on the same collection then triggers a splitChunk, the splitChunk waits for that moveChunk to finish, and the update is blocked behind the splitChunk.
From 4.2 onward, the auto-split is triggered by mongod and runs as an asynchronous task, so this problem only affects 4.0.26.
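To make the blocking chain concrete, here is a rough timing model of the scenario described above. It is a simplification under stated assumptions, not server code: the struct `MoveChunkWindow`, the enum `SplitMode`, and the function `updateStall` are hypothetical names, the update is assumed to trigger an auto-split on arrival, and the split is assumed to wait for the whole remaining duration of the moveChunk.

```cpp
#include <chrono>

using Seconds = std::chrono::seconds;

// Hypothetical description of one in-flight moveChunk on a collection.
struct MoveChunkWindow {
    Seconds start;        // when the migration was registered
    Seconds cleanupWait;  // time spent waiting for the queued range deletion
    Seconds transfer;     // time spent copying and committing the chunk
};

// How the auto-split relates to the update that triggered it.
enum class SplitMode {
    InlineWithUpdate,  // 4.0 behavior described in this ticket
    Asynchronous       // 4.2 behavior described in this ticket
};

// Returns how long an update arriving at `updateArrival` is stalled by the
// split that waits behind the moveChunk, under the assumptions above.
Seconds updateStall(const MoveChunkWindow& move, Seconds updateArrival, SplitMode mode) {
    if (mode == SplitMode::Asynchronous) {
        return Seconds{0};  // the split no longer sits on the update path
    }
    const Seconds moveEnd = move.start + move.cleanupWait + move.transfer;
    if (updateArrival < move.start || updateArrival >= moveEnd) {
        return Seconds{0};  // no moveChunk is active when the update arrives
    }
    return moveEnd - updateArrival;  // blocked until the moveChunk ends
}
```

With illustrative numbers (a 900-second cleanup wait and a 30-second transfer), an update that arrives 100 seconds after the moveChunk starts is stalled for 830 seconds in the inline mode and not at all in the asynchronous mode.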
- is caused by: SERVER-56779 Do not use the collection distributed lock for chunk merges (Closed)