-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Sharding
-
Fully Compatible
-
ALL
-
Sharding 2022-09-05
-
138
-
3
Per the following logs in the failing test:
[js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:31.656+00:00 W CONTROL 23829 [conn1] "Set failpoint","attr":{"failPointName":"reshardingPauseCoordinatorBeforeCompletion","failPoint":{"mode":0,"data":{},"timesEntered":0}} [js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:33.138+00:00 I RESHARD 4956902 [ReshardingCoordinatorService-0] "Resharding failed","attr":{"namespace":"reshardingDb.coll","newShardKeyPattern":{"newKey":1},"error":{"code":5356800,"codeName":"Location5356800","errmsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: Resharding completed with non-empty stash collections"}} .. .. [js_test:resharding_nonblocking_coordinator_rebuild] c20779| 2022-08-15T22:34:34.792+00:00 I RESHARD 5343001 [ReshardingCoordinatorService-1] "Transitioned resharding coordinator state","attr":{"newState":"initializing","oldState":"unused","namespace":"reshardingDb.coll","collectionUUID":{"uuid":{"$uuid":"98d3baa8-562a-4c71-8156-8cb71202bc58"}},"reshardingUUID":{"uuid":{"$uuid":"afc0b895-e087-436d-a9d8-b141ea128266"}}} .. .. [js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:37.057+00:00 I COMMAND 51803 [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602870,"i":25}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602870,"i":24}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602870,"i":24}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: E11000 duplicate key error collection: reshardingDb.system.resharding.98d3baa8-562a-4c71-8156-8cb71202bc58 index: _id_ dup key: { _id: 0.0 }","errName":"DuplicateKey","errCode":11000,"reslen":682,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":2359}
It can be seen that the first _configsvrReshardCollection command fails as it is supposed to with code 5356800 during oplog application. However a subsequent retry of the same command after the first one finishes causes DuplicateKey (11000) due to the global id constraint violation found in cloning phase for second reshard collection command rather than in oplog application phase. From the following logs, it can be seen that it is the _shardsvrReshardCollection command that is being retried, unlike BF-26040:
[js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:30.663+00:00 I COMMAND 51803 [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602862,"i":29}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602862,"i":29}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602862,"i":29}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Not Primary when trying to create a new instance of PrimaryOnlyService ReshardingCoordinatorService","errName":"NotWritablePrimary","errCode":10107,"reslen":574,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":6497} ... [js_test:resharding_nonblocking_coordinator_rebuild] d20770| 2022-08-15T22:34:37.057+00:00 I COMMAND 51803 [conn64] "Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_shardsvrReshardCollection":"reshardingDb.coll","key":{"newKey":1},"_presetReshardedChunks":[{"recipientShardId":"shard2-recipient0","min":{"newKey":{"$minKey":1}},"max":{"newKey":{"$maxKey":1}}}],"writeConcern":{"w":"majority","wtimeout":60000},"databaseVersion":{"uuid":{"$uuid":"76d42834-8a2e-4942-bc32-d520dae1731c"},"timestamp":{"$timestamp":{"t":1660602860,"i":3}},"lastMod":1},"lsid":{"id":{"$uuid":"62f0a515-4f10-4caa-844c-a07714eadb7c"},"uid":{"$binary":{"base64":"u4nTF1+wmByGgmwndZCCo3FgRx9gUEtGEkFRhsYwq3A=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1660602870,"i":25}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1660602870,"i":24}},"$topologyTime":{"$timestamp":{"t":1660602860,"i":1}},"$audit":{"$impersonatedUsers":[{"user":"__system","db":"local"}],"$impersonatedRoles":[]},"$client":{"application":{"name":"MongoDB Shell","pid":"32822"},"driver":{"name":"MongoDB Internal Client","version":"5.0.11-6-ga2642ba"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ip-10-128-0-163:20782","client":"10.128.0.163:49632","version":"5.0.11-6-ga2642ba"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1660602870,"i":24}},"t":-1}},"$db":"admin"},"numYields":0,"ok":0,"errMsg":"Recipient shard shard2-recipient0 reached an unrecoverable error :: caused by :: E11000 duplicate key error collection: reshardingDb.system.resharding.98d3baa8-562a-4c71-8156-8cb71202bc58 index: _id_ dup key: { _id: 0.0 }","errName":"DuplicateKey","errCode":11000,"reslen":682,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"FeatureCompatibilityVersion":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"writeConcern":{"w":"majority","wtimeout":60000,"provenance":"clientSupplied"},"remote":"10.128.0.163:44006","protocol":"op_msg","durationMillis":2359}
max.hirschhorn@mongodb.com pointed out about changes in how _shardsvrReshardCollection is handled between v5.0 and later branches with SERVER-62064.
The NotWritablePrimary error is being propagated to mongos via ShardingDDLCoordinator_NORESILIENT after exhausting _configsvReshardCollection command retries which in turn leads to a retry of _shardsvrReshardCollection from mongos. Essentially, if the command is retried or sent again to primary shard from mongos then it waits to acquire ScopedDistLock before creating an instance of ReshardCollectionCoordinator. This happens after the previous operation completes fully, not allowing a retry to possibly join the existing operation.
A fix here would be to recommit SERVER-61607 and accept DuplicateKey as a possible error code.