- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 5.0.0, 5.1.0-rc2
- Component/s: Sharding
- Fully Compatible
- v5.1, v5.0
- Sharding 2021-11-01
Cause
If the write to the config server that updates the coordinator document takes longer than the specified wTimeout, the command fails with a WriteConcernFailed error, which the resharding service treats as fatal.
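For reference, the write concern that hits this timeout (visible in the fatal log line under Source below) is effectively `{ w: "majority", wtimeout: 60000 }`. Below is a minimal C++ sketch of an equivalent `WriteConcernOptions`, assuming the standard constructor and constants from the server codebase; the actual definition of `ShardingCatalogClient::kMajorityWriteConcern` may differ in detail:

```cpp
#include "mongo/db/write_concern_options.h"

// Sketch only: roughly what the write concern behind the failure looks like,
// matching the error details in the fatal log line under Source:
//   { w: "majority", wtimeout: 60000 }
// If majority replication takes longer than 60 seconds, the write fails with
// ErrorCodes::WriteConcernFailed (wtimeout: true), which resharding does not retry.
const mongo::WriteConcernOptions kMajorityWithTimeout(
    mongo::WriteConcernOptions::kMajority,        // w: "majority"
    mongo::WriteConcernOptions::SyncMode::UNSET,  // journaling behavior left to the server
    mongo::Milliseconds{60000});                  // wtimeout: 60000 ms
```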
Context
Both the ReshardingRecipientService and the ReshardingDonorService need to update the coordinator document on the config server with the latest state of the recipient and donor, respectively. While both services retry the errors they encounter, not every error is considered retryable. One of the non-retryable errors is WriteConcernFailed, which can be produced by a write-concern timeout.
Problem
The resharding components are designed to simply wait for the result of these writes, so a write that is merely taking a long time should not surface as an error.
In the ReshardingDonorService, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Remove the usage of `ShardingCatalogClient::kMajorityWriteConcern` in its call to updateConfigDocument on the catalogClient; it should be replaced with `{w: "majority"}` with no wtimeout specified.
In the RecipientStateMachineExternalStateImpl, the updateCoordinatorDocument function likewise uses the CatalogClient to update the coordinator document on the config server. Remove the usage of `ShardingCatalogClient::kMajorityWriteConcern` in its call to updateConfigDocument on the catalogClient; it should also be replaced with `{w: "majority"}` with no wtimeout specified, as sketched below.
Source
You can see the following log lines in this Evergreen patch:
756555:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F", "c":"RESHARD", "id":5160600, "ctx":"ReshardingDonorService-1","msg":"Unrecoverable error occurred past the point donor was prepared to complete the resharding operation","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 60000, provenance: \"clientSupplied\" } }"}}
756556:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F", "c":"ASSERT", "id":23089, "ctx":"ReshardingDonorService-1","msg":"Fatal assertion","attr":{"msgid":5160600,"file":"src/mongo/db/s/resharding/resharding_donor_service.cpp","line":459}}
756557:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.275+00:00"},"s":"F", "c":"ASSERT", "id":23090, "ctx":"ReshardingDonorService-1","msg":"\n\n***aborting after fassert() failure\n\n"}
- is depended on by: SERVER-57686 We need test coverage that runs resharding in the face of elections (Closed)
- related to: SERVER-82838 ReshardingOplogApplier uses {w: "majority", wtimeout: 60000} write concern when persisting resharding oplog application progress (Closed)