- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 5.0.0, 5.1.0-rc2
- Component/s: Sharding
- Fully Compatible
- v5.1, v5.0
- Sharding 2021-11-01
Cause
If the write to the config server that updates the coordinator document takes longer than the specified wTimeout, the command fails with a WriteConcernFailed error, which the resharding service treats as fatal.
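For reference, the write concern that hits this timeout (visible in the fatal log line under Source below) is effectively `{ w: "majority", wtimeout: 60000 }`. Below is a minimal C++ sketch of an equivalent `WriteConcernOptions`, assuming the standard constructor and constants from the server codebase; the actual definition of `ShardingCatalogClient::kMajorityWriteConcern` may differ in detail:

```cpp
#include "mongo/db/write_concern_options.h"

// Sketch only: roughly what the write concern behind the failure looks like,
// matching the error details in the fatal log line under Source:
//   { w: "majority", wtimeout: 60000 }
// If majority replication takes longer than 60 seconds, the write fails with
// ErrorCodes::WriteConcernFailed (wtimeout: true), which resharding does not retry.
const mongo::WriteConcernOptions kMajorityWithTimeout(
    mongo::WriteConcernOptions::kMajority,        // w: "majority"
    mongo::WriteConcernOptions::SyncMode::UNSET,  // journaling behavior left to the server
    mongo::Milliseconds{60000});                  // wtimeout: 60000 ms
```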
Context
Both the ReshardingRecipientService and the ReshardingDonorService need to update the coordinator document on the config server with the latest state of the recipient and donor, respectively. While both services retry the errors they encounter, not every error is considered retryable. One of the non-retryable errors is WriteConcernFailed, which can be produced by a write-concern timeout.
Problem
The resharding components are designed to simply wait for the result of these writes, so a write that is merely taking a long time should not surface as an error.
In the ReshardingDonorService, the updateCoordinatorDocument function uses the CatalogClient to update the coordinator document on the config server. Remove the usage of `ShardingCatalogClient::kMajorityWriteConcern` in its call to updateConfigDocument on the catalogClient; it should be replaced with `{w: "majority"}` with no wtimeout specified.
In the RecipientStateMachineExternalStateImpl, the updateCoordinatorDocument function likewise uses the CatalogClient to update the coordinator document on the config server. Remove the usage of `ShardingCatalogClient::kMajorityWriteConcern` in its call to updateConfigDocument on the catalogClient; it should also be replaced with `{w: "majority"}` with no wtimeout specified, as sketched below.
Source
You can see the following log lines in this Evergreen patch:
756555:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F", "c":"RESHARD", "id":5160600, "ctx":"ReshardingDonorService-1","msg":"Unrecoverable error occurred past the point donor was prepared to complete the resharding operation","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 60000, provenance: \"clientSupplied\" } }"}}
756556:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.274+00:00"},"s":"F", "c":"ASSERT", "id":23089, "ctx":"ReshardingDonorService-1","msg":"Fatal assertion","attr":{"msgid":5160600,"file":"src/mongo/db/s/resharding/resharding_donor_service.cpp","line":459}}
756557:[j0:s1:n1] {"t":{"$date":"2021-10-27T18:51:32.275+00:00"},"s":"F", "c":"ASSERT", "id":23090, "ctx":"ReshardingDonorService-1","msg":"\n\n***aborting after fassert() failure\n\n"}
- is depended on by: SERVER-57686 We need test coverage that runs resharding in the face of elections (Closed)
- related to: SERVER-82838 ReshardingOplogApplier uses {w: "majority", wtimeout: 60000} write concern when persisting resharding oplog application progress (Closed)