Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc1
Affects Version/s: 5.0.0
Component/s: Sharding
Labels:
- PM-234-M3
- PM-234-T-lifecycle

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v5.1, v5.0
Sprint:
Sharding 2021-09-20, Sharding 2021-10-04, Sharding 2021-10-18
Story Points:
1
Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

The changes from fee0349 as part of SERVER-53912 introduced a RecipientStateMachine::_restoreMetrics() method. When the state machine starts running again, this method calculates the number of documents it cloned, oplog entries it fetched, and oplog entries it applied. These read operations may be interrupted if the primary steps down shortly after having stepped up. The call to RecipientStateMachine::_restoreMetrics() should be placed in a resharding::WithAutomaticRetry() block so that any transient errors are automatically retried and the retrying is synchronized with the cancellation of the stepdown token.
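To illustrate the retry behavior described above, here is a minimal, self-contained sketch. It is not the resharding::WithAutomaticRetry() implementation; `CancelFlag`, `TransientError`, and `runWithRetry` are hypothetical stand-ins for the stepdown token, an error such as InterruptedDueToReplStateChange, and the retry wrapper. The idea it shows: a transient failure is retried while the token is still live, and is rethrown to the caller once the token has been canceled.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the stepdown cancellation token.
struct CancelFlag {
    std::atomic<bool> canceled{false};
    bool isCanceled() const { return canceled.load(); }
};

// Hypothetical stand-in for a transient error such as
// InterruptedDueToReplStateChange.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `body` until it succeeds and returns the number of attempts made.
// A TransientError is retried while the token is live; once the token is
// canceled, the error is rethrown so the caller can stop cleanly.
inline int runWithRetry(const CancelFlag& token,
                        const std::function<void()>& body) {
    assert(static_cast<bool>(body));  // an empty callable is a caller bug
    int attempts = 0;
    while (true) {
        ++attempts;
        try {
            body();
            return attempts;
        } catch (const TransientError&) {
            if (token.isCanceled()) {
                throw;  // the stepdown is real: give up instead of looping
            }
            // Token still live: treat the interruption as transient, retry.
        }
    }
}
```

In this sketch the metrics-restoring reads would be the `body`; whether the real wrapper backs off between attempts or classifies errors differently is not something this example tries to model.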

As a bonus on this ticket, we should see whether the resharding code can invariant that every use of CancelableOperationContextFactory occurs within a resharding::WithAutomaticRetry() block.
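One possible shape for such a check, sketched under stated assumptions: an RAII marker records that a retry block is active on the current thread, and the factory refuses to hand out a context when no marker is active. `RetryScope` and `ContextFactory` are hypothetical names, not the server's classes, and the sketch throws where real code would likely use an invariant. A thread-local marker also would not follow work that hops between executor threads, so this only illustrates the idea.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical RAII marker: tracks how many retry blocks are active on
// the current thread.
class RetryScope {
public:
    RetryScope() { ++depth(); }
    ~RetryScope() {
        assert(depth() > 0);  // every exit must match an earlier entry
        --depth();
    }
    RetryScope(const RetryScope&) = delete;
    RetryScope& operator=(const RetryScope&) = delete;

    static bool active() { return depth() > 0; }

private:
    static int& depth() {
        thread_local int d = 0;
        return d;
    }
};

// Hypothetical stand-in for a factory of cancelable operation contexts.
struct ContextFactory {
    // Returns a placeholder value; real code would build a context here.
    int makeContext() const {
        if (!RetryScope::active()) {
            throw std::logic_error("context requested outside a retry block");
        }
        return 1;
    }
};
```

A retry wrapper would construct a `RetryScope` around each attempt, so any context created outside of it is caught immediately.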

[2021/09/02 13:54:14.598] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.598+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
[2021/09/02 13:54:14.603] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.602+00:00 I  REPL     21358   [ReplCoord-7] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
[2021/09/02 13:54:14.615] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.615+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
[2021/09/02 13:54:14.823] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.823+00:00 I  REPL     21402   [conn4] "Stepping down from primary, because a new term has begun","attr":{"term":6}
...
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  RESHARD  5551101 [ReshardingRecipientService-5] "Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  ASSERT   23089   [ReshardingRecipientService-5] "Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":404}

https://evergreen.mongodb.com/lobster/evergreen/test/mongodb_mongo_master_enterprise_rhel_80_64_bit_resharding_fuzzer_stepup_1_enterprise_rhel_80_64_bit_patch_23f9d2a53917d63fc3d3b8c8646f40f2bc4caa2f_6130c5d561837d6514713be6_21_09_02_12_44_55/0/6130e1232fd552933c3a0c9a#bookmarks=0%2C26712%2C26783%2C34982%2C35005%2C35160%2C35363%2C35372%2C35373%2C35629%2C153752%2C153798&f~=000~d20021%5C%7C&f~=100~%5C%5BResharding.%2AService&f~=010~%28REPL_HB%7CELECTION%29&f~=011~REPL_HB&l=1

is caused by

SERVER-53912 ReshardingRecipientService instances to load metrics state upon instantiation

Closed

is depended on by

SERVER-53351 Add resharding fuzzer task with step-ups enabled for shards

Closed

Assignee: Brett Nawrocki

Reporter: Max Hirschhorn

Participants: Brett Nawrocki, Githook User, Max Hirschhorn

Votes: 0

Watchers: 2

Created: Sep 13 2021 11:08:18 PM UTC

Updated: Oct 29 2023 09:48:39 PM UTC

Resolved: Oct 12 2021 02:51:50 PM UTC

Confidence Status Last Update: 24/Sep/21 3:29 PM
