The changes from fee0349 as part of SERVER-53912 introduced a RecipientStateMachine::_restoreMetrics() method that, when the recipient starts running again, recalculates the number of documents it cloned, oplog entries it fetched, and oplog entries it applied. These read operations may be interrupted if the primary steps down shortly after having stepped up. The call to RecipientStateMachine::_restoreMetrics() should be placed in a resharding::WithAutomaticRetry() block so that transient errors are retried automatically and retrying is synchronized with cancellation of the stepdown token.
As a bonus on this ticket, we should see if it is possible to have the resharding code invariant that all usages of CancelableOperationContextFactory occur within a resharding::WithAutomaticRetry() block.
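The retry behavior requested above can be illustrated with a minimal, synchronous sketch. It is not the real resharding::WithAutomaticRetry() API, which is not shown in this ticket. TransientError, CancelToken, RetryResult, and withAutomaticRetrySketch are hypothetical names invented for illustration, and the sketch assumes the body signals a transient failure by throwing.

```cpp
#include <atomic>
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a transient error such as
// InterruptedDueToReplStateChange.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the stepdown cancellation token.
struct CancelToken {
    std::atomic<bool> canceled{false};
    bool isCanceled() const { return canceled.load(); }
};

enum class RetryResult { kSucceeded, kCanceled };

// Runs `body` until it succeeds or the token is canceled. Transient errors
// are swallowed and retried; any other exception propagates to the caller.
RetryResult withAutomaticRetrySketch(const CancelToken& token,
                                     const std::function<void()>& body) {
    while (!token.isCanceled()) {
        try {
            body();
            return RetryResult::kSucceeded;
        } catch (const TransientError&) {
            // Loop around and re-check the token before trying again.
        }
    }
    return RetryResult::kCanceled;
}
```

With this shape, an interruption while restoring metrics would either be retried or end quietly once the stepdown token is canceled, rather than surfacing as an unrecoverable error. The actual fix would presumably need to fit the recipient's existing asynchronous control flow, which this sketch does not model.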
[2021/09/02 13:54:14.598] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.598+00:00 I COMMAND 21581 [conn1] "Received replSetStepUp request"
[2021/09/02 13:54:14.603] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.602+00:00 I REPL 21358 [ReplCoord-7] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
[2021/09/02 13:54:14.615] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.615+00:00 I REPL 21331 [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
[2021/09/02 13:54:14.823] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.823+00:00 I REPL 21402 [conn4] "Stepping down from primary, because a new term has begun","attr":{"term":6}
...
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F RESHARD 5551101 [ReshardingRecipientService-5] "Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F ASSERT 23089 [ReshardingRecipientService-5] "Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":404}
- is caused by: SERVER-53912 ReshardingRecipientService instances to load metrics state upon instantiation (Closed)
- is depended on by: SERVER-53351 Add resharding fuzzer task with step-ups enabled for shards (Closed)