Type: Bug
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.4.0, 3.6.0, 4.0.0, 4.2.0
Component/s: Sharding
Labels:
None

Assigned Teams:

Sharding EMEA
Operating System:
ALL
Steps To Reproduce:

Apply server42192.patch to allow the moveChunk command to be automatically retried in the presence of failovers, then run the agg_with_chunk_migrations.js FSM workload. The --repeat option is necessary because this concurrency test reproduces the issue often, but not on every run.

python buildscripts/resmoke.py --suite=concurrency_sharded_multi_stmt_txn_terminate_primary jstests/concurrency/fsm_workloads/agg_with_chunk_migrations.js --repeat=10
Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

The changes from cc8e8a1 as part of SERVER-26307 made it so a BalancerInterrupted error response is no longer returned when the moveChunk command fails due to a retryable error on the replica set shard primary. Additionally, the changes from 53efde3 as part of SERVER-25999 made it so an OperationFailed error status is returned by MigrationManager::_processRemoteCommandResponse() instead. However, any non-BalancerInterrupted error status is converted to an ok=1 response so long as the chunk has successfully been migrated. That conversion does not check whether _waitForDelete=true was specified in the moveChunk command request, so it can report success even though we may not have waited long enough for the range to be cleaned up.

We should either (a) wait long enough, or (b) preserve the OperationFailed error response as a way to inform the user.

Status commandStatus = _processRemoteCommandResponse(
    remoteCommandResponse, &statusWithScopedMigrationRequest.getValue());

// Migration calls can be interrupted after the metadata is committed but before the command
// finishes the waitForDelete stage. Any failovers, therefore, must always cause the moveChunk
// command to be retried so as to assure that the waitForDelete promise of a successful command
// has been fulfilled.
if (chunk->getShardId() == migrateInfo.to && commandStatus != ErrorCodes::BalancerInterrupted) {
    return Status::OK();
}
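
Option (b) above could look roughly like the following. This is a minimal standalone sketch, not the server's code: the `Code` enum, the `resolveMoveChunkOutcome` function, and its `chunkOnRecipient` and `waitForDelete` parameters are hypothetical stand-ins for `ErrorCodes`, the check shown above, and the request's _waitForDelete field. It assumes that when the check above does not apply, the original command status is what gets returned.

```cpp
// Hypothetical stand-in for the relevant ErrorCodes values (illustration only).
enum class Code { OK, BalancerInterrupted, OperationFailed };

// Decide what status to report for a moveChunk attempt.
//   chunkOnRecipient: the chunk is already owned by the destination shard.
//   commandStatus:    status derived from the remote command response.
//   waitForDelete:    the request asked to wait for range deletion (assumed flag).
Code resolveMoveChunkOutcome(bool chunkOnRecipient, Code commandStatus, bool waitForDelete) {
    // Only mask an error as success when the chunk has moved, the error is not
    // BalancerInterrupted, and the caller did not ask to wait for range
    // deletion. With waitForDelete set, the error is preserved so the caller
    // learns that cleanup may not have finished.
    if (chunkOnRecipient && commandStatus != Code::BalancerInterrupted && !waitForDelete) {
        return Code::OK;
    }
    return commandStatus;
}
```

With this shape, a request that did not set the flag behaves as before, while a request that set it sees the OperationFailed status and can decide whether to retry.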


server42192.patch
Mar 06 2020 01:38:35 AM UTC
29 kB
Max Hirschhorn

is caused by

SERVER-26307 MigrationManager can keep a migration document when not in stepdown / shutdown because it can't differentiate between its own error codes and those of the shard with which it communicates

Closed

is related to

SERVER-25999 Mongos applies errors received from config server as config server errors, rather than a shard the config server calls and returns the error from

Closed

SERVER-42192 Write a concurrency workload to test that orphaned ranges are always deleted and nothing that shouldn’t be deleted gets deleted

Closed

related to

SERVER-53094 Tests which use {waitForDelete:true} on moveChunk are not safe to run in the sharding_csrs_continuous_config_stepdown suite

Closed

SERVER-66716 WaitForDelete may not be honored in case of retry

Closed

SERVER-64181 Remove TODO listed in SERVER-46669

Closed


Assignee: [DO NOT USE] Backlog - Sharding EMEA

Reporter: Max Hirschhorn

Participants: [DO NOT USE] Backlog - Sharding EMEA, Cris Insignares Cuello, Max Hirschhorn

Votes: 0

Watchers: 4

Created: Mar 06 2020 01:39:17 AM UTC

Updated: Dec 06 2022 02:33:50 AM UTC

Resolved: Aug 23 2022 01:47:51 PM UTC

GA Target Date: None

Public Preview Target Date: None

Private Preview Target Date: None

Experiment Target Date: None
