Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
- former-quick-wins

Assigned Teams:

Replication
Operating System:
ALL
Sprint:
Repl 2018-05-07
Linked BF Score:
6
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

ReplicationCoordinatorImpl::stepDown() calls ReplicationCoordinatorExternalStateImpl::killAllUserOperations() prior to calling TopologyCoordinator::prepareForStepDownAttempt().

auto globalLock = stdx::make_unique<Lock::GlobalLock>(
    opCtx, MODE_X, stepDownUntil, Lock::GlobalLock::EnqueueOnly());

// We've requested the global exclusive lock which will stop new operations from coming in,
// but existing operations could take a long time to finish, so kill all user operations
// to help us get the global lock faster.
_externalState->killAllUserOperations(opCtx);

...

status = _topCoord->prepareForStepDownAttempt();

With the current behavior, a retryable write can be retried against the very primary that is in the midst of stepping down, because server selection may still choose that node. Since the global X lock is held for the duration of the primary's stepdown attempt, the retry blocks on the server until ReplicationCoordinatorExternalStateImpl::closeConnections() (and thus ServiceEntryPoint::endAllSessions()) has been called. The driver then sees a network error and does not retry again, because it has already used its single retry attempt.
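The retry exhaustion described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: AttemptResult, WriteOutcome, runRetryableWrite() and failFirst() are hypothetical names invented for illustration, not driver or server APIs, and the "one retry" limit is taken from the description above.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result of a single write attempt as seen by a driver.
enum class AttemptResult { kOk, kRetryableError };

// Hypothetical summary of a retryable write: did it succeed, and how many
// attempts (initial attempt plus retries) were made.
struct WriteOutcome {
    bool succeeded;
    int attempts;
};

// Model of a retry policy: one initial attempt plus at most maxRetries
// retries. The default of 1 mirrors the single retry described in the ticket.
WriteOutcome runRetryableWrite(const std::function<AttemptResult(int)>& attempt,
                               int maxRetries = 1) {
    assert(maxRetries >= 0);
    int attempts = 0;
    for (int i = 0; i <= maxRetries; ++i) {
        ++attempts;
        if (attempt(i) == AttemptResult::kOk) {
            return {true, attempts};
        }
    }
    // Retry quota exhausted: the error is surfaced to the application.
    return {false, attempts};
}

// Hypothetical scripted server: the first n attempts fail with a retryable
// error, every later attempt succeeds.
std::function<AttemptResult(int)> failFirst(int n) {
    return [n](int attemptIndex) {
        return attemptIndex < n ? AttemptResult::kRetryableError
                                : AttemptResult::kOk;
    };
}
```

In this model, failFirst(2) stands in for the scenario in the ticket: the initial attempt fails, and the single retry lands on the same node and fails again when its connection is closed, so runRetryableWrite() reports failure after two attempts. With failFirst(1), the retry reaches a node that accepts the write and the operation succeeds.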

For comparison, the reconnect() function in jstests/replsets/rslib.js works around this issue by retrying until it succeeds in running the "collStats" command. Unlike the "isMaster" command, the "collStats" command requires acquiring the global lock and therefore must wait until the stepdown has finished.
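The retry-until-success idea behind that workaround can be sketched as follows. This is not the code in rslib.js (which is JavaScript); reconnectUntilProbeSucceeds() and its probe callback are hypothetical names, and the probe is assumed to throw std::runtime_error while the stepdown is still closing connections.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical helper: call `probe` until it stops throwing.
// Returns the number of attempts used, or -1 if maxAttempts was exhausted.
// `probe` stands in for a command that must acquire the global lock, so it
// can only succeed once the stepdown has finished.
int reconnectUntilProbeSucceeds(const std::function<void()>& probe,
                                int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            probe();
            return attempt;  // The probe ran, so the node is usable again.
        } catch (const std::runtime_error&) {
            // Assumed failure mode: the connection was closed by the
            // stepdown. Fall through and try again.
        }
    }
    return -1;
}
```

The design point is the choice of probe: a command that does not need the global lock could succeed while the stepdown is still in progress, so it would not prove that the node is ready for subsequent operations.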

is related to

SERVER-74409 StreamableReplicaSetMonitor::getHostsOrRefresh Can Return Out of Date Information

Closed

related to

SERVER-57167 Prevent throwing on session creation due to stepdown before stepdown completes

Closed

SERVER-34666 Reduce the number of retries needed for running the retryable_writes_jscore_stepdown_passthrough.yml test suite

Backlog

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Max Hirschhorn
Participants: [DO NOT USE] Backlog - Replication Team, Matthew Russotto, Max Hirschhorn, Spencer Brody
Votes: 0
Watchers: 11

Created: Apr 23 2018 04:09:10 AM UTC
Updated: Oct 27 2023 01:53:51 PM UTC
Resolved: Feb 07 2020 03:56:46 PM UTC
