Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 7.1.0-rc0, 5.0.19, 7.0.0-rc6, 6.0.8
Affects Version/s: None
Component/s: Sharding
Labels: sharding-nyc-subteam1
Assigned Teams: Sharding NYC
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.0, v6.0, v5.0
Sprint: Sharding NYC 2023-05-29
Linked BF Score: 105
Story Points: 5
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

The _migrateClone command isn't supposed to run on secondaries, but once the command has been admitted, nothing rechecks that the node is still primary. If the command is still running after the node has transitioned to the secondary state, the following deadlock can occur:

1. The command blocks on a prepared transaction while holding the PBWM lock.
2. Oplog batch application blocks because the PBWM lock is held.
3. Any attempt to end the prepared transaction would never be processed because oplog application is stuck.

Note that this deadlock requires _migrateClone to start on a primary and then, after the node steps down to secondary, block on a prepared transaction.
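The three waits described above form a cycle, which can be checked with a small wait-for graph. This is only an illustrative sketch: the node labels, the `WAITS_FOR` tables, and `find_cycle` are hypothetical names invented here and are not part of the server code. The primary-side table assumes that on a primary the prepared transaction can be ended without going through oplog application.

```python
from typing import Dict, List, Optional

# Hypothetical labels for the three parties in the description.
# An edge "A -> B" means "A cannot make progress until B does".
WAITS_FOR_ON_SECONDARY: Dict[str, List[str]] = {
    # Blocked on the prepared transaction while holding the PBWM lock.
    "_migrateClone": ["prepared transaction"],
    # Blocked because _migrateClone holds the PBWM lock.
    "oplog batch application": ["_migrateClone"],
    # Ending the transaction has to be processed by oplog application.
    "prepared transaction": ["oplog batch application"],
}

# Assumption for contrast: on a primary, ending the prepared transaction
# does not depend on oplog application, so that edge is absent.
WAITS_FOR_ON_PRIMARY: Dict[str, List[str]] = {
    "_migrateClone": ["prepared transaction"],
    "oplog batch application": ["_migrateClone"],
    "prepared transaction": [],
}


def find_cycle(graph: Dict[str, List[str]], start: str) -> Optional[List[str]]:
    """Return one wait-for cycle reachable from `start`, or None."""
    path: List[str] = []
    on_path = set()

    def visit(node: str) -> Optional[List[str]]:
        if node in on_path:
            # Close the loop: slice the path from the first visit of `node`.
            return path[path.index(node):] + [node]
        path.append(node)
        on_path.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.discard(node)
        return None

    return visit(start)


if __name__ == "__main__":
    print(find_cycle(WAITS_FOR_ON_SECONDARY, "_migrateClone"))
    print(find_cycle(WAITS_FOR_ON_PRIMARY, "_migrateClone"))
```

With the secondary-side table the search returns a cycle through all three parties; with the primary-side table it returns `None`, which matches the note that the node must step down for the deadlock to be possible.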

The issue can most easily be solved in one of two ways:

1. After taking the AutoGetActiveCloner, check that the node is still primary and fail the _migrateClone command if it is not. This helper takes the AutoGetCollection, which in turn takes the RSTL, so the node is guaranteed to stay primary while the helper is in scope.
2. If for some reason we want this command to proceed while secondary, we could use AutoGetCollectionForRead or AutoGetCollectionForReadLockFree, which both skip taking the PBWM lock.
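The first option amounts to "acquire the lock that blocks state transitions, then recheck the state before doing any work". The toy model below sketches that pattern only. `ToyNode`, `NotPrimaryError`, `step_down`, and `migrate_clone` are hypothetical names, and a plain mutex stands in for the RSTL, which is a coarser model than the real lock modes.

```python
import threading
from typing import Callable, List


class NotPrimaryError(Exception):
    """Raised when the command finds the node is no longer primary."""


class ToyNode:
    """Toy replica set member; not the server's real classes."""

    def __init__(self) -> None:
        self.is_primary = True
        # Stand-in for the RSTL: a step-down cannot complete while a
        # command holds it, and vice versa.
        self._rstl = threading.Lock()

    def step_down(self) -> None:
        with self._rstl:
            self.is_primary = False

    def migrate_clone(self, read_batch: Callable[[], List[str]]) -> List[str]:
        # Take the lock first, then recheck the state. While the lock is
        # held the node cannot step down, so the check stays valid for
        # the whole batch read.
        with self._rstl:
            if not self.is_primary:
                raise NotPrimaryError("_migrateClone requires a primary")
            return read_batch()


if __name__ == "__main__":
    node = ToyNode()
    print(node.migrate_clone(lambda: ["doc1", "doc2"]))
    node.step_down()
    try:
        node.migrate_clone(lambda: ["doc3"])
    except NotPrimaryError as exc:
        print("rejected:", exc)
```

Checking the state before taking the lock would leave a window in which the node could step down between the check and the read, which is the situation the ticket describes.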

is related to: SERVER-71028 MigrationChunkClonerSourceLegacy::nextCloneBatch should ignore prepare conflict (Backlog)
related to: SERVER-77242 Audit call sites that take the PBWM lock (Backlog)

Assignee: Randolph Tan
Reporter: Louis Williams
Participants: Githook User, Louis Williams, Randolph Tan
Votes: 0
Watchers: 8

Created: Apr 26 2023 01:13:33 PM UTC
Updated: Oct 29 2023 09:22:28 PM UTC
Resolved: May 17 2023 07:44:50 PM UTC
Confidence Status Last Update: 16/May/23 6:10 PM
