Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 6.1.0-rc0
Affects Version/s: None
Component/s: None
Labels: None

Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2022-05-16, Service Arch 2022-05-30
Confidence Status: None
Work Order: 3

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

The function OperationContext::setAlwaysInterruptAtStepDownOrUp can be used to request that an operation context is interrupted/killed on replication state change.

However, the function is dangerous to use because it is not synchronized internally with the RSTL (replication state transition lock) and does not require the caller to hold the RSTL. Unless the caller holds the RSTL, a node's view of its replication state may change while the function is being called. This causes subtle races in which an operation is registered to be killed only after the state change has already occurred. Here's an example of such a race:

1. Thread enters setAlwaysInterruptAtStepDownOrUp() on Op without holding the RSTL // Op not yet registered to be killed
2. Node steps down
3. setAlwaysInterruptAtStepDownOrUp() completes // Op registered to be killed now
4. Op continues running anyway, because the step-down that should have killed it happened before it was registered
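
The interleaving above can be reproduced deterministically in a small single-threaded model. This is only a sketch under stated assumptions: `Op`, `Node`, `racyInterleavingLeavesOpAlive`, and `registerUnderLock` are hypothetical stand-ins invented for illustration, not the real `OperationContext` or replication coordinator classes, and a plain `std::mutex` stands in for the RSTL, which is a richer lock in the server.

```cpp
#include <cassert>  // used only by the usage checks described below
#include <mutex>
#include <vector>

// Hypothetical stand-in for an operation; not the real OperationContext.
struct Op {
    bool interruptOnStateChange = false;  // "registered to be killed"
    bool killed = false;
};

// Hypothetical stand-in for a replica set node.
class Node {
public:
    std::mutex rstl;  // simplified stand-in for the RSTL

    // A state change kills only the ops that are registered at that moment.
    void stepDown(const std::vector<Op*>& ops) {
        std::lock_guard<std::mutex> lk(rstl);
        primary = false;
        for (Op* op : ops) {
            if (op->interruptOnStateChange) {
                op->killed = true;
            }
        }
    }

    // Only meaningful while the caller holds `rstl` (or, as here, when
    // everything runs on a single thread).
    bool isPrimary() const { return primary; }

private:
    bool primary = true;
};

// Replays the racy interleaving step by step on one thread.
// Returns true if the op survived the step-down.
bool racyInterleavingLeavesOpAlive() {
    Node node;
    Op op;
    // 1. The thread has entered the registration call without the lock;
    //    the flag is not set yet.
    node.stepDown({&op});              // 2. Node steps down; op is not registered.
    op.interruptOnStateChange = true;  // 3. Registration completes too late.
    return !op.killed;                 // 4. Op keeps running.
}

// One possible safe pattern: check the state and register while holding
// the lock, so a step-down cannot slip in between the two.
// Returns false if the node has already stepped down.
bool registerUnderLock(Node& node, Op& op) {
    std::lock_guard<std::mutex> lk(node.rstl);
    if (!node.isPrimary()) {
        return false;  // caller should fail the op instead of running it
    }
    op.interruptOnStateChange = true;
    return true;
}
```

In this model, `racyInterleavingLeavesOpAlive()` returns `true`, while an op registered through `registerUnderLock` before a step-down is killed by it, and registration attempted after the step-down is refused. The real fix discussed in this ticket is documentation and naming; the locked variant is shown only to make the required caller-side synchronization concrete.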

We have seen many BFs with races of the above form. As a first step to improve the situation, in this ticket, let's add documentation emphasizing that this function has no built-in RSTL synchronization, and state the risk of a concurrent replication state change clearly. Let's also consider adding _UNSAFE to the end of the function name, which would follow the pattern of functions on the replication coordinator that are not synchronized with the RSTL (see https://github.com/mongodb/mongo/blob/d5399825310e599b0cad119664c23e10d98ca5af/src/mongo/db/repl/replication_coordinator_impl.h#L152).

Is related to: SERVER-59719 shardsvr{Commit, Abort}ReshardCollection may return unrecoverable error on stepdown, leading to fassert() on config server (Closed)

Assignee: George Wangensteen (Inactive)
Reporter: George Wangensteen (Inactive)
Participants: George Wangensteen, Githook User
Votes: 0
Watchers: 5

Created: May 10 2022 03:16:51 PM UTC
Updated: Oct 29 2023 09:38:24 PM UTC
Resolved: May 24 2022 02:46:20 PM UTC
Confidence Status Last Update: 11/May/22 6:44 PM
