Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.10, 6.0.0-rc9, 6.1.0-rc0
Affects Version/s: 5.3.0, 5.0.6
Component/s: None
Labels:
- sharding-wfbf-day

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v6.0, v5.3, v5.0
Steps To Reproduce:
0001-Repro-BF-24832.patch

./buildscripts/resmoke.py run --storageEngine=wiredTiger --storageEngineCacheSizeGB=.50 --suite=sharding jstests/sharding/bf-24832-repro.js --log=file
Sprint:
Sharding EMEA 2022-05-02, Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13
Linked BF Score:
48
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

The shardsvr 'moveChunk' command is allowed on primary nodes only. However, this check is only best effort: the member state can change at any time afterwards, and the command will keep running.
The command body does take some precautions to ensure a stable member state: it briefly takes the GlobalLock in mode IX to:
(1) Flag the opCtx to be killed on stepdown
(2) Synchronize with the thread that kills opCtxs on stepdown
This ensures that the MigrationSourceManager will run within a single term (see BF-24411). However, it doesn't ensure that this node is primary. For instance, the following interleaving could happen:
1. The node is primary when this is evaluated
2. The node becomes secondary here
3. Here the opCtx will get flagged as killable on stepdown, but the node has already stepped down, so it won't be interrupted!

In this scenario the command will continue executing and will instantiate a MigrationSourceManager:
4. The MSM will check that there are no migrations pending recovery. Assume that there are none at this point.
5. Now the new primary starts a migration, inserts its recovery document and the old primary replicates it.
6. Now the old primary evaluates this invariant, finds the document inserted in (5), and crashes.
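
The interleaving in steps 1 to 6 can be replayed with a small deterministic model. This is only a sketch, written in Python rather than the server's C++, and every name in it (`Node`, `OpCtx`, `step_down`, `run_move_chunk`, `outcome`, `recheck_primary`) is hypothetical and does not correspond to real server code. The `recheck_primary` option illustrates one possible mitigation, namely checking the member state again after the opCtx has been flagged; that is an assumption made for illustration, not a description of the committed fix.

```python
from dataclasses import dataclass, field


class InvariantFailure(Exception):
    """Stands in for a server invariant firing (a crash in the real system)."""


class NotPrimary(Exception):
    """Stands in for a 'not primary' error returned to the caller."""


@dataclass
class Node:
    is_primary: bool = True
    recovery_docs: list = field(default_factory=list)


@dataclass
class OpCtx:
    killable_on_stepdown: bool = False
    killed: bool = False


def step_down(node, op_ctxs):
    """Stepdown kills only the operations that are already flagged."""
    node.is_primary = False
    for op in op_ctxs:
        if op.killable_on_stepdown:
            op.killed = True


def run_move_chunk(node, recheck_primary):
    """Replay the problematic interleaving on a toy node (hypothetical model)."""
    op = OpCtx()
    # Step 1: best-effort check, the node is still primary here.
    if not node.is_primary:
        raise NotPrimary("rejected by the initial check")
    # Step 2: the node steps down before the opCtx has been flagged.
    step_down(node, [op])
    # Step 3: the flag is set too late, so the operation was not interrupted.
    op.killable_on_stepdown = True
    assert not op.killed
    # Assumed mitigation: look at the member state again after flagging.
    if recheck_primary and not node.is_primary:
        raise NotPrimary("rejected by the second check")
    # Step 4: no migrations are pending recovery at this point.
    if node.recovery_docs:
        raise InvariantFailure("unexpected pending recovery")
    # Step 5: the new primary's recovery document is replicated to this node.
    node.recovery_docs.append({"migration": "started-by-new-primary"})
    # Step 6: the invariant expects no recovery document and fires.
    if node.recovery_docs:
        raise InvariantFailure("found a migration recovery document")


def outcome(recheck_primary):
    """Return the name of the exception that ends the modeled command."""
    try:
        run_move_chunk(Node(), recheck_primary)
    except (InvariantFailure, NotPrimary) as exc:
        return type(exc).__name__
    return "completed"


if __name__ == "__main__":
    print("without re-check:", outcome(False))
    print("with re-check:", outcome(True))
```

In this model the command without the second check reaches the invariant and fails, while the variant with the second check is rejected before a MigrationSourceManager would be instantiated.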

0001-SERVER-65371-Ensure-MigraitonSourceManager-is-only-i.patch
2 kB
Apr 08 2022 11:53:32 AM UTC
0001-Repro-BF-24832.patch
6 kB
Apr 08 2022 11:47:28 AM UTC

is caused by

SERVER-62296 MoveChunk should recover any unfinished migration before starting a new one

Closed

Assignee: Paolo Polato
Reporter: Jordi Serra Torrens
Participants: Githook User, Jordi Serra Torrens, Paolo Polato
Votes: 0
Watchers: 3

Created: Apr 08 2022 11:47:58 AM UTC
Updated: Oct 29 2023 09:39:49 PM UTC
Resolved: Jun 02 2022 04:04:56 PM UTC
Confidence Status Last Update: 01/Jun/22 2:25 PM
