Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.0-rc19
Affects Version/s: 8.1.0-rc0
Component/s: None
Labels:
- 8.0-release-blocker

Assigned Teams: Cluster Scalability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0
Linked BF Score: 200
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

The resharding cloner logic currently has a potential flaw that could lead to data loss if an exception occurs during the refresh phase. The flaw was spotted after the introduction (and subsequent reversion) of ~~SERVER-92530~~. Reverting ~~SERVER-92530~~ mitigated the immediate risk, but the underlying issue in the cloner logic still needs to be addressed to make the cloner more resilient and prevent data loss in future scenarios.

Issue Summary:

The cloner fetches the list of documents from the donor and writes them to a temporary collection.
The write operation is managed by resharding::data_copy::withOneStaleConfigRetry, which forces a refresh and calls the failed callback again on a StaleConfig error.
In that case, _writeOnceWithNaturalOrder is executed twice.
However, the query's dispatchResult is moved on the first attempt, so the second attempt sees an empty batch.
This can leave the temporary collection empty and, as a consequence, cause data loss.
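
The failure mode in the steps above can be reproduced in isolation. The following is a minimal sketch, not the server code: `StaleConfigError`, `FakeTempCollection`, `withOneStaleConfigRetrySketch`, `buggyWrite`, and `retrySafeWrite` are hypothetical names invented for illustration, and the real helper's signature and the actual fix may differ. It only shows how a callback that moves its batch loses the data when it is invoked a second time, and one possible way to keep the callback safe to retry.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a StaleConfig error.
struct StaleConfigError : std::runtime_error {
    StaleConfigError() : std::runtime_error("StaleConfig") {}
};

using Batch = std::vector<std::string>;

// Simplified model of a one-retry helper: on StaleConfig it would
// refresh routing information and then invoke the callback once more.
template <typename Callable>
auto withOneStaleConfigRetrySketch(Callable&& callable) {
    try {
        return callable();
    } catch (const StaleConfigError&) {
        // (the refresh would happen here in a real implementation)
        return callable();
    }
}

// Hypothetical temporary collection whose first insert fails with
// StaleConfig after it has already taken ownership of the batch.
struct FakeTempCollection {
    Batch docs;
    bool failNext = true;

    std::size_t insert(Batch batch) {
        if (failNext) {
            failNext = false;
            throw StaleConfigError();
        }
        const std::size_t n = batch.size();
        for (auto& doc : batch) {
            docs.push_back(std::move(doc));
        }
        return n;
    }
};

// Buggy pattern: the batch is moved on the first attempt, so the
// retry inserts a moved-from (empty) vector.
std::size_t buggyWrite(Batch batch) {
    FakeTempCollection coll;
    withOneStaleConfigRetrySketch([&] { return coll.insert(std::move(batch)); });
    return coll.docs.size();
}

// Retry-safe pattern: each attempt receives its own copy, so the
// original batch is still intact when the callback runs again.
std::size_t retrySafeWrite(Batch batch) {
    FakeTempCollection coll;
    withOneStaleConfigRetrySketch([&] { return coll.insert(batch); });
    return coll.docs.size();
}
```

With a three-document batch, `buggyWrite` leaves the fake collection empty while `retrySafeWrite` stores all three documents. Copying per attempt is only one option; passing the batch by const reference, or moving it only once the write can no longer be retried, would avoid the copy.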

Currently, _writeOnceWithNaturalOrder is protected from hitting a StaleConfig error because the cloner thread created by the function itself follows the same logic: in case of StaleConfig, it is the cloner thread that refreshes.
~~SERVER-92530~~ introduced the possibility for a refresh to fail. If the refresh fails, _writeOnceWithNaturalOrder would find stale metadata and behave as described above. Although ~~SERVER-92530~~ has been reverted, we still plan to re-commit it once the related issues are fixed.

is related to

SERVER-94387 Convert resharding_collection_cloner_stale_config_retry.js into an unittest

Backlog

Assignee: Kruti Shah
Reporter: Enrico Golfieri
Participants: Enrico Golfieri, Githook User, Kruti Shah
Votes: 0
Watchers: 14

Created: Aug 28 2024 01:55:09 PM UTC
Updated: Sep 18 2024 03:49:59 PM UTC
Resolved: Sep 04 2024 01:39:32 PM UTC
Confidence Status Last Update: 29/Aug/24 9:32 PM
