- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 8.1.0-rc0
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ALL
- v8.0
- 200
The resharding cloner logic currently has a flaw that could lead to data loss if an exception occurs during the refresh phase. This was spotted after the introduction (and subsequent reversion) of SERVER-92530. Although reverting SERVER-92530 mitigated the immediate risk, the underlying issue in the cloner logic still needs to be addressed to make the cloner more resilient and prevent data loss in future scenarios.
Issue Summary:
- The cloner fetches the list of documents from the donor and writes them to a temporary collection.
- The write operation is managed by resharding::data_copy::withOneStaleConfigRetry, which forces a refresh and re-calls the failed callback in case of a StaleConfig error.
- In such cases, _writeOnceWithNaturalOrder is executed twice.
- However, the dispatchResult query is moved during the first attempt, so the second attempt sees an empty batch.
- This can leave the temporary collection empty and, as a consequence, cause data loss.
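The failure mode in the list above can be reproduced in isolation. The following is a minimal sketch, not the server code: `StaleConfigError`, `withOneRetry`, `writeWithMovedBatch`, and `writeWithPreservedBatch` are hypothetical stand-ins, and a plain `std::vector<int>` replaces the fetched documents. It shows that a callback which moves its input on the first call has nothing left to write when the retry wrapper calls it again, and contrasts that with one possible way to keep the callback safe to re-run (not necessarily the fix adopted for this ticket).

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a StaleConfig error.
struct StaleConfigError : std::runtime_error {
    StaleConfigError() : std::runtime_error("StaleConfig") {}
};

// Hypothetical stand-in for a "retry once on StaleConfig" wrapper:
// runs the callback and, if it throws StaleConfigError, runs it once more.
template <typename Callback>
auto withOneRetry(Callback&& callback) {
    try {
        return callback();
    } catch (const StaleConfigError&) {
        // A real implementation would force a metadata refresh here.
        return callback();
    }
}

// Buggy shape: the callback consumes `fetched` on its first invocation.
// Returns the number of documents seen by the attempt that succeeded.
std::size_t writeWithMovedBatch(std::vector<int> fetched) {
    bool firstAttempt = true;
    return withOneRetry([&]() -> std::size_t {
        // Move construction leaves `fetched` empty, so a second
        // invocation of this callback starts with no documents.
        std::vector<int> batch(std::move(fetched));
        if (firstAttempt) {
            firstAttempt = false;
            throw StaleConfigError();  // simulate stale metadata on attempt 1
        }
        return batch.size();
    });
}

// One possible retry-safe shape (an assumption, for illustration only):
// the callback only borrows `fetched`, so every attempt sees the full batch.
std::size_t writeWithPreservedBatch(std::vector<int> fetched) {
    bool firstAttempt = true;
    return withOneRetry([&]() -> std::size_t {
        const std::vector<int>& batch = fetched;
        if (firstAttempt) {
            firstAttempt = false;
            throw StaleConfigError();  // simulate stale metadata on attempt 1
        }
        return batch.size();
    });
}
```

With a three-element input, the first function reports zero documents on the retried attempt while the second reports three, which mirrors the empty-batch symptom described above.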
Currently, _writeOnceWithNaturalOrder is protected from hitting a StaleConfig error because the cloner thread created by the function itself follows the same logic: in case of StaleConfig, it is the cloner thread that refreshes.
SERVER-92530 introduced the possibility for a refresh to fail. If the refresh fails, _writeOnceWithNaturalOrder would find stale metadata and behave as described above. Although SERVER-92530 has been reverted, we still plan to re-commit it once the related issues are fixed.
- is related to
- SERVER-94387 Convert resharding_collection_cloner_stale_config_retry.js into an unittest
- Backlog