-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Catalog and Routing
-
2
Once a node steps up, it will try to recover the shardVersion as part of the resume migration hook.
Until the resumed migration is over, the shardVersion is marked as UNKNOWN, so no read or write operation can be served.
As part of the resume, the migration will be completed. How it is completed depends on whether the migration was committed or aborted:
If the migration was aborted, the donor will:
- Exit the critical section on the recipient
- Schedule a range deletion for possible orphans on the recipient
- Delete the range deletion task locally
If the migration was committed, the donor will:
- Exit the critical section on the recipient
- Schedule a range deletion task locally for possible orphans on the donor
- Delete the range deletion task on the recipient
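The two completion paths above can be summarized in a small sketch. This is only an illustration of the ordering described in this ticket, not the server's implementation: `Decision`, `completion_steps`, and the step strings are hypothetical names introduced here.

```python
from enum import Enum
from typing import List


class Decision(Enum):
    """Outcome of the migration being resumed (hypothetical type)."""
    ABORTED = "aborted"
    COMMITTED = "committed"


def completion_steps(decision: Decision) -> List[str]:
    """Return, in order, the steps the donor runs to complete a resumed migration."""
    if decision is Decision.ABORTED:
        return [
            "exit critical section on recipient",
            # The recipient may hold partially copied documents (orphans).
            "schedule range deletion on recipient",
            "delete range deletion task on donor",
        ]
    if decision is Decision.COMMITTED:
        return [
            "exit critical section on recipient",
            # The donor no longer owns the range, so its copy is now orphaned.
            "schedule range deletion on donor",
            "delete range deletion task on recipient",
        ]
    raise ValueError(f"unknown decision: {decision!r}")
```

Both paths start by releasing the recipient's critical section and differ only in which side keeps a range deletion task.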
Ideally, the entire completion could be done asynchronously, which would re-enable reads and writes on the donor sooner.
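As a rough illustration of the ordering difference, the sketch below contrasts a synchronous resume with an asynchronous one. It is a toy model under stated assumptions: `Donor`, `resume_sync`, `resume_async`, and the `"RECOVERED"` marker are hypothetical names, the sleep stands in for the remote completion work, and whether serving operations before completion finishes is actually safe is exactly what this ticket asks to be evaluated.

```python
import asyncio


class Donor:
    """Toy model of a donor that has just stepped up (hypothetical)."""

    def __init__(self) -> None:
        self.shard_version = "UNKNOWN"  # no reads or writes are served yet
        self.events = []

    async def complete_migration(self) -> None:
        # Stand-in for exiting the recipient's critical section and
        # adjusting the range deletion tasks.
        await asyncio.sleep(0.01)
        self.events.append("completion finished")

    async def resume_sync(self) -> None:
        # Behavior described in this ticket: operations stay blocked
        # until the whole completion is done.
        await self.complete_migration()
        self.shard_version = "RECOVERED"
        self.events.append("ops enabled")

    async def resume_async(self) -> "asyncio.Task":
        # Suggested behavior: run the completion in the background and
        # re-enable operations right away.
        task = asyncio.create_task(self.complete_migration())
        self.shard_version = "RECOVERED"
        self.events.append("ops enabled")
        return task
```

In the asynchronous variant the caller still has to await (or otherwise track) the returned task so that a failed completion is not silently lost.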
Note that this ticket is only a suggestion arising from the conclusions of the BF-34016 investigation, where the recovery on the donor caused a transaction on the recipient to block. The time and cost required for the implementation should be evaluated carefully.
In general, we should also evaluate whether the benefit of such an implementation would outweigh the costs.