- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Checkpoints
- StorEng - Refinement Pipeline
Generally, WiredTiger checkpoints are expected not to fail after a certain point. There is a point of no return: before it, the application can ignore a checkpoint failure and continue. In practice, most checkpoint errors are fatal, and WiredTiger is expected either to panic, or to roll back and return a panic to the application.
In WT-10989 I made several changes to the checkpoint's block manager. When a checkpoint is configured to also switch the underlying files and flush the previous files to the next tier, failures are hard to handle. If we have switched the underlying file and the checkpoint then fails at a later stage, ideally we should switch back to the previous file. But what if the newer file has already started receiving writes in the meantime?
Although the existing checkpoint code is already expected to treat errors at this stage as non-recoverable, I made sure that the system panics if an error occurs and the checkpoint is configured to flush the files.
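To make the discussion concrete, here is a minimal toy model of the behaviour described above. It is only a sketch, not WiredTiger code: `toy_conn`, `toy_checkpoint`, the stage names and the result codes are all hypothetical, and the file switch is reduced to incrementing an integer. It assumes the file switch is the point after which a flushing checkpoint can no longer be undone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stages of a checkpoint that may also flush to the next tier. */
typedef enum { STAGE_NONE, STAGE_PREPARE, STAGE_SWITCH, STAGE_FINISH } ckpt_stage;

/* Hypothetical outcomes; these are not WiredTiger return codes. */
typedef enum { CKPT_OK, CKPT_ERROR, CKPT_PANIC } ckpt_result;

/* Toy connection state. */
typedef struct {
    int active_file; /* identifier of the file currently taking writes */
    bool panicked;   /* set once the toy system has panicked */
} toy_conn;

/* Simulate one stage: fail only if it is the stage chosen for injection. */
static int
run_stage(ckpt_stage stage, ckpt_stage fail_at)
{
    return (stage == fail_at ? -1 : 0);
}

/* Run a toy checkpoint, injecting a failure at fail_at (STAGE_NONE for none). */
ckpt_result
toy_checkpoint(toy_conn *conn, bool flush, ckpt_stage fail_at)
{
    bool switched = false;

    assert(conn != NULL);

    /* Before the switch nothing has changed, so the caller may ignore the error. */
    if (run_stage(STAGE_PREPARE, fail_at) != 0)
        return (CKPT_ERROR);

    if (flush) {
        /* If the switch itself fails, the old file is still the active one. */
        if (run_stage(STAGE_SWITCH, fail_at) != 0)
            return (CKPT_ERROR);
        conn->active_file++; /* new writes may now land in the new file */
        switched = true;
    }

    if (run_stage(STAGE_FINISH, fail_at) != 0) {
        if (switched) {
            /* Switching back is unsafe: the new file may already hold writes. */
            conn->panicked = true;
            return (CKPT_PANIC);
        }
        /* Non-flush case: left to whatever handling the caller already has. */
        return (CKPT_ERROR);
    }
    return (CKPT_OK);
}
```

In this model, calling `toy_checkpoint(&conn, true, STAGE_FINISH)` leaves `active_file` advanced and returns `CKPT_PANIC`, while a failure injected at `STAGE_PREPARE` leaves the state untouched. Handling tier flush errors gracefully would amount to replacing the panic branch with a safe way to undo the switch.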
I am filing this ticket so the team can discuss the existing checkpoint failure handling and brainstorm whether we can handle tier flush errors gracefully, or whether it is safer to just panic, as proposed by WT-10989.
- related to
- WT-10989 Implement precise coordination of checkpoint and flush for tiered tables (Closed)