There are several pain points around debuggability and observability in the oplog applier:
- If an error happens in applyOplogEntryOrGroupedInsertsCommon, we may return an error status that gets converted into Status::OK() if the node is in initial sync or another recovery mode. However, the caller of the function will still log that we have applied the operation, even if the apply was unsuccessful. We should expose the conversion or otherwise avoid logging "Applied op" if the apply was unsuccessful.
- There are multiple layers where we do try/catch blocks. For instance, in applyOplogEntryOrGroupedInsertsCommon, then again in applyOplogBatchCommon. We should unify the way we catch errors.
- Related to point 2, today functions in the oplog applier code path can either return an error status or throw one. As a result, it is confusing where exactly an error came from. We should either always throw an error status, or catch an error status at the lowest level and return it all the way up.
- We catch errors and convert them to Status::OK() if the node is not in secondary oplog application mode. We should log a message when this occurs to make it clear what failures occurred, as this could lead to hidden data inconsistency.
EDIT: After triage today, we will focus this ticket on pain points #1 and #4. SERVER-78834 may refactor a lot of the code that affects #2 and #3.
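One possible shape for points 1 and 4, sketched as standalone C++ rather than actual server code. Everything here is an assumption made for illustration: `Status`, `ApplyMode`, `ApplyResult`, `finishApply`, `logApplied`, and the log message text are hypothetical and do not correspond to existing names in the oplog applier. The idea is to keep the original error next to the converted status, log when a conversion happens, and only log "Applied op" when the apply really succeeded.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for mongo::Status, for illustration only.
struct Status {
    int code;  // 0 means OK
    std::string reason;
    bool isOK() const { return code == 0; }
    static Status OK() { return Status{0, ""}; }
};

// Hypothetical stand-in for the oplog application mode.
enum class ApplyMode { kSecondary, kInitialSync, kRecovering };

// Hypothetical result type that exposes the conversion: `effective` is what
// the caller acts on, `original` is what the apply actually produced.
struct ApplyResult {
    Status effective;
    Status original;
    bool suppressed() const { return effective.isOK() && !original.isOK(); }
};

// Converts an error to OK outside secondary mode, but logs the conversion
// and keeps the original error in the result (point 4).
ApplyResult finishApply(const Status& raw, ApplyMode mode, std::ostream& log) {
    if (raw.isOK() || mode == ApplyMode::kSecondary) {
        return {raw, raw};
    }
    log << "Ignoring oplog application error outside secondary mode: code="
        << raw.code << " reason=" << raw.reason << "\n";
    return {Status::OK(), raw};
}

// Logs "Applied op" only if the apply itself succeeded (point 1).
void logApplied(const ApplyResult& result, const std::string& opDescription,
                std::ostream& log) {
    if (!result.original.isOK()) {
        return;
    }
    log << "Applied op: " << opDescription << "\n";
}
```

A caller would pass the raw status through `finishApply` and then call `logApplied` with the result. A suppressed error then produces the "Ignoring" line and no "Applied op" line.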
- is related to: SERVER-78834 Unify constraint checking knobs used in oplog application. (Backlog)