-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Not Applicable
-
Storage Engines
-
StorEng - Defined Pipeline
This came out of a HELP ticket + Slack thread (links in comments).
There are many functions where a non-zero return value could come from a number of places, for example __reconcile:
    WT_ERR(__rec_write_wrapup(session, r, page));
    __rec_write_page_status(session, r);
    WT_ERR(__reconcile_post_wrapup(session, r, page, flags, page_lockedp));

    // snip...

    if (__wt_ref_is_root(ref)) {
        WT_WITH_PAGE_INDEX(session, ret = __rec_root_write(session, page, flags));
        if (ret != 0)
            goto err;
        return (0);
    }

    // snip...

    WT_ERR(__wt_page_parent_modify_set(session, ref, true));

    // snip...

err:
    if (ret != 0)
        WT_RET_PANIC(session, ret, "reconciliation failed after building the disk image");
If we see this message in the wild, it's impractical to tell where it came from. There's a similar gap for functions using WT_ERR, where we often can't tell which "leaf" function call returned non-zero.
One thing that makes this fixable is that almost all of our error handling code goes through WT_RET, WT_ERR, or one of their related macros.
So it would be possible, without intrusive changes, to record more accurate failure information. For example, WT_RET could, for non-zero return values, store a tuple of (retval, line, function) in a small per-session circular buffer.
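A minimal sketch of what that could look like, to make the idea concrete. Everything here is hypothetical: SESSION stands in for WT_SESSION_IMPL, and ERR_TRACE, err_trace_record and TRACE_RET are illustrative names, not existing WiredTiger code. The buffer size is also an arbitrary assumption.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these exist in WiredTiger. */
#define ERR_TRACE_SLOTS 8 /* Power of two so the index can be masked. */

typedef struct {
    int ret;          /* The non-zero return value. */
    int line;         /* __LINE__ at the failing call site. */
    const char *func; /* __func__ at the failing call site. */
} ERR_TRACE_ENTRY;

typedef struct {
    ERR_TRACE_ENTRY slots[ERR_TRACE_SLOTS];
    uint64_t next; /* Total number of entries ever recorded. */
} ERR_TRACE;

/* Stand-in for the session structure, holding only the trace buffer. */
typedef struct {
    ERR_TRACE trace;
} SESSION;

/* Record one failure, overwriting the oldest slot once the buffer is full. */
static inline void
err_trace_record(SESSION *session, int ret, int line, const char *func)
{
    ERR_TRACE *t = &session->trace;
    ERR_TRACE_ENTRY *e = &t->slots[t->next++ & (ERR_TRACE_SLOTS - 1)];

    e->ret = ret;
    e->line = line;
    e->func = func;
}

/* A WT_RET-like macro: on a non-zero result, record it and return it. */
#define TRACE_RET(session, call)                                           \
    do {                                                                   \
        int trace_ret_ = (call);                                           \
        if (trace_ret_ != 0) {                                             \
            err_trace_record((session), trace_ret_, __LINE__, __func__);   \
            return (trace_ret_);                                           \
        }                                                                  \
    } while (0)

/* Example leaf function that optionally fails. */
static int
leaf(int fail)
{
    return (fail ? EINVAL : 0);
}

/* Example caller: the trace records which TRACE_RET line saw the failure. */
static int
caller(SESSION *session, int fail)
{
    TRACE_RET(session, leaf(0));
    TRACE_RET(session, leaf(fail));
    return (0);
}
```

Because __LINE__ and __func__ expand at the macro's use site, a failure in the second call above is recorded against caller and that call's line, which is the information the panic message is missing today.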
Exposing this is also possible - a WT_PANIC could dump it using the verbose system, and we could expose a wt_dump_crash_diagnostics API for callers that capture segmentation faults and attempt to print their own diagnostics (e.g. MongoDB).
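A rough sketch of the dump side, assuming a per-session ring like the one described above. The names (ERR_TRACE, err_trace_record, err_trace_dump) are hypothetical, and writing to a FILE * is a simplification: a real version would presumably go through the verbose system, and anything called from a signal handler could not use fprintf, which is not async-signal-safe.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names; a small ring so wrap-around is easy to see. */
#define ERR_TRACE_SLOTS 4

typedef struct {
    int ret;          /* The non-zero return value. */
    int line;         /* Line of the failing call site. */
    const char *func; /* Function containing the failing call site. */
} ERR_TRACE_ENTRY;

typedef struct {
    ERR_TRACE_ENTRY slots[ERR_TRACE_SLOTS];
    uint64_t next; /* Total number of entries ever recorded. */
} ERR_TRACE;

/* Record one failure, overwriting the oldest slot once the ring is full. */
static void
err_trace_record(ERR_TRACE *t, int ret, int line, const char *func)
{
    ERR_TRACE_ENTRY *e = &t->slots[t->next++ % ERR_TRACE_SLOTS];

    e->ret = ret;
    e->line = line;
    e->func = func;
}

/* Print the surviving entries, oldest first; return how many were printed. */
static int
err_trace_dump(const ERR_TRACE *t, FILE *fp)
{
    uint64_t first = t->next > ERR_TRACE_SLOTS ? t->next - ERR_TRACE_SLOTS : 0;
    int n = 0;

    for (uint64_t i = first; i < t->next; ++i, ++n) {
        const ERR_TRACE_ENTRY *e = &t->slots[i % ERR_TRACE_SLOTS];
        fprintf(fp, "[%d] %s:%d returned %d\n", n, e->func, e->line, e->ret);
    }
    return (n);
}
```

Printing oldest first means the last line of the dump is the most recent failure, which is usually the one closest to the panic.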
The most likely sticking point is performance: some non-zero return values (e.g. WT_NOTFOUND for cursors) are frequent and potentially on the hot path.
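One possible mitigation, sketched under assumptions: skip recording for return values that are expected in normal operation, and mark the recording branch as unlikely. EX_NOTFOUND and EX_IO are placeholder codes (not the real WiredTiger values), TRACE_STATS stands in for the trace buffer, and whether the branch hint makes a measurable difference would need benchmarking.

```c
#include <stdbool.h>

/* Placeholder error codes; the real values would come from wiredtiger.h. */
#define EX_NOTFOUND (-2) /* Frequent and expected, like WT_NOTFOUND. */
#define EX_IO (-5)       /* Rare and worth recording. */

#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical stand-in for the per-session trace buffer. */
typedef struct {
    unsigned recorded; /* Number of failures recorded. */
    int last_ret;      /* Most recent recorded return value. */
} TRACE_STATS;

/* Decide whether a return value is worth recording. */
static inline bool
err_trace_wanted(int ret)
{
    return (ret != 0 && ret != EX_NOTFOUND);
}

/* Record only the interesting failures; always pass the value through. */
static inline int
trace_filter(TRACE_STATS *stats, int ret)
{
    if (UNLIKELY(err_trace_wanted(ret))) {
        ++stats->recorded;
        stats->last_ret = ret;
    }
    return (ret);
}
```

With this shape, the success path and the expected-failure path each cost only a couple of integer comparisons, and only the unexpected failures touch the buffer.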
This ticket is just a discussion point for whether we want to do something like this, and a record of the decision we make + why.