Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Not Applicable
Labels: None
Assigned Teams: Storage Engines
Sprint: StorEng - Defined Pipeline
Story Points: None

This came out of a HELP ticket + Slack thread (links in comments).

There are many functions where a non-zero return value could come from a number of places, for example __reconcile:

    WT_ERR(__rec_write_wrapup(session, r, page));
    __rec_write_page_status(session, r);
    WT_ERR(__reconcile_post_wrapup(session, r, page, flags, page_lockedp));
    // snip...
    if (__wt_ref_is_root(ref)) {
        WT_WITH_PAGE_INDEX(session, ret = __rec_root_write(session, page, flags));
        if (ret != 0)
            goto err;
        return (0);
    }
    // snip...
    WT_ERR(__wt_page_parent_modify_set(session, ref, true));
    // snip...

err:
    if (ret != 0)
        WT_RET_PANIC(session, ret, "reconciliation failed after building the disk image");

If we see this message in the wild, it's impractical to tell which of the calls above produced the error. There's a similar gap for functions using WT_ERR in general, where we often can't tell which "leaf" function call returned non-zero.

One thing that helps here is that almost all of our error handling code goes through WT_RET, WT_ERR, or one of their related macros.

So it would be possible, without intrusive changes, to record more accurate failure information. For example, WT_RET could, for non-zero return values, store a tuple of (retval, line, function) in a small per-session circular buffer.
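A minimal sketch of what that could look like, for discussion only. Everything here is hypothetical: `demo_session` stands in for the relevant part of the session structure, `DEMO_RET` is a simplified stand-in for WT_RET rather than its real definition, and the buffer size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_TRACE_SLOTS 8 /* Hypothetical size; must be a power of two. */

/* One recorded failure: the (retval, line, function) tuple. */
struct err_trace_entry {
    int ret;
    int line;
    const char *func;
};

/* Hypothetical stand-in for the per-session state. */
struct demo_session {
    struct err_trace_entry trace[ERR_TRACE_SLOTS];
    unsigned int trace_next; /* Total number of entries ever recorded. */
};

/* Overwrite the oldest slot once the buffer is full. */
static void
err_trace_record(struct demo_session *session, int ret, int line, const char *func)
{
    struct err_trace_entry *e;

    assert(session != NULL);
    e = &session->trace[session->trace_next++ & (ERR_TRACE_SLOTS - 1)];
    e->ret = ret;
    e->line = line;
    e->func = func;
}

/* Simplified stand-in for WT_RET: on non-zero, record the site and return. */
#define DEMO_RET(session, call)                                            \
    do {                                                                   \
        int demo_ret_;                                                     \
        if ((demo_ret_ = (call)) != 0) {                                   \
            err_trace_record((session), demo_ret_, __LINE__, __func__);    \
            return (demo_ret_);                                            \
        }                                                                  \
    } while (0)

/* A "leaf" that fails, and two callers that propagate the error. */
static int
leaf_fails(void)
{
    return (22);
}

static int
middle(struct demo_session *session)
{
    DEMO_RET(session, leaf_fails());
    return (0);
}

static int
top(struct demo_session *session)
{
    DEMO_RET(session, middle(session));
    return (0);
}
```

Calling `top` on a zeroed `demo_session` leaves two entries in the buffer, innermost first, so the propagation path can be read back from the oldest entry to the newest.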

Exposing this is also possible - a WT_PANIC could dump it using the verbose system, and we could expose a wt_dump_crash_diagnostics API for callers that capture segmentation faults and attempt to print their own diagnostics (e.g. MongoDB).
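As a rough illustration of the dump side, the sketch below walks a ring of recorded entries from oldest to newest and prints them. The names (`err_trace_entry`, `err_trace_dump`) and the output format are assumptions made for this example, not an existing WiredTiger interface. It uses `fprintf`, which is not async-signal-safe, so a real path called from a segmentation fault handler would need a different output mechanism.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record of one failure site. */
struct err_trace_entry {
    int ret;
    int line;
    const char *func;
};

/*
 * Print the surviving entries oldest-first and return how many were printed.
 * "slots" is the ring capacity; "total" is the number of entries ever
 * recorded, so entries older than the last "slots" have been overwritten.
 */
static size_t
err_trace_dump(FILE *out, const struct err_trace_entry *ring, size_t slots, size_t total)
{
    const struct err_trace_entry *e;
    size_t first, i, n;

    n = total < slots ? total : slots;
    first = total - n;
    for (i = 0; i < n; i++) {
        e = &ring[(first + i) % slots];
        fprintf(out, "#%zu: %s:%d returned %d\n", first + i, e->func, e->line, e->ret);
    }
    return (n);
}
```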

The most likely sticking point is performance: some non-zero return values (e.g. WT_NOTFOUND for cursors) are frequent and potentially on the hot path.
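One possible mitigation, sketched below as an assumption rather than a proposal, is to filter out return codes that are expected in normal operation before touching the buffer, so the common case costs only a comparison or two. `DEMO_NOTFOUND` is a placeholder standing in for WT_NOTFOUND, and `err_trace_should_record` is a hypothetical helper.

```c
#include <stdbool.h>

/* Placeholder standing in for WT_NOTFOUND; the value is illustrative only. */
#define DEMO_NOTFOUND (-31803)

/*
 * Decide whether a return value is worth recording. Success and the
 * frequent, expected "not found" result are skipped; everything else
 * is treated as a real failure.
 */
static inline bool
err_trace_should_record(int ret)
{
    return (ret != 0 && ret != DEMO_NOTFOUND);
}
```

Whether a filter like this is cheap enough on the hot path would still need to be measured.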

This ticket is just a discussion point for whether we want to do something like this, and a record of the decision we make + why.

is duplicated by: WT-8300 Report WT_ERR failure points, or put breadcrumbs to figure them from the core (Closed)

related to: WT-12772 Use the WT_RET_MSG macros in the block manager when returning EINVAL (Closed)

Assignee: [DO NOT USE] Backlog - Storage Engines Team
Reporter: Will Korteland
Votes: 0
Watchers: 13

Created: Mar 27 2024 03:32:27 AM UTC
Updated: Mar 25 2025 01:21:56 AM UTC
