Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7688

Identifying and handling corrupted files in WiredTiger

    • Type: Icon: Technical Debt Technical Debt
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None

      I apologize for the shotgun tag on this ticket, but I'm having problems figuring out what the plan is for corrupted files in WiredTiger.

      We have the WT_CONNECTION.WT_CONN_DATA_CORRUPTION flag, and what we mostly do is:

      1. set the WT_SESSION. WT_SESSION_QUIET_CORRUPT_FILE flag,
      2. make block manager calls,
      3. check the WT_CONNECTION.WT_CONN_DATA_CORRUPTION flag if the block manager call returns an error.

      See https://github.com/wiredtiger/wiredtiger/blob/develop/src/btree/bt_handle.c#L625-L649:

          F_SET(session, WT_SESSION_QUIET_CORRUPT_FILE);
          if ((ret = __wt_bt_read(session, &dsk, addr, addr_size)) == 0)
              ret = __wt_verify_dsk(session, tmp->data, &dsk);
          /*
           * Flag any failed read or verification: if we're in startup, it may be fatal.
           */
          if (ret != 0)
              F_SET(S2C(session), WT_CONN_DATA_CORRUPTION);
          F_CLR(session, WT_SESSION_QUIET_CORRUPT_FILE);
          if (ret != 0)
              __wt_err(session, ret, "unable to read root page from %s", session->dhandle->name);
          /*
           * Failure to open metadata means that the database is unavailable. Try to provide a helpful
           * failure message.
           */
          if (ret != 0 && WT_IS_METADATA(session->dhandle)) {
              __wt_err(session, ret, "WiredTiger has failed to open its metadata");
              __wt_err(session, ret,
                "This may be due to the database files being encrypted, being from an older version or "
                "due to corruption on disk");
              __wt_err(session, ret,
                "You should confirm that you have opened the database with the correct options including "
                "all encryption and compression options");
          }
          WT_ERR(ret);
      

      and https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn_rollback_to_stable.c#L1570-L1584:

              F_SET(session, WT_SESSION_QUIET_CORRUPT_FILE);
              ret = __rollback_to_stable_btree_apply(session, uri, config, rollback_timestamp);
              F_CLR(session, WT_SESSION_QUIET_CORRUPT_FILE);
      
              /*
               * Ignore rollback to stable failures on files that don't exist or files where corruption is
               * detected.
               */
              if (ret == ENOENT || (ret == WT_ERROR && F_ISSET(S2C(session), WT_CONN_DATA_CORRUPTION))) {
                  __wt_verbose(session, WT_VERB_RECOVERY_RTS(session),
                    "%s: skipped performing rollback to stable because the file %s", uri,
                    ret == ENOENT ? "does not exist" : "is corrupted.");
                  continue;
              }
              WT_ERR(ret);
      

      This seems fragile to me, because we require in the RTS code that ret == WT_ERROR before we check that corrupted flag, which leads to https://github.com/wiredtiger/wiredtiger/blob/develop/src/block/block_open.c#L360-L370, added in WT-5844:

          if (block->size < allocsize) {
              if (F_ISSET(session, WT_SESSION_ROLLBACK_TO_STABLE))
                  ret = ENOENT;
              else {
                  ret = WT_ERROR;
                  F_SET(S2C(session), WT_CONN_DATA_CORRUPTION);
              }
              WT_RET_MSG(session, ret,
                "File %s is smaller than allocation size; file size=%" PRId64 ", alloc size=%" PRIu32,
                block->name, block->size, allocsize);
          }
      

      I don't see any reason to think we'll be able to keep the error handling consistent across future iterations of the code, this seems like a bug waiting to happen?

      The reason I started looking at this is that RTS doesn't clear the WT_CONNECTION.WT_CONN_DATA_CORRUPTION flag, so any failure to roll a file forward due to corruption is going to leave that flag set. I don't see how that can be correct – I don't see how an application can subsequently get control of the database to run salvage.

      My naive view of this is RTS would reasonably detect corruption in a single file and report it, but wouldn't crash the system.

      Of course, I may be totally wrong on all of this, in which case I'd really appreciate someone explaining this to me!

      If I'm correct, then I think we should:

      • Add a session flag that gets set in the case of block-manager corruption, and query that flag if a block manager call returns an error. Since the block-manager will panic if WT_SESSION_QUIET_CORRUPT_FILE isn't set, there aren't that many paths that will need review.
      • At a higher level, we can set the WT_CONN_DATA_CORRUPTION flag of course, but the block manager shouldn't be doing that work.
      • Specifically, RTS would report the corruption and not set the WT_CONN_DATA_CORRUPTION flag.

      This helps us move toward gracefully handling the corruption of a specific collection in the future, I think, although it doesn't entirely solve that issue.

      Everybody tagged here has modified the handling of the WT_CONNECTION.WT_CONN_DATA_CORRUPTION flag in some chunk of code (well, except Alex, I'm copying him for an architectural view). If you don't want to be part of this conversation, please just flag that in the ticket: ravi.giri, donald.anderson, sue.loverso, alex.cameron, haribabu.kommi, alexander.gorrod.

      alex.cameron, I don't understand the requirements of WT-5844: is there a MongoDB requirement that WT_ERROR be returned as a result of specific errors?

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            keith.bostic@mongodb.com Keith Bostic (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated: