  1. WiredTiger
  2. WT-6778

Checkpoint may write partially resolved prepared updates to disk

    • v5.3

              /* Ignore prepared updates if this is a checkpoint. */
              if (upd->prepare_state == WT_PREPARE_LOCKED ||
                upd->prepare_state == WT_PREPARE_INPROGRESS) {
                  WT_ASSERT(session, upd_select->upd == NULL || upd_select->upd->txnid == upd->txnid);
                  if (F_ISSET(r, WT_REC_CHECKPOINT)) {
                      has_newer_updates = true;
                      if (upd->start_ts > max_ts)
                          max_ts = upd->start_ts;
      
                      /*
                       * Track the oldest update not on the page, used to decide whether reads can use the
                       * page image, hence using the start rather than the durable timestamp.
                       */
                      if (upd->start_ts < r->min_skipped_ts)
                          r->min_skipped_ts = upd->start_ts;
                      continue;
                  } else {
                      /*
                       * For prepared updates written to the data store in salvage, we write the same
                       * prepared value to the data store. If there is still content for that key left in
                       * the history store, rollback to stable will bring it back to the data store.
                       * Otherwise, it removes the key.
                       */
                      WT_ASSERT(session,
                        F_ISSET(r, WT_REC_EVICT) ||
                          (F_ISSET(r, WT_REC_VISIBILITY_ERR) &&
                            F_ISSET(upd, WT_UPDATE_PREPARE_RESTORED_FROM_DS)));
                      WT_ASSERT(session, upd->prepare_state == WT_PREPARE_INPROGRESS);
                  }
      

      With the current implementation, checkpoint may see partially resolved prepared updates on the same key and write them to disk.

      The detailed scenario is as follows:

      Suppose the update chain is U_prepared2@10 -> U_prepared1@10.

      Checkpoint starts

      We commit the prepared transaction and resolve U_prepared2 to U_committed@11_durable@12.

      A context switch happens before U_prepared1 is resolved, leaving U_committed@11_durable@12 -> U_prepared1@10 on the update chain.

      Checkpoint comes to the page, sees U_committed@11_durable@12, and decides to write it to the disk image.

      Checkpoint then sees U_prepared1@10 and sets has_newer_updates to true, but never unsets the update already selected for the disk image (U_committed@11_durable@12).

      In this case, we write U_committed@11_durable@12 to the data store and U_prepared1@10 to the history store, which is wrong.
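
      The scenario can be reproduced with a small standalone model of the selection loop. This is only a sketch: the struct layouts, the PREPARE_* names, select_for_checkpoint and its reset_on_prepared flag are hypothetical simplifications, not WiredTiger's actual code, and the reset shown is one possible mitigation rather than the fix that was applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for WiredTiger's update structures. */
enum prepare_state { PREPARE_INIT, PREPARE_INPROGRESS, PREPARE_LOCKED, PREPARE_RESOLVED };

struct update {
    uint64_t txnid;
    uint64_t start_ts;
    enum prepare_state prepare_state;
    struct update *next; /* Next older update on the chain. */
};

struct select_result {
    const struct update *upd; /* Update chosen for the disk image. */
    bool has_newer_updates;
};

/*
 * Walk the chain from newest to oldest, as a checkpoint would. With reset_on_prepared set to
 * false this mirrors the reported behavior: a prepared update is skipped, but an update selected
 * earlier in the walk stays selected. With reset_on_prepared set to true (an assumed mitigation),
 * a selection made from the same transaction is dropped.
 */
static struct select_result
select_for_checkpoint(const struct update *head, bool reset_on_prepared)
{
    struct select_result r = {NULL, false};
    const struct update *upd;

    for (upd = head; upd != NULL; upd = upd->next) {
        if (upd->prepare_state == PREPARE_LOCKED || upd->prepare_state == PREPARE_INPROGRESS) {
            /* Same invariant as the assertion in the quoted code. */
            assert(r.upd == NULL || r.upd->txnid == upd->txnid);
            r.has_newer_updates = true;
            if (reset_on_prepared && r.upd != NULL && r.upd->txnid == upd->txnid)
                r.upd = NULL;
            continue;
        }
        /* The first update that is not prepared is selected for the disk image. */
        if (r.upd == NULL)
            r.upd = upd;
    }
    return r;
}
```

      Building the chain U_committed@11 -> U_prepared1@10 with both updates sharing one transaction ID and calling select_for_checkpoint with reset_on_prepared set to false returns the committed update together with has_newer_updates set to true, which is the inconsistent state described above. With the flag set to true, nothing from the partially resolved transaction is selected.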

            Assignee:
            keith.bostic@mongodb.com Keith Bostic (Inactive)
            Reporter:
            chenhao.qu@mongodb.com Chenhao Qu