Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7452

Improve logging when recovery (and RTS) is taking a long time

    • Type: Icon: Improvement Improvement
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • WT10.0.1, 4.4.7, 5.0.0-rc0, 5.1.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • 5
    • Storage - Ra 2021-05-31
    • v4.4

      This ticket is the WiredTiger portion for SERVER-56222. Please look at SERVER ticket for more details and the context. Here is an extract from the SERVER ticket for this ticket completion:

      WT-7442 addresses an issue in rollback to stable, which makes the rollback/recovery process much longer than previous releases. WiredTiger also doesn't generate any logs to keep the user aware of what is going on. From the server ticket:

      To illustrate, it's been observed that a mongod spends multiple hours recovering after having a message similar to the following printed in the log with no other clear signs of progress:

      {"t":{"$date":"2021-04-21T00:51:14.703+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:703153][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 62628 through 62628"}}
      {"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759611][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1618965463, 1)"}}
      {"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759674][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1618965458, 1)"}}
      

      The next message in the log - the timestamp show how large the gap in the logging is:

      {"t":{"$date":"2021-04-21T03:17:08.587+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618975028:587005][5978:0x7f67cef96bc0], file:index-989625-1893086824266225355.wt, WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] Checkpoint has been running for 843 seconds and wrote: 5000 pages (3 MB)"}}
      

      Though WiredTiger does have some verbose logging during the recovery, we need to re-evaluate if we can do better, especially during rollback to stable. Though WT-7442 is going to address the time taken to rollback when there are a lot of dhandles, this ticket should go over the code and add more logging to keep the user informed. Also, we should make an effort in identifying other portions of recovery where we can improve logging verbose messages.

      SERVER-56222 is marked blocked on this ticket. Once the WiredTiger ticket is resolved re-evaluate if any work is required for the server ticket. At the moment we do not think there is any server work required.

            Assignee:
            luke.pearson@mongodb.com Luke Pearson
            Reporter:
            sulabh.mahajan@mongodb.com Sulabh Mahajan
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: