  1. Core Server
  2. SERVER-86504

Better observability for operations which exceed yielding and interrupt deadlines

    • Query Execution
    • QO 2024-03-04, QO 2024-03-18, QO 2024-04-01, QO 2024-04-15, QO 2024-04-29, QO 2024-05-13, QO 2024-05-27, QO 2024-06-10, QO 2024-06-24

      We have a loose requirement in the server that long-running operations should yield every 10ms (this is configurable for the query subsystem), but nothing in our cooperative scheduling implementation enforces this contract. As we begin to explore improvements in this space, it would be useful to see which types of operations currently acquire tickets and then fail to yield within a reasonable time (or at all). I'm imagining a few improvements:

      • Collect aggregated metrics for the number of queries that hold tickets longer than the yielding threshold (default: 10ms). This is not only useful for triage, but could also be integrated into a node health statistic for admission policies.
      • As a generalization of the above, it would be valuable to keep a simple histogram of how long each ticket is held.
      • In addition to the above, it would be valuable to keep a simple histogram of the number of ticket acquisitions per query.
      • Mark queries (perhaps using the slowms machinery? using query shape?) as being delinquent in ticket retention. This would help server engineers identify pathological cases where tickets are not being released, by observing properties of the source query (or the query itself).
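
      To make the proposed metrics concrete, here is a rough illustrative sketch in Python (not server code). Everything in it is an assumption for illustration: the class name TicketHoldStats, the record_query method, the bucket boundaries, and the use of a query-shape string as the key for delinquency marking. It only shows how the four ideas above could hang together, given a list of per-ticket hold durations for each finished query.

```python
import bisect
from collections import Counter
from dataclasses import dataclass, field

# Default yielding threshold mentioned in this ticket (milliseconds).
YIELD_THRESHOLD_MS = 10.0

# Hypothetical bucket upper bounds (ms) for the hold-duration histogram.
DURATION_BUCKETS_MS = [1, 5, 10, 50, 100, 1000]


@dataclass
class TicketHoldStats:
    """Hypothetical aggregate of ticket-hold observations; not a server API."""

    threshold_ms: float = YIELD_THRESHOLD_MS
    # Number of queries that held at least one ticket past the threshold.
    delinquent_queries: int = 0
    # One counter per bucket, plus a final overflow bucket.
    hold_duration_hist: list = field(
        default_factory=lambda: [0] * (len(DURATION_BUCKETS_MS) + 1)
    )
    # Maps "number of ticket acquisitions" -> "number of queries".
    acquisitions_per_query: Counter = field(default_factory=Counter)
    # Maps query shape -> number of delinquent executions of that shape.
    delinquent_shapes: Counter = field(default_factory=Counter)

    def record_query(self, query_shape, hold_durations_ms):
        """Record one finished query, given how long it held each ticket."""
        for duration in hold_durations_ms:
            # bisect_left places a hold of exactly 10ms in the "<= 10ms" bucket.
            bucket = bisect.bisect_left(DURATION_BUCKETS_MS, duration)
            self.hold_duration_hist[bucket] += 1
        self.acquisitions_per_query[len(hold_durations_ms)] += 1
        if any(d > self.threshold_ms for d in hold_durations_ms):
            self.delinquent_queries += 1
            self.delinquent_shapes[query_shape] += 1
```

      In this sketch a query counts as delinquent if any single hold exceeds the threshold. Whether delinquency should instead be based on total hold time, or on the fraction of holds over the threshold, is an open design choice that the ticket does not settle.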

            Assignee:
            kevin.cherkauer@mongodb.com Kevin Cherkauer
            Reporter:
            matt.broadstone@mongodb.com Matt Broadstone
            Votes:
            0
            Watchers:
            29
