-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
Query Execution
-
QO 2024-03-04, QO 2024-03-18, QO 2024-04-01, QO 2024-04-15, QO 2024-04-29, QO 2024-05-13, QO 2024-05-27, QO 2024-06-10, QO 2024-06-24
We have a loose requirement in the server that long-running operations should yield every 10ms (this is configurable for the query subsystem). However, nothing in our cooperative scheduling implementation enforces this contract. As we begin to ideate on improvements in this space, it would be useful to see which types of operations currently acquire tickets and then fail to yield in a reasonable time (or at all). I'm imagining a few improvements:
- Collect aggregated metrics for the number of queries that hold tickets longer than the yielding threshold (default: 10ms). This is not only useful for triage, but could also be integrated into a node health statistic for admission policies.
- As a generalization of the above, it would be valuable to keep a simple histogram of how long each ticket is held.
- In addition to the above, it would be valuable to keep a simple histogram of the number of ticket acquisitions per query.
- Mark queries (perhaps using the slowms machinery, or query shape?) as delinquent in ticket retention. This would help server engineers identify pathological cases where tickets are not being released, by observing properties of the source query (or the query itself).
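
To make the proposals above concrete, here is a rough sketch of the bookkeeping they imply. It is written in Python purely for illustration and is not how the server tracks tickets. The class name `TicketHoldStats`, the bucket boundaries, and the `record_hold` hook are all hypothetical. Only the 10ms default threshold comes from the description above.

```python
import bisect
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds; not taken from the server.
HOLD_BUCKETS_MS = [1, 5, 10, 50, 100, 1000]


class TicketHoldStats:
    """Aggregates how long tickets are held and how often queries acquire them."""

    def __init__(self, yield_threshold_ms=10.0):
        self.yield_threshold_ms = yield_threshold_ms
        # One counter per bucket, plus an overflow bucket at the end.
        self.hold_histogram = [0] * (len(HOLD_BUCKETS_MS) + 1)
        self.delinquent_holds = 0
        self.delinquent_queries = set()
        self.acquisitions_per_query = Counter()

    def record_hold(self, query_id, held_ms):
        """Record one acquire/release cycle for the given query."""
        # bisect_left places a hold of exactly 10ms in the "<= 10ms" bucket.
        bucket = bisect.bisect_left(HOLD_BUCKETS_MS, held_ms)
        self.hold_histogram[bucket] += 1
        self.acquisitions_per_query[query_id] += 1
        if held_ms > self.yield_threshold_ms:
            # The ticket was held past the yielding threshold.
            self.delinquent_holds += 1
            self.delinquent_queries.add(query_id)

    def acquisition_histogram(self):
        """Map 'acquisitions per query' to the number of queries with that count."""
        return Counter(self.acquisitions_per_query.values())


# Example: q2 holds a ticket for 250ms and is flagged as delinquent.
stats = TicketHoldStats()
for query_id, held_ms in [("q1", 2.0), ("q1", 8.5), ("q2", 250.0), ("q3", 10.0)]:
    stats.record_hold(query_id, held_ms)
```

In this sketch a hold of exactly 10ms is not counted as delinquent. Whether the real threshold should be inclusive, and whether delinquency should be tracked per query or per query shape, are open questions for the actual design.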
- related to
  - SERVER-87365 Execution control doesn't ramp up fast enough for slow yielding queries (Backlog)
  - SERVER-72258 Audit and add missing checkForInterrupt to SBE stages (Closed)
  - SERVER-86164 Create a test that catches operations that aren't interruptible for significant periods of time (Backlog)