-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Service Arch
-
SP Prioritized List
During our investigation of SERVER-89932, we found several cases in which blocked/wait time could begin accruing before the operation timers had been started. The root cause is that we start the operation timers manually at varying points in an operation's lifetime, which is inconsistent and error-prone. If the timers were started automatically somewhere in the centralized initialization path for an operation (e.g. OperationContext or CurOp construction, or on thread attach), operation timing would be much more reliable and consistent. It would also be easier to reason about the timing and about derived metrics such as workingMillis.
We chose not to fix this while working on SERVER-89932 because it likely requires non-trivial refactoring, mainly due to link cycles. Some of the approaches we tried:
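To make the failure mode concrete, here is a minimal sketch contrasting a manually started timer with one started at construction. All names (`FakeTickSource`, `ManualTimedOp`, `AutoTimedOp`) are hypothetical stand-ins, not the server's classes, and the "working = elapsed - blocked" formula is a simplifying assumption used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical deterministic tick source; not the server's TickSource API.
struct FakeTickSource {
    std::int64_t now = 0;
    void advance(std::int64_t ticks) { now += ticks; }
};

// Timer started manually: blocked time recorded before start() is called
// is still subtracted from an elapsed time that does not include it.
class ManualTimedOp {
public:
    explicit ManualTimedOp(const FakeTickSource& ts) : _ts(ts) {}

    void start() {
        _start = _ts.now;
        _started = true;
    }
    void recordBlocked(std::int64_t ticks) { _blocked += ticks; }

    std::int64_t elapsed() const { return _started ? _ts.now - _start : 0; }
    // Assumed derivation: working = elapsed - blocked. Can be skewed (or
    // even negative) when blocked time predates start().
    std::int64_t working() const { return elapsed() - _blocked; }

private:
    const FakeTickSource& _ts;
    std::int64_t _start = 0;
    std::int64_t _blocked = 0;
    bool _started = false;
};

// Timer started in the constructor: no window exists in which blocked
// time can accrue before the timer is running.
class AutoTimedOp {
public:
    explicit AutoTimedOp(const FakeTickSource& ts) : _ts(ts), _start(ts.now) {}

    void recordBlocked(std::int64_t ticks) { _blocked += ticks; }

    std::int64_t elapsed() const { return _ts.now - _start; }
    std::int64_t working() const {
        // Holds by construction as long as callers only report blocked
        // time that actually passed on the tick source.
        assert(_blocked <= elapsed());
        return elapsed() - _blocked;
    }

private:
    const FakeTickSource& _ts;
    std::int64_t _start;
    std::int64_t _blocked = 0;
};
```

If both operations block for 5 ticks before `ManualTimedOp::start()` is called and then run for 10 more ticks, the manual variant reports 10 elapsed and 5 working ticks, while the automatic variant reports 15 elapsed and 10 working ticks.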
- Using the OperationContext timer/tick source. This did not work for a number of reasons.
  - We maintain a CurOpStack for tracking suboperations, rather than a single CurOp instance per OperationContext.
  - CurOpStack decorates the OperationContext, and has a CurOp _base member which is initialized in the constructor. Decoration construction occurs before the decorated class is constructed.
- Starting the CurOp timers on CurOp construction. This could work for the plain "tick" timer, but runs into an issue with the CPU timers.
  - We maintain the OperationCPUTimers class as a decoration on the OperationContext as well, so we can't rely on it being initialized when the CurOpStack constructor runs.
  - We thought about making OperationCPUTimers a member of CurOpStack so we could reason about initialization order, but OperationCPUTimers is currently compiled into the service_context library, because its code is accessed when attaching/detaching clients to/from threads.
  - CurOp currently includes code that pulls in many higher-level database libraries that we definitely don't want to link into service_context.
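The ordering constraint described above can be sketched in plain C++. This is only an illustration under the assumption that decorations live in a base-class subobject of the decorated type; `DecorableBase`, `StackDecoration`, `TickMember`, and `OwnerContext` are hypothetical names, not the server's decoration machinery.

```cpp
#include <string>
#include <vector>

// Records the order in which pieces are constructed; illustrative only.
inline std::vector<std::string>& constructionLog() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical stand-in for a decoration such as CurOpStack.
struct StackDecoration {
    StackDecoration() { constructionLog().push_back("decoration"); }
};

// Hypothetical stand-in for the decorable base. It owns the decorations,
// so they are built as part of the base-class subobject.
class DecorableBase {
protected:
    DecorableBase() = default;

private:
    StackDecoration _decoration;
};

// Hypothetical stand-in for state owned by the decorated class itself,
// e.g. a tick source or timer.
struct TickMember {
    TickMember() { constructionLog().push_back("owner member"); }
};

// Hypothetical stand-in for the decorated class (the OperationContext role).
class OwnerContext : public DecorableBase {
public:
    OwnerContext() { constructionLog().push_back("owner body"); }

private:
    TickMember _ticks;
};
```

Constructing an `OwnerContext` logs "decoration", then "owner member", then "owner body": the decoration's constructor runs before any state of the decorated class exists, so it cannot read a timer or tick source from its owner at that point.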
- related to
-
SERVER-90939 Operation latencies are tracked inconsistently
- Open
-
SERVER-89932 Minimize _blockedTimeAtStart in curop
- Closed
-
SERVER-89961 Complete TODO listed in SERVER-87201
- Closed