Type: New Feature
Resolution: Done
Priority: Major - P3
Fix Version/s: 3.3.9
Affects Version/s: None
Component/s: Diagnostics, Performance
Labels:
- neweng
- performance

Backwards Compatibility: Fully Compatible
Sprint: Integrate+Tuning 16 (06/24/16), Integration 17 (07/15/16)
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

We should have instrumentation to characterize the overall workload and response time of a mongod or mongos server. A histogram whose buckets have log base 2 microsecond resolution would be a nice start. Here's a straw man proposal:

1) For every request from a client, log the time it was received using the least expensive high resolution method. On Windows, this would be QueryPerformanceCounter().
2) When the response is complete, compute the elapsed time in microseconds. On Windows, this would be another call to QueryPerformanceCounter() and division by a precomputed conversion factor.
3) Add 1 to the bucket associated with this time interval. Bucket 0 gets all times below 1 microsecond, bucket 1 gets times from 1 microsecond up to but not including 2 microseconds, bucket 2 gets times from 2 to 4 microseconds, then 4 to 8, etc., so bucket k covers 2^(k-1) to 2^k microseconds. With 32 buckets, the first 31 (buckets 0 through 30) would cover times up to 2^30 microseconds, about 1074 seconds, and anything taking longer than that would go in the last bucket, so 32 buckets would cover the time periods we are most interested in.
4) Every 10 seconds, add the histogram to a "since started" histogram, write it to a capped collection sized for one week of data, save a snapshot copy and then zero it.
5) Provide $cmd commands to fetch the most recent snapshot and the "since started" histogram.
6) Give MMS the ability to show the most recent snapshot and the "since started" histogram.
7) For extra credit, MMS could show a contour plot or some other 3D display of response time history, showing the changing shape of the curve.
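
Steps 1 through 4 could be sketched roughly as follows. This is only an illustration in Python, not the server implementation: `time.perf_counter_ns()` stands in for QueryPerformanceCounter(), the names `bucket_index`, `LatencyHistogram` and `timed` are hypothetical, and the write to a capped collection and the 10 second timer are left out.

```python
import time

NUM_BUCKETS = 32  # bucket count taken from the proposal


def bucket_index(elapsed_us: int) -> int:
    """Map an elapsed time in whole microseconds to a log base 2 bucket.

    Bucket 0 holds times below 1 us, bucket k (k >= 1) holds
    [2**(k-1), 2**k) us, and the last bucket absorbs anything longer.
    """
    if elapsed_us < 1:
        return 0
    # For n >= 1, n.bit_length() equals floor(log2(n)) + 1.
    return min(elapsed_us.bit_length(), NUM_BUCKETS - 1)


class LatencyHistogram:
    """Hypothetical holder for the current, snapshot and since-started counts."""

    def __init__(self):
        self.current = [0] * NUM_BUCKETS
        self.snapshot = [0] * NUM_BUCKETS
        self.since_started = [0] * NUM_BUCKETS

    def record(self, elapsed_us: int) -> None:
        # Step 3: add 1 to the bucket for this interval.
        self.current[bucket_index(elapsed_us)] += 1

    def rotate(self) -> list:
        # Step 4 (minus persistence): fold the current histogram into the
        # since-started totals, keep it as the snapshot, then start from zero.
        for i, count in enumerate(self.current):
            self.since_started[i] += count
        self.snapshot = self.current
        self.current = [0] * NUM_BUCKETS
        return self.snapshot


def timed(hist: LatencyHistogram, handler, *args):
    # Steps 1 and 2: timestamp on receipt, compute elapsed microseconds
    # once the handler has produced its response.
    start = time.perf_counter_ns()
    try:
        return handler(*args)
    finally:
        hist.record((time.perf_counter_ns() - start) // 1000)
```

In this sketch `rotate()` would be called by whatever fires every 10 seconds, and the commands in step 5 would simply return `snapshot` and `since_started`.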

Once the baseline functionality is working, we could consider doing this by database, by collection, by request type or by some other criterion. These would be additional instances of the same feature.
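
One way to picture "additional instances of the same feature" is a table of histograms keyed by whatever criterion is chosen. The sketch below assumes a key of (namespace, request type); the key shape, the `record` helper and the sample namespace `test.users` are made up for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 32  # same bucket count as the base proposal


def bucket_index(elapsed_us: int) -> int:
    # Bucket 0: below 1 us; bucket k: [2**(k-1), 2**k) us; last bucket: overflow.
    if elapsed_us < 1:
        return 0
    return min(elapsed_us.bit_length(), NUM_BUCKETS - 1)


# One histogram per (namespace, request type) key, created on first use.
histograms = defaultdict(lambda: [0] * NUM_BUCKETS)


def record(namespace: str, op: str, elapsed_us: int) -> None:
    histograms[(namespace, op)][bucket_index(elapsed_us)] += 1


# Hypothetical sample data.
record("test.users", "query", 3)
record("test.users", "query", 900)
record("test.users", "insert", 40)
```

Each key gets its own fixed-size list of counters, so the cost per extra database, collection or request type is one more small array.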

There are a lot of things that we could learn by having this information:
1) If a query was slow at one time but not at another, was there a difference in the number of requests it was competing with in the two cases?
2) Is a workload doing mostly very fast stuff with a little slow stuff, or is everything slow?
3) Does a change to something in the system change the mix of response times?
4) Do response times follow a recognizable pattern, like a bell curve with a visible center, or a skew towards fast responses, or a curve with multiple peaks?
5) Is anything really fast, or is the minimum response time in the millisecond and above range?
6) Do we have periods with little visible activity followed by periods when many slow requests complete?
7) Does the addition of a new application, or a new shard, or a new mongos change the response time pattern?

The better we can characterize workloads and our response to them, the better we can diagnose problems and propose solutions. All to the good.

is duplicated by

SERVER-7774 Add jstest for db.adminCommand('top')

Closed

related to

SERVER-5828 Metric/Stats Tracking

Closed

Assignee: Kevin Albertson
Reporter: Tad Marshall
Participants: Githook User, Kevin Albertson, Tad Marshall
Votes: 5
Watchers: 16

Created: May 23 2012 03:08:16 AM UTC
Updated: Mar 22 2017 03:27:48 PM UTC
Resolved: Jun 24 2016 09:53:59 PM UTC
Confidence Status Last Update: 13/Jun/16 8:25 PM
