-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
None
-
Python Drivers
-
Not Needed
-
Copied from PR #1758
Hello,
This fixes this long standing issue that's been opened since 2017
https://jira.mongodb.org/browse/PYTHON-1271
https://jira.mongodb.org/browse/PYTHON-3616
https://jira.mongodb.org/browse/PYTHON-4298
We've been getting regular failures from the mongo client in CI, since updating to a more recent python and mongo client.
image
Symptom is the client crashing with round_trip_time of None.
This is a very long standing issue as we've found bug tickets going back to 2017 regarding this problem.
Oddly enough, we did not see that frequently on older python version and older client, although the bug is present. I suspect there's a subtle side effect from python 3.11+ being noticeably faster, which makes it a lot more common to get sub millisecond measurements.
root cause:
After many days of debugging, it turns out the issue is pretty simple: time.monotonic() is in fact not monotonic!
Basic code to measure the round trip time start = time.monotonic() ... end = time.monotonic() doesn't work as expected, it can give a negative duration once in a while.
The build nodes run NTP. I think NTP is adjusting the time and causing the issue. As per linux documentation CLOCK_MONOTONIC is subject to time adjustments by NTP and adjtime(). https://man7.org/linux/man-pages/man2/clock_gettime.2.html
Related. The code to average round trip times had a condition to silently ignore negative durations sample < 0, which contributed to making the issue extremely difficult to debug. I've corrected the check to ensure that negative durations cannot be provided.
There may be other places that are using time.monotonic() incorrectly. I've corrected the ones that crashed my production systems and the ones I could find by searching the code base. There are more usage to measure timeout or deadline which I think are not affected it the clock goes backward by a millisecond.
Cheers.
- related to
-
PYTHON-1271 Fail to read when server topology is not ready.
- Closed
-
PYTHON-4298 Raise informative error when server_description.round_trip_time is None in server selection
- Closed
-
PYTHON-3616 Use minimum RTT for CSOT maxTimeMS calculation instead of 90th percentile
- Closed