Uploaded image for project: 'Python Driver'
  1. Python Driver
  2. PYTHON-4600

bugfix: handle round trip time being negative because time.monotonic() is not monotonic #1758

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 4.9
    • Affects Version/s: None
    • Component/s: None
    • None
    • Python Drivers
    • Not Needed
    • Hide

      1. What would you like to communicate to the user about this feature?
      2. Would you like the user to see examples of the syntax and/or executable code and its output?
      3. Which versions of the driver/connector does this apply to?

      Show
      1. What would you like to communicate to the user about this feature? 2. Would you like the user to see examples of the syntax and/or executable code and its output? 3. Which versions of the driver/connector does this apply to?

      Copied from PR #1758

      Hello,

      This fixes this long standing issue that's been opened since 2017
      https://jira.mongodb.org/browse/PYTHON-1271
      https://jira.mongodb.org/browse/PYTHON-3616
      https://jira.mongodb.org/browse/PYTHON-4298

      We've been getting regular failures from the mongo client in CI, since updating to a more recent python and mongo client.
      image

      Symptom is the client crashing with round_trip_time of None.
      This is a very long standing issue as we've found bug tickets going back to 2017 regarding this problem.

      Oddly enough, we did not see that frequently on older python version and older client, although the bug is present. I suspect there's a subtle side effect from python 3.11+ being noticeably faster, which makes it a lot more common to get sub millisecond measurements.

      root cause:
      After many days of debugging, it turns out the issue is pretty simple: time.monotonic() is in fact not monotonic!
      Basic code to measure the round trip time start = time.monotonic() ... end = time.monotonic() doesn't work as expected, it can give a negative duration once in a while.

      The build nodes run NTP. I think NTP is adjusting the time and causing the issue. As per linux documentation CLOCK_MONOTONIC is subject to time adjustments by NTP and adjtime(). https://man7.org/linux/man-pages/man2/clock_gettime.2.html

      Related. The code to average round trip times had a condition to silently ignore negative durations sample < 0, which contributed to making the issue extremely difficult to debug. I've corrected the check to ensure that negative durations cannot be provided.

      There may be other places that are using time.monotonic() incorrectly. I've corrected the ones that crashed my production systems and the ones I could find by searching the code base. There are more usage to measure timeout or deadline which I think are not affected it the clock goes backward by a millisecond.

      Cheers.

            Assignee:
            shane.harvey@mongodb.com Shane Harvey
            Reporter:
            keana.moutra@mongodb.com KeAna Moutra
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: