Python Driver / PYTHON-2584

Investigate if the default localThresholdMS=15 is the culprit behind various flaky tests

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 4.0
    • Affects Version/s: None
    • Component/s: Tests
    • Labels: None

      While working on PYTHON-2486 I ran into an issue with the default localThresholdMS, which I described to the team in Slack:
      I’m seeing some strange behavior while benchmarking the last piece of the "avoid connection storms" work and finally figured it out. The benchmark has a client connected to a 3-member replica set and performs a number of find operations with secondary read preference. After the test is done, my app gets the Pool for each secondary to report how many total connections were created (using topology.select_servers(secondary_server_selector)). The issue is that sometimes this would return only one secondary instead of two.

      So I added SDAM loggers to see why the server was being marked unknown, and nothing… The secondary state was always known.

      Finally I remembered localThresholdMS, which defaults to 15 ms, and looked at the RTTs (reported in seconds):

      Cluster: <TopologyDescription id: 60272e17b19196ca490135df, topology_type: ReplicaSetWithPrimary, servers: [
      <ServerDescription ('localhost', 27017) server_type: RSSecondary, rtt: 0.09419754099999977>,
      <ServerDescription ('localhost', 27018) server_type: RSSecondary, rtt: 0.05396648700000384>,
      <ServerDescription ('localhost', 27019) server_type: RSPrimary, rtt: 0.06699502099999677>]>
      

      Sure enough, one secondary’s RTT (about 94ms) is well outside the 15ms latency window above the fastest secondary (about 54ms), so it is excluded from server selection. But the question now is: why are the RTTs so high?! I assume it’s because the benchmark runs a ton of threads, which delays the Monitor thread from running in a timely manner. So the Monitor thread thinks it took 100ms to get a response from the server, but the real RTT is more like 0.5ms.
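
The window check described above can be reproduced with a few lines of plain Python. This is only a sketch of the latency-window rule (keep servers whose RTT is within localThresholdMS of the fastest eligible server), not PyMongo's actual selection code. The function name `within_latency_window` is hypothetical; the RTT values are copied from the TopologyDescription dump.

```python
# RTTs in seconds for the two secondaries, copied from the dump above.
SECONDARY_RTTS = {
    ("localhost", 27017): 0.09419754099999977,
    ("localhost", 27018): 0.05396648700000384,
}


def within_latency_window(rtts, local_threshold_ms=15):
    """Return addresses whose RTT is within local_threshold_ms of the fastest.

    Hypothetical helper for illustration; it mirrors the rule, not the
    driver's implementation.
    """
    if not rtts:
        return []
    fastest = min(rtts.values())
    cutoff = fastest + local_threshold_ms / 1000.0  # ms -> seconds
    return sorted(addr for addr, rtt in rtts.items() if rtt <= cutoff)


# Default 15 ms window: only the 54 ms secondary survives.
print(within_latency_window(SECONDARY_RTTS))
# A wider 50 ms window keeps both secondaries.
print(within_latency_window(SECONDARY_RTTS, local_threshold_ms=50))
```

With the default, the cutoff is roughly 54ms + 15ms = 69ms, so the 94ms secondary drops out, which matches the single-secondary result seen in the benchmark.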

      So I have a few takeaways here:

      • I need to override localThresholdMS to prevent this from impacting the benchmark results.
      • This could explain a lot of the flaky tests we see in Evergreen.
      • Is there any way to improve this situation? Maybe 15ms is too low for a default localThresholdMS given that, under load, the RTT measurement can vary widely.
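
For the first takeaway, one way to override the setting is through the localThresholdMS connection string option. The sketch below only builds the URI with the standard library. The replica set name `rs0`, the 1000ms value, and the helper name `build_uri` are assumptions for illustration, not values from the benchmark.

```python
from urllib.parse import urlencode


def build_uri(hosts, replica_set, local_threshold_ms):
    """Build a MongoDB URI that overrides localThresholdMS (hypothetical helper)."""
    query = urlencode(
        {"replicaSet": replica_set, "localThresholdMS": local_threshold_ms}
    )
    return "mongodb://" + ",".join(hosts) + "/?" + query


# Assumed values: replica set name "rs0" and a generous 1000 ms window.
BENCHMARK_URI = build_uri(
    ["localhost:27017", "localhost:27018", "localhost:27019"],
    replica_set="rs0",
    local_threshold_ms=1000,
)
print(BENCHMARK_URI)

# With PyMongo installed, the URI can be passed to the client, e.g.:
#   from pymongo import MongoClient
#   client = MongoClient(BENCHMARK_URI, readPreference="secondary")
```

A window this wide effectively disables latency-based filtering for a local replica set, so inflated RTT measurements should no longer hide a secondary from the benchmark.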

      Some of the test failures this might explain are:

      • PYTHON-2534 test_pool_paused_error_is_retryable
      • PYTHON-2526 test_server_selection_in_window.TestProse.test_load_balancing

            Assignee: Shane Harvey (shane.harvey@mongodb.com)
            Reporter: Shane Harvey (shane.harvey@mongodb.com)
            Votes: 0
            Watchers: 1
