DRIVERS-2246

Heartbeat build up with streaming protocol when driver process is stopped (FAAS)

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Component/s: FaaS, SDAM
    • None
    • Not Needed
      Key            Status/Resolution  FixVersion
      CDRIVER-4492   Won't Do
      CXX-2593       Won't Do
      CSHARP-4352    Won't Do
      GODRIVER-2577  Fixed              1.12.0, 1.9.4, 1.10.5, 1.11.1, 1.12.0-alpha1
      JAVA-4760      Won't Do
      NODE-4695      Won't Do
      MOTOR-1043     Won't Do
      PYTHON-3463    Won't Do
      PHPLIB-1005    Won't Do
      RUBY-3151      Won't Do
      RUST-1500      Won't Do
      SWIFT-1649     Won't Do

      Summary

      The SDAM Monitoring spec defines the streamable hello protocol as a way of having the server send a hello response as soon as there is a change, or once maxAwaitTimeMS elapses with no change. In FAAS (functions as a service) environments process execution is frozen between invocations, so the driver cannot consume the hello responses that the server keeps sending every maxAwaitTimeMS. When the FAAS environment wakes up, the driver must process every heartbeat that is waiting on the socket to be read. This causes performance delays compared to typical environments.

      An associated bug that the Node.js driver encountered specifically was that the FAAS environment allows timers to keep running until expiration between invocations while execution stays frozen. Once the FAAS environment woke up, the Node.js driver processed socket timeout errors before reading from the socket. This order of operations is inherent to the Node.js environment, because timers always come first in the event loop, but it potentially indicates an additional issue with streaming in environments where the time at which a timeout fires cannot be considered reliable.

      We were able to solve this issue in Node.js by ensuring that timeout errors are handled only after the runtime has been allowed to read from the socket. If the read succeeds, the erroneous timeout error is cleared; otherwise the timeout error is handled as normal. This required ordering could be worth encoding in the spec as part of fixing this related issue.
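
      The ordering described above can be modelled with a small sketch. This is an illustration only, not the Node.js driver's actual code: the class DeferredSocketTimeout, its methods, and the wakeUp helper are hypothetical names, and the event loop is simulated with plain synchronous calls.

```typescript
// Hypothetical model: a fired timeout is recorded, not acted on, until the
// runtime has had a chance to read from the socket.
class DeferredSocketTimeout {
  private pendingTimeout = false;

  // The socket timer expired (possibly while the process was frozen).
  onTimerFired(): void {
    this.pendingTimeout = true;
  }

  // A read produced data, so the pending timeout was erroneous; clear it.
  onDataRead(): void {
    this.pendingTimeout = false;
  }

  // Called after the read attempt. Returns true if a timeout error
  // should now be handled as normal.
  shouldRaiseAfterReadAttempt(): boolean {
    const raise = this.pendingTimeout;
    this.pendingTimeout = false;
    return raise;
  }
}

// Simulates one wake-up: the timer callback runs first, then the runtime
// reads whatever hello responses were buffered on the socket.
function wakeUp(bufferedResponses: number): boolean {
  const timeout = new DeferredSocketTimeout();
  timeout.onTimerFired();
  for (let i = 0; i < bufferedResponses; i++) {
    timeout.onDataRead();
  }
  return timeout.shouldRaiseAfterReadAttempt();
}
```

      In this model, wakeUp with buffered responses reports that no timeout error should be raised, while wakeUp with an empty socket reports a genuine timeout.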

      Motivation

      Who is the affected end user?

      FAAS users.

      How does this affect the end user?

      Performance degradation, or an out-of-date TopologyDescription.

      How likely is it that this problem or use case will occur?

      Main path. The bug is not a blocker, but it will occur consistently on every invocation. The wider the gap between invocations relative to the heartbeatFrequencyMS setting, the larger the number of heartbeats that need to be processed.
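
      As a rough illustration of that relationship, the sketch below estimates how many hello responses could be waiting on one monitoring socket after a freeze. It assumes no topology changes occur, that the server answers once per heartbeatFrequencyMS (10 seconds by default), and it ignores the cap imposed by TCP flow control. The function name is hypothetical.

```typescript
// Hypothetical estimate of hello responses queued on a single monitoring
// socket while the process was frozen. Assumes one response per heartbeat
// interval and no TCP buffer limit.
function estimateBufferedHeartbeats(
  frozenMs: number,
  heartbeatFrequencyMS: number = 10000
): number {
  if (frozenMs < 0 || heartbeatFrequencyMS <= 0) {
    throw new RangeError("frozenMs must be >= 0 and heartbeatFrequencyMS must be > 0");
  }
  return Math.floor(frozenMs / heartbeatFrequencyMS);
}
```

      Under these assumptions a 10 minute gap (600000 ms) at the default setting corresponds to about 60 queued responses per monitored server.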

      If the problem does occur, what are the consequences and how severe are they?

      FAAS environments are usually designed around charging per execution, factoring in CPU time and memory usage. Because heartbeats routinely pile up on the socket, processing them adds to both of these metrics.
      The issue is mitigated by the limits imposed by TCP flow control (eventually the send and receive buffers fill up), but it can still result in thousands of hello responses needing to be processed.

      Is this issue urgent?

      I think investigating a solution has "Major" (from JIRA) priority. There have been some proposals to add a knob that forces the driver into polling mode, but that comes with its own downsides (an out-of-date TopologyDescription). The priority of implementing the chosen solution can be decided on a per-driver basis.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      No.

            Assignee: Unassigned
            Reporter: Neal Beeken (neal.beeken@mongodb.com)
            Votes: 0
            Watchers: 16