Summary
The SDAM Monitoring spec defines the streamable hello protocol as a way of having the server send a hello response as soon as there is a change, or once maxAwaitTimeMS elapses without one. In FAAS (functions as a service) environments process execution is frozen between invocations, so the driver cannot consume the hello responses that the server keeps sending every maxAwaitTimeMS. When the FAAS environment wakes up, the driver must process every heartbeat that is waiting on the socket to be read. This causes performance delays compared to typical environments.
An associated bug that the Node.js driver encountered specifically was that the FAAS environment allows timers to continue until expiration between invocations while keeping execution frozen. Once the FAAS environment wakes up, the Node.js driver processed socket timeout errors prior to reading from the socket. This order of operations is inherent to the Node.js environment (timers always come first in the event loop), but it potentially indicates an additional issue with streaming in environments where the timeout execution time cannot be considered reliable.
We were able to solve this issue in Node.js by deferring the handling of timeout errors until after the runtime has been allowed to read from the socket. If the read succeeds, we clear the erroneous timeout error; otherwise the timeout error is handled as normal. This required ordering could be worth encoding in the spec as part of fixing this related issue.
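The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the Node.js driver's actual code: the class `DeferredTimeoutGuard`, its methods, and the injected `scheduleAfterRead` callback are all hypothetical names. The sketch assumes the scheduler runs its task only after the runtime has had a chance to read from the socket; in Node.js, `setImmediate` is one candidate, since its callbacks run after the poll (I/O) phase of the event loop.

```typescript
// Hypothetical sketch: none of these names come from the Node.js driver.
type Task = () => void;

class DeferredTimeoutGuard {
  private timeoutPending = false;
  private readonly scheduleAfterRead: (task: Task) => void;
  private readonly onTimeoutError: (error: Error) => void;

  constructor(
    // Assumed to run the task only after a socket read has been attempted.
    scheduleAfterRead: (task: Task) => void,
    onTimeoutError: (error: Error) => void
  ) {
    this.scheduleAfterRead = scheduleAfterRead;
    this.onTimeoutError = onTimeoutError;
  }

  // Call when the socket timer fires. The error is recorded, not raised yet.
  onSocketTimeout(): void {
    if (this.timeoutPending) return;
    this.timeoutPending = true;
    this.scheduleAfterRead(() => {
      if (!this.timeoutPending) return; // a read succeeded, so the timeout was stale
      this.timeoutPending = false;
      this.onTimeoutError(new Error('socket timed out'));
    });
  }

  // Call when data is read from the socket. Clears any pending timeout.
  onSocketData(): void {
    this.timeoutPending = false;
  }
}

// Demo with a manual queue standing in for the event loop.
const queue: Task[] = [];
const errors: Error[] = [];
const guard = new DeferredTimeoutGuard(
  (task) => { queue.push(task); },
  (error) => { errors.push(error); }
);

const drain = (): void => {
  while (queue.length > 0) {
    const task = queue.shift();
    if (task) task();
  }
};

guard.onSocketTimeout(); // the timer fires first after the thaw
guard.onSocketData();    // a buffered hello response is then read
drain();
const errorsAfterStaleTimeout = errors.length;

guard.onSocketTimeout(); // a genuine timeout: no data arrives
drain();
const errorsAfterRealTimeout = errors.length;
```

Injecting the scheduler keeps the ordering rule independent of any one runtime, which may matter if the rule is encoded in the spec rather than in a single driver.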
Motivation
Who is the affected end user?
FAAS users.
How does this affect the end user?
Performance concerns, or out of date TopologyDescription.
How likely is it that this problem or use case will occur?
Main path. The bug is not a blocker, but it will occur consistently on every invocation. The wider the gap between invocations relative to the heartbeatFrequencyMS setting, the larger the number of heartbeats that need to be processed.
If the problem does occur, what are the consequences and how severe are they?
FAAS environments are usually designed around charging per execution, factoring in CPU time and memory usage. Because heartbeats commonly pile up on the socket, processing them has an impact on these metrics.
The issue is mitigated by the limits imposed by TCP flow control (eventually the send and receive buffers fill up), but it can still result in thousands of hello responses needing to be processed.
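As a rough back-of-the-envelope illustration of how the backlog grows and where flow control caps it, consider the sketch below. The function name and parameters are hypothetical, and the buffer capacity and response size are made-up inputs rather than measured values. It assumes the server sends one hello response per heartbeatFrequencyMS while the process is frozen.

```typescript
// Hypothetical estimate; buffer and response sizes are illustrative assumptions.
function estimateQueuedHeartbeats(
  frozenMs: number,             // time the process spent frozen
  heartbeatFrequencyMS: number, // interval between streamed hello responses
  bufferCapacityBytes: number,  // assumed combined TCP send/receive buffer capacity
  helloResponseBytes: number    // assumed size of one hello response
): number {
  // Responses the server would have sent while the process was frozen.
  const sentWhileFrozen = Math.floor(frozenMs / heartbeatFrequencyMS);
  // Flow control stops the pile-up once the buffers are full.
  const bufferLimit = Math.floor(bufferCapacityBytes / helloResponseBytes);
  return Math.min(sentWhileFrozen, bufferLimit);
}

// One hour frozen at the default 10 second heartbeat: bounded by elapsed time.
const afterOneHour = estimateQueuedHeartbeats(3_600_000, 10_000, 1_048_576, 512);

// One day frozen: bounded by the assumed buffer capacity instead.
const afterOneDay = estimateQueuedHeartbeats(86_400_000, 10_000, 1_048_576, 512);
```

With these assumed inputs the one hour case yields 360 queued responses, and the one day case is capped at 2048 by the buffer limit rather than growing to 8640.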
Is this issue urgent?
I think investigating a solution has "Major" (from JIRA) priority. There have been some proposals to add a knob that forces the driver into polling mode, but that comes with its own downsides (out of date TopologyDescription). The priority of implementing the chosen solution can be considered on a per driver basis.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
No.
- duplicates
  - DRIVERS-2578 Switch to polling monitoring when running within a FaaS environment (Implementing)
- is related to
  - PYTHON-3186 AWS Lambda/FaaS pause and resume behavior causes SDAM heartbeats to timeout (Closed)
  - NODE-4783 find() query stucks when primary switches back after stepDown() period is finished (Closed)
- related to
  - NODE-3810 AWS Lambda: MongoDB heartbeat failure. (Closed)
  - DRIVERS-1598 Solve for serverless/lambda connection pool issues (Development Complete)
- split to
  - CDRIVER-4492 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - CSHARP-4352 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - CXX-2593 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - GODRIVER-2577 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - MOTOR-1043 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - NODE-4695 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - PHPLIB-1005 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - PYTHON-3463 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - RUBY-3151 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - RUST-1500 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)
  - JAVA-4760 Heartbeat build up with streaming protocol when driver process is stopped (FAAS) (Closed)