Type: Improvement
Resolution: Done
Priority: Critical - P2
Affects Version/s: 1.5.2
Component/s: Connections
Problem
We found that the driver unnecessarily closes connections and clears the connection pool under high load.
This occurs when the semaphore wait time to acquire a connection approaches the context timeout. If a connection is acquired with little or no context deadline left, the connection is closed, because any use of it results in a timeout. After the connection is closed, other goroutines attempt to open a connection with a similarly short deadline; when the new connection fails to establish, the entire pool is cleared (the generation is incremented). This vicious cycle repeats and increases both error rates and cluster CPU usage (spent serving the new connections).
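The contention described above can be illustrated without the driver. The following is a minimal, stdlib-only simulation, not the driver's pool implementation: a buffered channel stands in for the pool semaphore, and `simulate`, `summarize`, `runDemo`, and all durations are hypothetical names and values chosen for illustration. It counts operations that time out while waiting and operations that acquire a slot with little deadline left.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// result records what one simulated operation observed at checkout.
type result struct {
	wait      time.Duration // time spent waiting on the semaphore
	remaining time.Duration // deadline left once a slot was acquired
	timedOut  bool          // true if the context expired while waiting
}

// simulate runs `workers` concurrent operations against a pool with
// `poolSize` slots. Each operation has its own `timeout` and holds a
// slot for `hold`, which stands in for network I/O.
func simulate(workers, poolSize int, timeout, hold time.Duration) []result {
	sem := make(chan struct{}, poolSize) // hypothetical stand-in for the pool semaphore
	results := make([]result, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), timeout)
			defer cancel()
			start := time.Now()
			select {
			case sem <- struct{}{}: // slot acquired
				deadline, _ := ctx.Deadline()
				results[i] = result{wait: time.Since(start), remaining: time.Until(deadline)}
				time.Sleep(hold)
				<-sem // release the slot
			case <-ctx.Done(): // deadline hit while still waiting
				results[i] = result{wait: time.Since(start), timedOut: true}
			}
		}(i)
	}
	wg.Wait()
	return results
}

// summarize counts successful checkouts, checkouts that succeeded with
// less than minIO of deadline remaining, and checkout timeouts.
func summarize(rs []result, minIO time.Duration) (acquired, starved, timedOut int) {
	for _, r := range rs {
		if r.timedOut {
			timedOut++
			continue
		}
		acquired++
		if r.remaining < minIO {
			starved++
		}
	}
	return
}

// runDemo uses illustrative values: high concurrency relative to the
// pool size, and a timeout only a few times longer than the hold time.
func runDemo() (acquired, starved, timedOut int) {
	rs := simulate(6, 1, 500*time.Millisecond, 200*time.Millisecond)
	return summarize(rs, 250*time.Millisecond)
}

func main() {
	acquired, starved, timedOut := runDemo()
	fmt.Printf("acquired=%d starved=%d timedOut=%d\n", acquired, starved, timedOut)
}
```

The `wait` field is the same quantity the first proposal below asks the driver to publish as a metric. In the real driver, a "starved" checkout is the case that leads to the connection being closed.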
Proposed Solution
- Publish metrics for connection pool checkout duration (semaphore wait time)
- Prevent closing connections when the remaining deadline is below a threshold. This can be accomplished in one of two ways:
  - Add a client option for a minimum connection I/O duration. After acquiring a connection, if the context has a deadline and the remaining duration is below the minimum connection I/O duration, fail fast before attempting to use the connection.
  - Add a client option for a maximum connection pool checkout duration (semaphore wait duration). If the context has a deadline and the time remaining until that deadline is greater than the maximum checkout duration, call acquire with a new context whose deadline is equal to the maximum semaphore wait time.
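The two deadline checks proposed above could be sketched as follows. This is only an illustration using the standard `context` package: `checkMinIODuration`, `checkoutContext`, `ErrInsufficientDeadline`, and `demo` are hypothetical names, not existing driver APIs or options, and the durations are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrInsufficientDeadline is a hypothetical error returned when too little
// time remains on the context to do useful I/O on a checked-out connection.
var ErrInsufficientDeadline = errors.New("remaining deadline below minimum connection I/O duration")

// checkMinIODuration sketches the first option: after checkout, fail fast
// if the context has a deadline and less than minIO remains, instead of
// using (and then closing) the connection.
func checkMinIODuration(ctx context.Context, minIO time.Duration) error {
	deadline, ok := ctx.Deadline()
	if !ok {
		return nil // no deadline, nothing to check
	}
	if time.Until(deadline) < minIO {
		return ErrInsufficientDeadline
	}
	return nil
}

// checkoutContext sketches the second option: if the context's remaining
// time exceeds maxCheckout, wait on the semaphore with a derived context
// capped at maxCheckout. Otherwise the original deadline is kept. A cancel
// function is always returned so callers can defer it unconditionally.
func checkoutContext(ctx context.Context, maxCheckout time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) > maxCheckout {
		return context.WithTimeout(ctx, maxCheckout)
	}
	return context.WithCancel(ctx)
}

// demo exercises both helpers with illustrative values.
func demo() (shortErr, longErr error, capped bool) {
	short, cancelShort := context.WithTimeout(context.Background(), 5*time.Millisecond)
	defer cancelShort()
	long, cancelLong := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancelLong()

	shortErr = checkMinIODuration(short, 50*time.Millisecond)
	longErr = checkMinIODuration(long, 50*time.Millisecond)

	acquireCtx, cancelAcquire := checkoutContext(long, 100*time.Millisecond)
	defer cancelAcquire()
	d, _ := acquireCtx.Deadline()
	capped = time.Until(d) <= 100*time.Millisecond
	return
}

func main() {
	shortErr, longErr, capped := demo()
	fmt.Println("short deadline:", shortErr)
	fmt.Println("long deadline:", longErr)
	fmt.Println("checkout wait capped:", capped)
}
```

With either check in place, an operation that has run out of time returns an error to the caller without the connection being closed, so the pool is not pushed toward the clear-and-reconnect cycle.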
Example error pattern:
time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason=timeout type=ConnectionCheckOutFailedSemaphore
time="2021-05-24T14:29:24-07:00" level=info msg=mongo_pool_event activity=true connection_id=0 reason="ProcessHandshakeError: connection() error occured during connection handshake: context deadline exceeded" type=ConnectionPoolCleared
One way to reproduce this problem locally is to run a script with high concurrency, a low timeout, and a small maximum connection pool size.
Is related to:
- GODRIVER-2037 Don't clear the connection pool on Context timeout during handshake (Closed)
- GODRIVER-2038 Use "ConnectionTimeout" for creating all new connections and background connection creation (Closed)