- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: 6.5.0, 6.6.2
- Component/s: Connection Layer
Our issue has persisted across several versions, and we have not been able to upgrade beyond version 5.9.2 of the driver. I believe it is the same as NODE-6166, which was closed without the cause being found.
Both our `find` and `bulkWrite` operations show the same behavior. We only see the issue in production, where it occurs in roughly 1 in 3,000 of our updates; each update involves 4 or 5 `bulkWrite` operations. We are under moderately heavy load, running about 4,000 updates per second across all nodes, or about 40/s on each node.
Seemingly at random, those operations never return, yet there is no active connection to the db server. I have inspected the driver internals, and a connection from the connection pool is in use for each of the updates that hang. I believe the problem must be in reading the response from the socket, but the driver code has changed so much since 5.9.2 that it is very difficult to tell what the cause could be.
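For anyone trying to confirm the same symptom (a pooled connection that stays in use while the operation never returns), here is a minimal sketch of how long-held checkouts could be tracked. `CheckoutTracker` and its method names are hypothetical and not part of the driver; the timestamps are passed in explicitly so the logic can be checked without a live cluster.

```typescript
// Hypothetical helper (not part of the driver): records when each pooled
// connection was checked out so that checkouts held longer than a
// threshold can be listed.
class CheckoutTracker {
  // connection id -> time of checkout in milliseconds
  private readonly checkedOutAt = new Map<number | string, number>();

  // Call when a connection leaves the pool.
  onCheckedOut(connectionId: number | string, nowMs: number = Date.now()): void {
    this.checkedOutAt.set(connectionId, nowMs);
  }

  // Call when a connection is returned to the pool.
  onCheckedIn(connectionId: number | string): void {
    this.checkedOutAt.delete(connectionId);
  }

  // Ids of connections that have been checked out for more than thresholdMs.
  heldLongerThan(thresholdMs: number, nowMs: number = Date.now()): Array<number | string> {
    const stale: Array<number | string> = [];
    this.checkedOutAt.forEach((since, id) => {
      if (nowMs - since > thresholdMs) {
        stale.push(id);
      }
    });
    return stale;
  }
}
```

One way to feed it, assuming the driver's connection pool monitoring events `connectionCheckedOut` and `connectionCheckedIn` on the client and a `connectionId` field on their payloads, would be to call `onCheckedOut` and `onCheckedIn` from those listeners and log `heldLongerThan(...)` on an interval. This only narrows down which connections are stuck; it does not explain why the response is never read.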
We are connected to Atlas with a 3-replica cluster. Monitoring shows no unusual activity on the server. Although the issue is random and rare relative to the number of updates, it is easy to reproduce in production when making many updates; I have not been able to reproduce it locally with simulated load testing.
The server is running v5.
- related to
- NODE-6166 Write Ops Are Persisted But Sometimes Driver Function Does Not Return (Closed)