- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: SDAM
-
None
For a client pool, the background topology scanner runs a complete scan of all servers after heartbeatFrequencyMS has passed (or sooner, if a scan is requested).
The background scan uses mongoc_topology_scan_once. This fans out "ismaster" commands and waits for all responses before another scan can be scheduled.
The main problem is that a slow server can block the next scheduled scan of all other servers. The timeout for an "ismaster" during scanning is connectTimeoutMS, which may exceed heartbeatFrequencyMS. The following scenario can easily happen:
1. Scan requested
2. "ismaster" is sent to servers X and Y
3. X responds quickly, but Y hangs for connectTimeoutMS.
4. The "ismaster" to Y times out.
5. The background thread sees that more than heartbeatFrequencyMS has passed and starts a new complete scan.
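The timing of the scenario above can be sketched with a small model. This is not libmongoc code: scan_duration_ms and next_scan_start_ms are hypothetical helpers, and the model assumes heartbeatFrequencyMS is measured from the start of the previous scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libmongoc. Given each server's "ismaster"
 * response time, return how long a fan-out scan takes when it waits for all
 * responses. A server that hangs is cut off at connect_timeout_ms. */
int64_t
scan_duration_ms (const int64_t *response_ms, size_t n, int64_t connect_timeout_ms)
{
   int64_t slowest = 0;
   size_t i;

   assert (n > 0);
   for (i = 0; i < n; i++) {
      int64_t t = response_ms[i] < connect_timeout_ms ? response_ms[i]
                                                      : connect_timeout_ms;
      if (t > slowest) {
         slowest = t;
      }
   }
   return slowest;
}

/* Hypothetical helper. Earliest start of the next complete scan, relative to
 * the start of the previous one: the previous scan must have finished AND
 * heartbeat_frequency_ms must have elapsed (assumed measured from scan start). */
int64_t
next_scan_start_ms (int64_t scan_ms, int64_t heartbeat_frequency_ms)
{
   return scan_ms > heartbeat_frequency_ms ? scan_ms : heartbeat_frequency_ms;
}
```

With X answering in 5 ms, Y hanging, connectTimeoutMS=20000 and heartbeatFrequencyMS=1000, the model gives a scan duration of 20000 ms, so X is not checked again until 20000 ms after the first scan started, rather than after 1000 ms.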
I've reproduced this behavior by modifying example-sdam-monitoring.c. The modified example overrides the stream initializer to simulate a slow connection to one server.
https://gist.github.com/kevinAlbs/1eb3fd42a2b17d71f99e4d9389661069
Running it against a two node replica set shows the behavior:
$ example-sdam-monitoring "mongodb://localhost:27017,localhost:27018/?connectTimeoutMS=20000&heartbeatFrequencyMS=1000"
...
2020/04/15 13:37:54.0521: [78700]: DEBUG: mongoc: localhost:27017 heartbeat started
2020/04/15 13:37:54.0524: [78700]: DEBUG: mongoc: localhost:27018 heartbeat started
...
2020/04/15 13:38:14.0633: [78700]: DEBUG: mongoc: localhost:27018 heartbeat failed: socket timeout calling ismaster on 'localhost:27018'
2020/04/15 13:38:15.0137: [78700]: DEBUG: mongoc: localhost:27017 heartbeat started
The second heartbeat to localhost:27017 is blocked by the 20-second connection timeout on localhost:27018, even though heartbeatFrequencyMS is 1000.
This behavior is unavoidable for single-threaded scans, but should not be the case for multi-threaded scans. Servers should be scanned at their own intervals (which also better aligns with the server monitoring spec).
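To illustrate what scanning each server at its own interval would mean, here is a minimal model. It is not libmongoc code: heartbeats_started is a hypothetical helper, and it assumes that a per-server monitor runs a check, then waits heartbeatFrequencyMS before starting the next check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model, not part of libmongoc. Count the heartbeats started to
 * one server during [0, window_ms) when that server has its own monitor.
 * Assumes each check takes check_ms and the monitor then waits
 * heartbeat_frequency_ms before the next check. */
int64_t
heartbeats_started (int64_t check_ms,
                    int64_t heartbeat_frequency_ms,
                    int64_t window_ms)
{
   int64_t period;

   assert (check_ms >= 0 && heartbeat_frequency_ms > 0 && window_ms > 0);
   period = check_ms + heartbeat_frequency_ms;
   /* Checks start at 0, period, 2 * period, ...; count those before
    * window_ms, which is ceil (window_ms / period). */
   return (window_ms + period - 1) / period;
}
```

Under these assumptions, with heartbeatFrequencyMS=1000, a server that answers in 5 ms gets 20 heartbeats started in the 20000 ms during which the slow server is still waiting on its single timed-out check. In the blocking scan it gets only one.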
- is depended on by
  - CDRIVER-3678 /Topology/request_scan_on_error failing (Closed)
  - CDRIVER-3535 Reduce Client Time To Recovery On Topology Changes (Closed)
- related to
  - CDRIVER-3701 Calling topology TRACE macro with no formatted args emits compiler warning (Closed)
  - CDRIVER-3682 Follow-up to thread-per-server monitoring (Backlog)
  - CDRIVER-3722 Update documentation for multi-threaded scanning behavior (Closed)