- Type: Improvement
- Resolution: Unresolved
- Priority: Unknown
- Component/s: SDAM
Summary
Three SDAM spec tests assert that the server monitor correctly handles check (a.k.a. heartbeat) errors after the initial handshake: Command error on Monitor check, Network error on Monitor check, and Network timeout on Monitor check.
Each of those tests registers a failpoint for "hello" that causes a specific type of error. The failpoint uses "times: 2" because both the server monitor and the RTT monitor send "hello" operations to the server at roughly the same interval, so either monitor may trigger the failpoint. However, the RTT monitor sometimes runs twice and triggers the failpoint both times before the server monitor runs again. The failpoint is then exhausted, the server monitor heartbeat never triggers it, and the test fails.
That can happen because the server monitor and the RTT monitor run concurrently and use different timing mechanisms when awaitable "hello" is available. A server monitor using awaitable "hello" depends partially on server-side timing via "maxAwaitTimeMS", while the RTT monitor timing is strictly driver-side. As a result, the RTT monitor can run more than once before an in-progress awaitable "hello" returns and the server monitor starts a new "hello" that would trigger the failpoint.
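The race can be illustrated with a small deterministic model. This is only a sketch under simplifying assumptions: each monitor sends "hello" at a fixed interval, the failpoint fails the first N "hello" commands it sees regardless of sender, and all timings (5 ms, 500 ms, 515 ms, 520 ms) are hypothetical values chosen to show the ordering, not measurements from any driver.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    first_hello_ms: float  # time of the first "hello" after the failpoint is set
    interval_ms: float     # time between consecutive "hello" commands


def failpoint_hits(monitors, times, horizon_ms):
    """Return how many failpoint triggers each monitor consumed.

    The failpoint fails the first `times` "hello" commands it sees,
    regardless of which monitor sent them.
    """
    # Build the (send time, monitor name) events up to the horizon.
    events = []
    for m in monitors:
        t = m.first_hello_ms
        while t <= horizon_ms:
            events.append((t, m.name))
            t += m.interval_ms
    events.sort()

    # Only the earliest `times` commands hit the failpoint.
    hits = {m.name: 0 for m in monitors}
    for _, name in events[:times]:
        hits[name] += 1
    return hits


# Hypothetical timings: the RTT monitor fires 5 ms after the failpoint is set
# and then every 500 ms (driver-side timer). The server monitor started an
# awaitable "hello" just before the failpoint was set, so its next "hello"
# starts only after the server-side wait plus some assumed overhead.
rtt = Monitor("rtt", first_hello_ms=5, interval_ms=500)
server = Monitor("server", first_hello_ms=515, interval_ms=520)

print(failpoint_hits([rtt, server], times=2, horizon_ms=5000))
print(failpoint_hits([rtt, server], times=4, horizon_ms=5000))
```

With these assumed timings, "times: 2" gives `{'rtt': 2, 'server': 0}`, so the server monitor never sees an error, while a budget of 4 gives `{'rtt': 3, 'server': 1}`.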
We can significantly reduce the probability of intermittent failures by increasing the number of times the failpoint can be triggered. If we do that, we also need to remove the assertion that exactly 1 "ServerMarkedUnknownEvent" and exactly 1 "PoolClearedEvent" are fired (already done in the "Network error on Monitor check" spec test), because the server monitor would then be more likely to trigger the failpoint more than once.
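As a rough illustration of the proposed change, the failpoint document might be built as below. The helper name, the "appName" value, the error code, and the new count of 4 are all hypothetical placeholders; the ticket does not specify the new value, and the real spec tests define their own.

```python
def hello_fail_point(times, error_code, app_name):
    """Build a "failCommand" failpoint document that fails "hello" commands.

    Hypothetical helper for illustration; not part of any driver or spec.
    """
    return {
        "configureFailPoint": "failCommand",
        # Number of matching commands to fail before the failpoint turns off.
        "mode": {"times": times},
        "data": {
            "failCommands": ["hello"],
            "errorCode": error_code,
            # Restricts the failpoint to connections from this application.
            "appName": app_name,
        },
    }


# Current behaviour described in the ticket: two triggers.
before = hello_fail_point(times=2, error_code=1234, app_name="exampleMonitorTest")

# Proposed direction: a larger budget (4 is an arbitrary illustrative value).
after = hello_fail_point(times=4, error_code=1234, app_name="exampleMonitorTest")

print(before["mode"], after["mode"])
```

Only the "mode" field differs between the two documents, so the change stays local to the failpoint setup plus the relaxed event-count assertions.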
Motivation
Who is the affected end user?
DBX devs.
How does this affect the end user?
The SDAM Command error on Monitor check, Network error on Monitor check, or Network timeout on Monitor check spec tests fail intermittently.
How likely is it that this problem or use case will occur?
The failure is caused by a race between the server monitor heartbeat loop and the RTT monitor loop. Depending on runtime conditions, the RTT monitor loop may run twice before the server monitor heartbeat loop runs once after the failpoint is registered. The observed failure rate in the Go Driver is around 5-10% if run individually, or around 1-2% if run with the rest of the test suite.
If the problem does occur, what are the consequences and how severe are they?
Pull request or waterfall Evergreen CI test runs may fail intermittently, leading to "false positive" test failures that create confusion, take time to troubleshoot, and possibly hide actual errors that are misinterpreted as errors due to a flaky test.
Is this issue urgent?
No.
Is this ticket required by a downstream team?
No.
Is this ticket only for tests?
Yes.
is related to
- GODRIVER-2464 Add timeout for RTT monitor "hello" operations (Closed)
split to
- CDRIVER-4426 Improve reliability of SDAM heartbeat error spec tests. (Backlog)
- CXX-2544 Improve reliability of SDAM heartbeat error spec tests. (Backlog)
- NODE-4504 Improve reliability of SDAM heartbeat error spec tests. Fix Runner socket leak issue (Backlog)
- CSHARP-4252 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- GODRIVER-2490 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- JAVA-4677 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- MOTOR-993 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- NODE-4414 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- PHPLIB-910 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- PYTHON-3353 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- RUBY-3050 Improve reliability of SDAM heartbeat error spec tests. (Closed)
- RUST-1407 Improve reliability of SDAM heartbeat error spec tests. (Closed)