
Type: Improvement
Resolution: Unresolved
Priority: Unknown
Fix Version/s: None
Component/s: SDAM
Labels: None

Driver Changes: Needed
Downstream Changes Summary:

Sync SDAM integration spec tests at revision 98e20daa.


Driver Compliance:

Key	Status/Resolution	FixVersion
CDRIVER-4426	Backlog
CXX-2544	Backlog
CSHARP-4252	Done	2.18.0
GODRIVER-2490	Done
JAVA-4677	Fixed	4.7.0
NODE-4414	Fixed	4.9.0
MOTOR-993	Duplicate
PYTHON-3353	Fixed	4.2
PHPLIB-910	Won't Do
RUBY-3050	Fixed	2.18.1
RUST-1407	Duplicate
SWIFT-1601	Duplicate
NODE-4504	Backlog


Summary

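There are 3 SDAM spec tests that assert that the server monitor correctly handles check (i.e. heartbeat) errors after the initial handshake: "Command error on Monitor check", "Network error on Monitor check", and "Network timeout on Monitor check".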

Each of those tests registers a failpoint for "hello" that causes a specific type of error. The failpoint uses "times: 2" because both the server monitor and RTT monitor send "hello" operations to the server at roughly the same interval, so it's expected that both monitors may trigger a failpoint. However, sometimes the RTT monitor runs twice and triggers the failpoint 2 times before the server monitor runs again, leading to a test failure because the server monitor heartbeat never triggers a failpoint.
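For illustration, the kind of failpoint described above can be sketched as a plain document. This is a minimal sketch, not the spec test definition: it assumes the server's "failCommand" failpoint, the helper name build_hello_failpoint is hypothetical, and the error code 1234 is a placeholder value chosen for the example.

```python
def build_hello_failpoint(times, error_code=None, close_connection=False):
    """Build a "failCommand" failpoint document that targets "hello".

    `times` is the number of matching commands that will fail before the
    failpoint turns itself off. Every "hello" counts against that budget,
    whether it comes from the server monitor or from the RTT monitor.
    """
    data = {"failCommands": ["hello"]}
    if error_code is not None:
        # Command error variant: the server replies with this error code.
        data["errorCode"] = error_code
    if close_connection:
        # Network error variant: the server closes the connection instead.
        data["closeConnection"] = True
    return {
        "configureFailPoint": "failCommand",
        "mode": {"times": times},
        "data": data,
    }


# Placeholder error code, assumed for this example only.
failpoint = build_hello_failpoint(times=2, error_code=1234)
```

With "times: 2", the budget is exhausted after any two "hello" commands, regardless of which monitor sent them.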

That can happen because the server monitor and RTT monitor run concurrently and use different timing mechanisms when awaitable "hello" is available. A server monitor using awaitable "hello" depends partially on server-side timing via "maxAwaitTimeMS" (see description here), while the RTT monitor timing is strictly driver-side (see description here). As a result, the RTT monitor can run more than once before the in-progress awaitable "hello" returns and the server monitor starts a new "hello" that would trigger the failpoint.
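The timing argument above can be sketched with a simplified model. The function name and all millisecond values below are hypothetical, and the model assumes the RTT monitor fires at a fixed driver-side interval while the server monitor is blocked in one awaitable "hello" until a given time.

```python
def rtt_hellos_before_monitor(failpoint_at_ms, monitor_returns_at_ms,
                              rtt_first_at_ms, rtt_interval_ms):
    """Count RTT "hello" commands that hit the failpoint first.

    Counts RTT checks sent at or after the failpoint is registered and
    before the server monitor's in-progress awaitable "hello" returns,
    which is the earliest point at which the server monitor can send a
    new "hello".
    """
    count = 0
    t = rtt_first_at_ms
    while t < monitor_returns_at_ms:
        if t >= failpoint_at_ms:
            count += 1
        t += rtt_interval_ms
    return count


# Hypothetical timings: the awaitable "hello" started just before the
# failpoint was set and returns 510 ms later; the RTT monitor fires every
# 500 ms.
unlucky = rtt_hellos_before_monitor(0, 510, rtt_first_at_ms=1, rtt_interval_ms=500)
lucky = rtt_hellos_before_monitor(0, 510, rtt_first_at_ms=200, rtt_interval_ms=500)
```

In the first case the RTT monitor sends 2 "hello" commands before the server monitor can send one, which uses up a "times: 2" failpoint. In the second case it sends only 1, so the server monitor still triggers the failpoint.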

We can significantly reduce the probability of intermittent failures by increasing the number of times the failpoint can be triggered. If we do that, we also need to remove the assertion that exactly 1 "ServerMarkedUnknownEvent" and "PoolClearedEvent" events are fired (already done in the "Network error on Monitor check" spec test) because the server monitor would have a higher probability of triggering more than 1 failpoint.
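The trade-off can be sketched by counting how many server monitor checks fail for a given order of "hello" commands and a given failpoint budget. The function name, the orderings, and the value "times: 4" are illustrative assumptions, not values taken from the spec tests.

```python
def monitor_failures(hello_order, times):
    """Count server monitor "hello" commands that trigger the failpoint.

    `hello_order` lists who sent each "hello" after the failpoint was
    registered ("rtt" or "monitor"). Each command uses up one of the
    `times` slots until the budget is exhausted.
    """
    remaining = times
    failures = 0
    for sender in hello_order:
        if remaining == 0:
            break
        remaining -= 1
        if sender == "monitor":
            failures += 1
    return failures


# Hypothetical ordering in which the RTT monitor runs twice first.
order = ["rtt", "rtt", "monitor", "monitor"]
with_two = monitor_failures(order, times=2)
with_four = monitor_failures(order, times=4)
```

In this ordering, a budget of 2 leaves the server monitor with 0 failed checks, so the test fails. A budget of 4 gives it 2 failed checks, and if each failed check fires its own events, an assertion on exactly 1 event would no longer hold.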

Motivation

Who is the affected end user?

DBX devs.

How does this affect the end user?

The SDAM Command error on Monitor check, Network error on Monitor check, or Network timeout on Monitor check spec tests fail intermittently.

How likely is it that this problem or use case will occur?

The failure is caused by a race between the server monitor heartbeat loop and the RTT monitor loop. Depending on runtime conditions, the RTT monitor loop may run twice before the server monitor heartbeat loop runs once after the failpoint is registered. The observed failure rate in the Go Driver is around 5-10% if run individually, or around 1-2% if run with the rest of the test suite.

If the problem does occur, what are the consequences and how severe are they?

Pull request or waterfall Evergreen CI test runs may fail intermittently, leading to "false positive" test failures that create confusion, take time to troubleshoot, and possibly hide actual errors that are misinterpreted as errors due to a flaky test.

Is this issue urgent?

No.

Is this ticket required by a downstream team?

No.

Is this ticket only for tests?

Yes.

is related to

GODRIVER-2464 Add timeout for RTT monitor "hello" operations

Closed

split to

CDRIVER-4426 Improve reliability of SDAM heartbeat error spec tests.

Backlog

CXX-2544 Improve reliability of SDAM heartbeat error spec tests.

Backlog

NODE-4504 Improve reliability of SDAM heartbeat error spec tests. Fix Runner socket leak issue

Backlog

CSHARP-4252 Improve reliability of SDAM heartbeat error spec tests.

Closed

GODRIVER-2490 Improve reliability of SDAM heartbeat error spec tests.

Closed

JAVA-4677 Improve reliability of SDAM heartbeat error spec tests.

Closed

MOTOR-993 Improve reliability of SDAM heartbeat error spec tests.

Closed

NODE-4414 Improve reliability of SDAM heartbeat error spec tests.

Closed

PHPLIB-910 Improve reliability of SDAM heartbeat error spec tests.

Closed

PYTHON-3353 Improve reliability of SDAM heartbeat error spec tests.

Closed

RUBY-3050 Improve reliability of SDAM heartbeat error spec tests.

Closed

RUST-1407 Improve reliability of SDAM heartbeat error spec tests.

Closed

