Type: Bug
Resolution: Fixed
Priority: Unknown
Fix Version/s: 3.3.0
Affects Version/s: None
Component/s: None
Labels:
None

Quarter:
- FY26Q1
Confidence Status:
None

Assigned Teams:

Rust Drivers

Documentation Changes:
Not Needed

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Example failures from several tests. The symptom is that a test wedges and eventually times out:
test::bulk_write::failed_cursor_iteration:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ecaae5d62220007e64203_25_01_08_18_57_51/0/task?bookmarks=0,2279
test::bulk_write::successful_cursor_iteration:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ff50b8a489f000785f55e_25_01_09_16_11_03/0/task?bookmarks=0,2144,2145
test::bulk_write::write_error_batches:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_6780002e45b6e500078e07cb_25_01_09_16_58_37/0/task?bookmarks=0,2191

Debugging shows that get_connection, called for the getMore that runs inside the bulk write's handle_response_async, never returns. The op has a pinned connection, so it waits on the take_connection call in the first match arm, and that call never returns a value.

Stepping back, from a systemic perspective, the problem is that:

- Cursor operations executing on a load-balanced topology are required to be pinned.
- When a cursor operation runs as part of handle_response_async, the pinned connection is owned by the execute_operation_with_retry call higher up the stack.
- The getMore attempts to fetch the pinned connection, which blocks until the previous holder drops it.
- The previous holder is execute_operation_with_retry, which does not drop the connection; it returns it as part of the context of the completed operation.
- The getMore therefore waits on a connection that is only released after the getMore completes: deadlock.
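
The wait cycle described above can be reproduced in miniature with a channel. This is only a sketch under stated assumptions, not the driver's code: `Connection`, `PinnedHandle`, and `nested_get_more_result` are hypothetical names, the real code is async, and a timeout is used here so the stall can be observed instead of hanging forever.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical stand-in for a driver connection.
#[derive(Debug, PartialEq)]
struct Connection(u32);

/// Hypothetical pinned-connection handle: the connection is checked out by
/// one holder and only becomes available again once that holder sends it back.
struct PinnedHandle {
    returned: Receiver<Connection>,
}

impl PinnedHandle {
    /// Waits for the current holder to give the connection back.
    /// The timeout exists only so this sketch can report the stall.
    fn take_connection(&self, wait: Duration) -> Result<Connection, RecvTimeoutError> {
        self.returned.recv_timeout(wait)
    }
}

/// Models an outer operation that holds the pinned connection while a nested
/// getMore (run from response handling) tries to take the same connection.
fn nested_get_more_result(
    release_first: bool,
    wait: Duration,
) -> Result<Connection, RecvTimeoutError> {
    let (give_back, returned) = channel();
    let handle = PinnedHandle { returned };
    let held = Connection(1); // owned by the outer operation

    if release_first {
        // The holder gives the connection back before the nested operation
        // runs, so the nested take succeeds.
        give_back.send(held).expect("receiver is alive");
        return handle.take_connection(wait);
    }

    // Shape reported in this ticket: the nested operation waits for a
    // connection that the outer operation releases only after the nested
    // operation has finished.
    let nested = handle.take_connection(wait);
    give_back.send(held).expect("receiver is alive");
    nested
}
```

Calling `nested_get_more_result(false, ...)` times out, which corresponds to the wedged tests; calling it with `true` succeeds because the holder releases the connection before the nested operation asks for it.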

AFAICT this never worked for the specific combination of "bulk writes that required result iteration on load-balanced topologies", but because we were accidentally not running tests in a load-balanced configuration, we didn't notice until now.

Assignee: Isabel Atkinson

Reporter: Abraham Egnor

Votes: 0

Watchers: 2

Created: Jan 08 2025 09:28:40 PM UTC

Updated: Apr 18 2025 02:42:06 PM UTC

Resolved: Apr 18 2025 02:42:06 PM UTC

Confidence Status Last Update: 07/Apr/25 3:12 PM
