Example failures for various tests; the symptom is that a test will generally just wedge and eventually time out:
test::bulk_write::failed_cursor_iteration:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ecaae5d62220007e64203_25_01_08_18_57_51/0/task?bookmarks=0,2279
test::bulk_write::successful_cursor_iteration:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ff50b8a489f000785f55e_25_01_09_16_11_03/0/task?bookmarks=0,2144,2145
test::bulk_write::write_error_batches:
https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_6780002e45b6e500078e07cb_25_01_09_16_58_37/0/task?bookmarks=0,2191
Debugging shows that get_connection never returns for the getMore executed from the bulk write's handle_response_async: because the op has a pinned connection, it takes the first match arm and waits on the take_connection call, which never returns a value.
Stepping back, from a systemic perspective, the problem is that:
- cursor operations executing on a load-balanced topology are required to be pinned
- when executed as part of handle_response_async, the connection is owned by the execute_operation_with_retry higher up the stack
- the getMore executed will attempt to fetch the pinned connection, which blocks until the previous holder drops it
- ... but that holder is execute_operation_with_retry, which won't drop it; it'll return it as part of the context of the completed operation
- ... so deadlock.
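The cycle above can be reproduced in miniature. The following is only a sketch and not the driver's code: `Conn`, `Checked`, and `Pinned` are hypothetical stand-ins, the `take_connection` here is a simplified synchronous model built on a std channel, and a timeout is added purely so the demo terminates (the real wait has none, which is why the tests wedge).

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// Hypothetical stand-in for a real connection.
struct Conn;

/// Hypothetical guard: hands the connection back to the pin when dropped.
struct Checked {
    conn: Option<Conn>,
    give_back: Sender<Conn>,
}

impl Drop for Checked {
    fn drop(&mut self) {
        if let Some(c) = self.conn.take() {
            // Ignore the error if the pin itself is already gone.
            let _ = self.give_back.send(c);
        }
    }
}

/// Hypothetical pinned handle: the single connection is either sitting in
/// the channel or checked out by exactly one holder.
struct Pinned {
    give_back: Sender<Conn>,
    available: Receiver<Conn>,
}

impl Pinned {
    fn new(conn: Conn) -> Self {
        let (tx, rx) = channel();
        tx.send(conn).unwrap();
        Pinned { give_back: tx, available: rx }
    }

    /// Model of the blocking take: waits until the previous holder drops
    /// its guard. The timeout exists only so this sketch can finish.
    fn take_connection(&self, wait: Duration) -> Result<Checked, RecvTimeoutError> {
        let conn = self.available.recv_timeout(wait)?;
        Ok(Checked { conn: Some(conn), give_back: self.give_back.clone() })
    }
}

/// The outer operation still holds the connection while the nested
/// getMore asks for it, so the nested take can never be satisfied.
fn nested_take_times_out() -> bool {
    let pinned = Pinned::new(Conn);
    let _outer = pinned.take_connection(Duration::from_millis(50)).unwrap();
    pinned.take_connection(Duration::from_millis(50)).is_err()
}

/// If the outer holder releases first, the same take succeeds.
fn take_after_release_succeeds() -> bool {
    let pinned = Pinned::new(Conn);
    let outer = pinned.take_connection(Duration::from_millis(50)).unwrap();
    drop(outer);
    pinned.take_connection(Duration::from_millis(50)).is_ok()
}

fn main() {
    println!("nested take times out: {}", nested_take_times_out());
    println!("take after release succeeds: {}", take_after_release_succeeds());
}
```

In this model, `_outer` plays the role of execute_operation_with_retry holding the connection for the whole operation, and the second take plays the role of the getMore issued from handle_response_async.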
AFAICT this never worked for the specific combination of "bulk writes that required result iteration on load-balanced topologies", but because we were accidentally not running tests in a load-balanced configuration, we didn't notice until now.