Rust Driver / RUST-2131

Bulk write result iteration is broken on load-balanced topologies

    • Type: Bug
    • Resolution: Fixed
    • Priority: Unknown
    • 3.3.0
    • Affects Version/s: None
    • Component/s: None
    • Rust Drivers
    • Not Needed

      Example failures from various tests; the symptom is generally that a test wedges and eventually times out:
      test::bulk_write::failed_cursor_iteration:
      https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ecaae5d62220007e64203_25_01_08_18_57_51/0/task?bookmarks=0,2279
      test::bulk_write::successful_cursor_iteration:
      https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_677ff50b8a489f000785f55e_25_01_09_16_11_03/0/task?bookmarks=0,2144,2145
      test::bulk_write::write_error_batches:
      https://parsley.mongodb.com/evergreen/mongo_rust_driver_load_balancer_test_load_balancer_latest_patch_509d27ea5b4e7a85928ef86802b940b156d8ad3c_6780002e45b6e500078e07cb_25_01_09_16_58_37/0/task?bookmarks=0,2191

      Debugging shows that get_connection never returns for the getMore executed in the bulk write's handle_response_async. Because the op has a pinned connection, it waits on the take_connection call in the first match arm, and that call never yields a value.

      Stepping back, from a systemic perspective, the problem is that:

      • cursor operations executing on a load-balanced topology are required to be pinned
      • when a cursor operation is executed as part of handle_response_async, the connection is owned by the execute_operation_with_retry call higher up the stack
      • the getMore attempts to fetch the pinned connection, which blocks until the previous holder drops it
      • ... but the previous holder is execute_operation_with_retry, which won't drop it; it'll return it as part of the context of the completed operation
      • ... so deadlock.
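
      The cycle described above can be reproduced in a toy model. The sketch below is not driver code: PinnedHandle, Connection, outer_holding and outer_releasing are hypothetical names, a std channel stands in for the pinned-connection handoff, and a timeout replaces the indefinite wait so the example terminates. The "releasing" variant only shows why the ordering matters in this model; it is not a claim about how the driver was fixed.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError, Sender};
use std::time::Duration;

/// Stand-in for a driver connection (hypothetical, not the driver's type).
#[derive(Debug)]
struct Connection;

/// Toy model of a pinned-connection handle: the connection lives in a channel
/// and must be sent back before anyone else can take it.
struct PinnedHandle {
    returns: Receiver<Connection>,
    give_back: Sender<Connection>,
}

impl PinnedHandle {
    fn new(conn: Connection) -> Self {
        let (give_back, returns) = channel();
        give_back.send(conn).unwrap();
        PinnedHandle { returns, give_back }
    }

    /// Waits until the previous holder returns the connection. The timeout is
    /// an assumption of this sketch; an unbounded wait would hang forever.
    fn take_connection(&self, timeout: Duration) -> Result<Connection, RecvTimeoutError> {
        self.returns.recv_timeout(timeout)
    }

    fn return_connection(&self, conn: Connection) {
        self.give_back.send(conn).unwrap();
    }
}

/// Models the reported situation: the outer operation still owns the
/// connection while its response handler runs a nested getMore.
/// Returns whether the nested getMore obtained the connection.
fn outer_holding(handle: &PinnedHandle, timeout: Duration) -> bool {
    let conn = handle.take_connection(timeout).expect("connection available");
    // Nested getMore: asks for the pinned connection that this frame holds.
    let nested = handle.take_connection(timeout);
    // The outer frame only gives the connection back after the handler ends.
    handle.return_connection(conn);
    nested.is_ok()
}

/// Contrast case: the connection is handed back before the nested getMore.
fn outer_releasing(handle: &PinnedHandle, timeout: Duration) -> bool {
    let conn = handle.take_connection(timeout).expect("connection available");
    handle.return_connection(conn);
    match handle.take_connection(timeout) {
        Ok(conn) => {
            handle.return_connection(conn);
            true
        }
        Err(_) => false,
    }
}

fn main() {
    let timeout = Duration::from_millis(50);
    let handle = PinnedHandle::new(Connection);
    println!("holding: nested getMore got connection = {}", outer_holding(&handle, timeout));
    println!("releasing: nested getMore got connection = {}", outer_releasing(&handle, timeout));
}
```

      In the holding case the nested take_connection can only time out, which corresponds to the wedged tests; without a timeout it would wait forever.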

      AFAICT this never worked for the specific combination of "bulk writes that require result iteration on load-balanced topologies", but because we were accidentally not running tests in a load-balanced configuration, we didn't notice until now.

            Assignee:
            isabel.atkinson@mongodb.com Isabel Atkinson
            Reporter:
            abraham.egnor@mongodb.com Abraham Egnor
            Votes:
            0
            Watchers:
            2
