Summary
We observed a deadlock due to a recursive lock on a non-recursive mutex inside the mongoc-topology.c. This stuck the whole application. In the application we try to insert documents into a mongodb. The problem was discovered in an automated test intermediately start/stopping the mongodb.Backtrace further below.
Environment
mongo-c-driver 1.23.4
Build from source with:
-DENABLE_AUTOMATIC_INIT_AND_CLEANUP=OFF -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_TESTS=Off
mongo-cxx-driver 3.7.1
Build from source with:
-DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/usr/local -DCMAKE_CXX_STANDARD=17 -DENABLE_TESTS=Off ..
The test system running our application is docker container based on alpine 3.18. The host running the docker containers is a server with a x86_64 CPU running a Ubuntu 22.04.
The standalone mongodb was started as a separate docker container on the same host.
How to Reproduce
Try to insert documents into a collection of the mongodb in a loop using the mongo-cxx colletion.insert_one method. In case of an exception we sleep for some time and retry to insert the document again.
During execution of this tests we manually stopped/restarted an independent docker container with the standalone mongodb instance used by the test. Connection between application and host was performed by an docker internal bridge network.
It's important to use a linux distribution based on the libmusl (like the alpine distribution) instead of the glibc. The libmusl pthread_mutex_lock always does deadlock on a default initialized mutex in contrast to other implementations.
Additional Background
This was the backtrace we were able to extract from the stuck process:
(gdb) bt #0 __syscall_cp_c (nr=202, u=139930834176668, v=128, w=-2147483632, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61 #1 0x00007f443082f702 in __futex4_cp (to=0x0, val=-2147483632, op=128, addr=0x7f442faa0a9c) at src/thread/__timedwait.c:24 #2 __timedwait_cp (addr=addr@entry=0x7f442faa0a9c, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52 #3 0x00007f443082f7a7 in __timedwait (addr=addr@entry=0x7f442faa0a9c, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:68 #4 0x00007f44308319ea in __pthread_mutex_timedlock (m=0x7f442faa0a98, at=0x0) at src/thread/pthread_mutex_timedlock.c:85 #5 0x00007f442fd6fabe in mc_tpld_modify_begin (tpl=0x7f442faa09d0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:2014 #6 0x00007f442fd6c4d5 in _mongoc_topology_scanner_setup_err_cb (id=1, data=0x7f442faa09d0, error=0x7f442faaca80) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:133 #7 0x00007f442fd7754e in mongoc_topology_scanner_node_setup ( node=0x7f442faac800, error=0x7f442faaca80) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:1135 --Type <RET> for more, q to quit, c to continue without paging-- #8 0x00007f442fd762a3 in mongoc_topology_scanner_scan (ts=0x7f442fa9c080, id=1) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:566 #9 0x00007f442fd6c62f in _mongoc_topology_scanner_cb (id=1, hello_response=0x0, rtt_msec=0, data=0x7f442faa09d0, error=0x7f442f43a0c0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:192 #10 0x00007f442fd76b2a in _async_error_or_timeout (acmd=0x7f442f43a080, duration_usec=11, default_err_msg=0x7f442fdac138 "connection error") at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:806 #11 0x00007f442fd76c59 in _async_handler (acmd=0x7f442f43a080, async_status=MONGOC_ASYNC_CMD_ERROR, hello_response=0x0, duration_usec=11) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scanner.c:845 #12 0x00007f442fd0848f in mongoc_async_cmd_run (acmd=0x7f442f43a080) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-async-cmd.c:129 #13 0x00007f442fd0810b in mongoc_async_run (async=0x7f442faba9e0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-async.c:160 #14 0x00007f442fd7784a in mongoc_topology_scanner_work (ts=0x7f442fa9c080) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology-scann--Type <RET> for more, q to quit, c to continue without paging-- er.c:1306 #15 0x00007f442fd6da53 in mongoc_topology_scan_once (topology=0x7f442faa09d0, obey_cooldown=true) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:906 #16 0x00007f442fd6daa4 in _mongoc_topology_do_blocking_scan ( topology=0x7f442faa09d0, error=0x7f442f80b9a0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:931 #17 0x00007f442fd6e40d in mongoc_topology_select_server_id ( topology=0x7f442faa09d0, optype=MONGOC_SS_WRITE, read_prefs=0x0, must_use_primary=0x7f442f80bc43, error=0x7f442f80bef0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:1234 #18 0x00007f442fd1d3ce in _mongoc_cluster_select_server_id (cs=0x0, topology=0x7f442faa09d0, optype=MONGOC_SS_WRITE, read_prefs=0x0, must_use_primary=0x7f442f80bc43, error=0x7f442f80bef0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2757 #19 0x00007f442fd1d4b3 in _mongoc_cluster_stream_for_optype ( cluster=0x7f442faa0bd8, optype=MONGOC_SS_WRITE, read_prefs=0x0, cs=0x0, reply=0x7f442f80bd80, error=0x7f442f80bef0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2805--Type <RET> for more, q to quit, c to continue without paging-- #20 0x00007f442fd1d659 in mongoc_cluster_stream_for_writes ( cluster=0x7f442faa0bd8, cs=0x0, reply=0x7f442f80bd80, error=0x7f442f80bef0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-cluster.c:2864 #21 0x00007f442fd0c6b3 in mongoc_bulk_operation_execute (bulk=0x7f443024b780, reply=0x7f442f80bd80, error=0x7f442f80bef0) at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-bulk-operation.c:820 #22 0x00007f44301128a9 in mongocxx::v_noabi::bulk_write::execute ( this=0x7f442f80c230) at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/bulk_write.cpp:183 #23 0x00007f443012c418 in mongocxx::v_noabi::collection::_insert_one ( this=0x7f442f80c630, session=0x0, document=..., options=...) at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/collection.cpp:546 #24 0x00007f443012c744 in mongocxx::v_noabi::collection::insert_one ( this=0x7f442f80c630, document=..., options=...) at /tmp/mongo-cxx-driver-r3.7.1/src/mongocxx/collection.cpp:557
The interesting parts are #9 together with #6. In function _mongoc_topology_scanner_cb (the one from #9) in line 170 a topology modification is started with mc_tpld_modify_begin. This will be commited in line 200. But in the error handling the code will initiate a new mongoc_topology_scanner_scan from line 191. This will lead to the #6 code fragment, which will try to start a new topology modification as well.
To ensure both code fragments work on the same data structure I additionally checked the topology pointers in both locations. And they are working on the same object.
(gdb) up 6
#6 0x00007f442fd6c4d5 in _mongoc_topology_scanner_setup_err_cb (id=1,
data=0x7f442faa09d0, error=0x7f442faaca80)
at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:133
133 /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c: No such file or directory.
(gdb) p topology
$8 = (mongoc_topology_t *) 0x7f442faa09d0
(gdb) up 3
#9 0x00007f442fd6c62f in _mongoc_topology_scanner_cb (id=1,
hello_response=0x0, rtt_msec=0, data=0x7f442faa09d0, error=0x7f442f43a0c0)
at /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c:192
192 in /tmp/mongo-c-driver-1.23.4/src/libmongoc/src/mongoc/mongoc-topology.c
(gdb) p topology
$9 = (mongoc_topology_t * const) 0x7f442faa09d0
- is depended on by
-
PHPC-2265 Possible deadlock when querying during a MongoDB node restart
- Closed