- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: 6.0.16, 7.3.3, 7.0.12, 5.0.28, 8.0.0-rc15
- Component/s: None
- Server Programmability
- Fully Compatible
- ALL
- v8.0
- Programmability 2024-12-23, Programmability 2025-01-06, Programmability 2025-01-20
libunwind calls dl_iterate_phdr here. That call actually goes to a wrapper that tries to call the version from libc, but dl_iterate_phdr is ultimately called either way. dl_iterate_phdr internally takes a lock, so if a thread is already running it when we try to take a stack trace from every thread, we'll deadlock. This can happen when RSTL lock acquisition fails and we try to print stack traces for all threads. Because we're killing a bunch of operations at that time, many threads are processing unhandled exceptions, with stack traces like this:
#8 <signal handler called>
#9 0x0000ffff8af338b4 in __lll_lock_wait () from /lib64/libpthread.so.0
#10 0x0000ffff8af2bed4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#11 0x0000ffff8aeb4af4 in dl_iterate_phdr () from /lib64/libc.so.6
#12 0x0000ffff8af67a18 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#13 0x0000ffff8af642f4 in ?? () from /lib64/libgcc_s.so.1
#14 0x0000ffff8af652c0 in ?? () from /lib64/libgcc_s.so.1
#15 0x0000ffff8af65840 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#16 0x0000aaaac67cdaf4 in __cxa_throw ()
#17 0x0000aaaac376f718 in mongo::error_details::throwExceptionForStatus(mongo::Status const&) ()
#18 0x0000aaaac3775048 in mongo::iassertFailed(mongo::Status const&, mongo::SourceLocation) ()
#19 0x0000aaaac36dd94c in _ZZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlNS_6StatusENS0_9WakeSpeedEE_clESG_SH_ENKUlvE_clEv.isra.268 ()
When libunwind calls dl_iterate_phdr, we have a deadlock.
To be clear, this issue could happen anywhere we try to print stack traces from a signal handler. It is just particularly likely in this circumstance, because we are killing lots of operations and printing stack traces for all threads at the same time.
This commit in libunwind introduced a mechanism to substitute a custom implementation for dl_iterate_phdr. If we upgraded to a libunwind version that includes it, we could potentially cache the results from dl_iterate_phdr (like we do here) and feed them to libunwind via that mechanism.
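The caching idea can be sketched roughly as follows. This is a minimal illustration, not the server's implementation: the namespace `phdr_sketch` and the functions `populateCache` and `cachedIteratePhdr` are hypothetical names. Registering the replacement with libunwind is deliberately left out, since the exact hook depends on the libunwind version. The sketch assumes Linux with glibc's `<link.h>`, and that no shared objects are loaded or unloaded after the snapshot is taken.

```cpp
#include <link.h>  // dl_iterate_phdr, struct dl_phdr_info (Linux/glibc)

#include <cassert>
#include <cstddef>
#include <vector>

namespace phdr_sketch {

// Snapshot of the loaded-object list. Hypothetical global, filled once from
// normal (non-signal) context and only read afterwards.
std::vector<dl_phdr_info> g_cache;

// Callback handed to the real dl_iterate_phdr: copy each entry into the cache.
// The copied struct still points at loader-owned memory (name, program
// headers), so it is only valid while the object stays loaded.
int collect(dl_phdr_info* info, std::size_t /*size*/, void* out) {
    static_cast<std::vector<dl_phdr_info>*>(out)->push_back(*info);
    return 0;  // 0 tells dl_iterate_phdr to keep iterating
}

// Call once at startup. This is the only place the loader lock is taken.
void populateCache() {
    g_cache.clear();
    dl_iterate_phdr(collect, &g_cache);
}

// Same shape as dl_iterate_phdr, but walks the snapshot without taking a lock.
// Like the real function, it stops at the first nonzero callback result.
int cachedIteratePhdr(int (*cb)(dl_phdr_info*, std::size_t, void*), void* data) {
    for (auto& entry : g_cache) {
        if (int rc = cb(&entry, sizeof(entry), data)) {
            return rc;
        }
    }
    return 0;
}

}  // namespace phdr_sketch
```

Under these assumptions, `populateCache()` would run during startup, before any stack-trace signal handler is installed, and a function like `cachedIteratePhdr` would be what gets handed to libunwind in place of the libc call. The trade-off is that the snapshot goes stale if a library is loaded or unloaded later, so it would need refreshing after such events.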
- depends on
  - SERVER-98185 upgrade "nongnu" libunwind to v1.8.1 (Closed)
- is depended on by
  - SERVER-97887 Enable SIGUSR2 stress testing in config_fuzzer_stress tasks (Needs Scheduling)
  - SERVER-92548 Add a command to make mongostream dump thread stack traces. (In Code Review)
  - SERVER-91012 Recommit SERVER-71520 (Needs Scheduling)
- is related to
  - SERVER-83271 Make synchronous signal handlers signal-safe (Open)
- related to
  - SERVER-99080 Complete TODO listed in SERVER-90775 (In Code Review)
  - SERVER-90777 Revert SERVER-71520 (Closed)
  - SERVER-93337 Avoid dumping thread stacks on timeout in ThroughputProbing (Closed)
  - SERVER-93365 Indicate that printAllThreadsStacksBlocking should not be used until deadlock scenario is fixed (Closed)
  - SERVER-95489 Explore if we can use curOp to log all active operations if we fail to acquire the RSTL during step up/step down (Open)