  1. Core Server
  2. SERVER-90775

libunwind deadlocks when called from signal handler while dl_iterate_phdr is running

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical - P2
    • Fix Version/s: 8.1.0-rc0
    • Affects Version/s: 6.0.16, 7.3.3, 7.0.12, 5.0.28, 8.0.0-rc15
    • Component/s: None
    • None
    • Server Programmability
    • Fully Compatible
    • ALL
    • v8.0
    • Programmability 2024-12-23, Programmability 2025-01-06, Programmability 2025-01-20

      libunwind calls dl_iterate_phdr here. That call actually goes to a wrapper that tries to call the version from libc, but either way dl_iterate_phdr is ultimately called. dl_iterate_phdr takes a lock internally, so if a thread is already running it when we signal every thread to take a stack trace, we'll deadlock. This can happen when RSTL lock acquisition fails and we try to print stack traces for all threads: because we're killing a bunch of operations at that time, many threads are processing unhandled exceptions, with stack traces like this:

      #8 <signal handler called>
      #9 0x0000ffff8af338b4 in __lll_lock_wait () from /lib64/libpthread.so.0
      #10 0x0000ffff8af2bed4 in pthread_mutex_lock () from /lib64/libpthread.so.0
      #11 0x0000ffff8aeb4af4 in dl_iterate_phdr () from /lib64/libc.so.6
      #12 0x0000ffff8af67a18 in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
      #13 0x0000ffff8af642f4 in ?? () from /lib64/libgcc_s.so.1
      #14 0x0000ffff8af652c0 in ?? () from /lib64/libgcc_s.so.1
      #15 0x0000ffff8af65840 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
      #16 0x0000aaaac67cdaf4 in __cxa_throw ()
      #17 0x0000aaaac376f718 in mongo::error_details::throwExceptionForStatus(mongo::Status const&) ()
      #18 0x0000aaaac3775048 in mongo::iassertFailed(mongo::Status const&, mongo::SourceLocation) ()
      #19 0x0000aaaac36dd94c in _ZZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlNS_6StatusENS0_9WakeSpeedEE_clESG_SH_ENKUlvE_clEv.isra.268 ()

      When libunwind then calls dl_iterate_phdr from the signal handler on one of these threads, we have a deadlock.
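
      To make the hazard concrete, here is a minimal sketch of the pattern in the trace above. It is not MongoDB or libunwind code: loaderLock is a hypothetical stand-in for the lock inside dl_iterate_phdr, and the handler only probes the lock instead of blocking on it, so the demo terminates instead of hanging.

```cpp
#include <atomic>
#include <csignal>

// Hypothetical stand-in for the lock that dl_iterate_phdr takes internally.
std::atomic_flag loaderLock = ATOMIC_FLAG_INIT;

// Set by the handler when it finds the lock already held.
volatile std::sig_atomic_t sawLockHeld = 0;

extern "C" void onSignal(int) {
    // A real unwinder would block on the lock here. We only probe it,
    // because blocking would hang: the owner is the code we interrupted.
    if (loaderLock.test_and_set(std::memory_order_acquire)) {
        sawLockHeld = 1;
    } else {
        loaderLock.clear(std::memory_order_release);
    }
}

// Raises SIGUSR1 on the calling thread, optionally while "inside
// dl_iterate_phdr" (holding the stand-in lock). Returns true if the handler
// found the lock held, i.e. a blocking acquire would never have returned.
bool handlerSeesLockHeld(bool holdLock) {
    sawLockHeld = 0;
    auto previous = std::signal(SIGUSR1, onSignal);

    if (holdLock) {
        while (loaderLock.test_and_set(std::memory_order_acquire)) {
        }
    }

    std::raise(SIGUSR1);  // the handler runs on this thread before raise returns

    if (holdLock) {
        loaderLock.clear(std::memory_order_release);
    }

    std::signal(SIGUSR1, previous);
    return sawLockHeld == 1;
}
```

      handlerSeesLockHeld(true) models a thread interrupted in the middle of an exception unwind; handlerSeesLockHeld(false) models a thread interrupted anywhere else, which is why most stack-trace dumps succeed.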

      To be clear, this issue could happen anywhere we try to print stack traces from a signal handler; it's just particularly likely to occur in this circumstance because we are killing lots of operations and printing stack traces for all threads at the same time.

      This commit in libunwind introduced a mechanism to substitute a custom implementation for dl_iterate_phdr. If we upgraded to a libunwind version that includes it, we could potentially cache the results from dl_iterate_phdr (like we do here) and feed them to libunwind via that mechanism.
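
      A rough sketch of the caching idea, under stated assumptions: the names refreshPhdrCache, iterateCachedPhdr and cachedObjectCount are hypothetical, the cache is filled from normal (non-signal) context, and no shared object is loaded or unloaded afterwards. The libunwind hook itself is not called here; in recent libunwind it appears to be exposed as unw_set_iterate_phdr_function, but treat that name as an assumption to verify against the version we upgrade to.

```cpp
#include <link.h>

#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Snapshot of the loaded objects. The shallow copies keep pointers into
// loader-owned memory (dlpi_name, dlpi_phdr), so they are only valid while
// those objects stay loaded.
std::vector<dl_phdr_info> phdrCache;

// dl_iterate_phdr callback: append a copy of each entry to the vector in data.
int collectPhdr(dl_phdr_info* info, size_t size, void* data) {
    dl_phdr_info copy{};
    // Older libcs may pass a smaller struct; copy only what was provided.
    std::memcpy(&copy, info, std::min(size, sizeof(copy)));
    static_cast<std::vector<dl_phdr_info>*>(data)->push_back(copy);
    return 0;
}

// Call from normal context (startup, after dlopen). Takes the loader lock.
void refreshPhdrCache() {
    std::vector<dl_phdr_info> fresh;
    dl_iterate_phdr(collectPhdr, &fresh);
    phdrCache.swap(fresh);
}

// Replacement iterator: same contract as dl_iterate_phdr (stop at the first
// nonzero callback result and return it), but it only reads the snapshot and
// takes no lock.
int iterateCachedPhdr(int (*callback)(dl_phdr_info*, size_t, void*), void* data) {
    for (dl_phdr_info& info : phdrCache) {
        if (int rc = callback(&info, sizeof(info), data)) {
            return rc;
        }
    }
    return 0;
}

// Number of objects currently in the snapshot.
size_t cachedObjectCount() {
    return phdrCache.size();
}
```

      The trade-off is staleness: anything loaded after the last refreshPhdrCache() call is invisible to the unwinder, so the cache would need refreshing whenever the set of loaded objects changes.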

            Assignee:
            billy.donahue@mongodb.com Billy Donahue
            Reporter:
            ryan.berryhill@mongodb.com Ryan Berryhill
            Votes:
            3
            Watchers:
            29
