Core Server / SERVER-56784

Replication threads on a secondary hang

    • Type: Bug
    • Resolution: Community Answered
    • Priority: Major - P3
    • Affects Version/s: 4.0.9, 4.0.19
    • Component/s: None
    • Operating System: ALL

      Recently, we encountered a strange phenomenon on some MongoDB 4.0 sharded clusters: replication on a secondary hangs, so the replication lag between the primary and the secondary grows very large.
       
      I have collected pstack data from the mongod process.
       
      From it we can see that all 16 replWriterThread workers are waiting for tasks, meaning they are idle:
      ```
      #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd458) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
      #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd430) at pthread_cond_wait.c:502
      #2  __pthread_cond_wait (cond=0x5580fc7dd430, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
      #3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
      #4  0x00005580f5632750 in mongo::ThreadPool::_consumeTasks() ()
      #5  0x00005580f5632e86 in mongo::ThreadPool::_workerThreadBody(mongo::ThreadPool*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      #6  0x00005580f56331be in std::thread::_Impl<std::_Bind_simple<mongo::stdx::thread::thread<mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1}, , 0>(mongo::ThreadPool::_startWorkerThread_inlock()::{lambda()#1})::{lambda()#1} ()> >::_M_run() ()
      #7  0x00005580f5f7ff60 in execute_native_thread_routine ()
      #8  0x00007fd5151a2fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
      #9  0x00007fd5150d14cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      ```
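       
      For context, here is a minimal C++ sketch of what a thread-pool worker loop like mongo::ThreadPool::_consumeTasks typically looks like. The names (SimplePool, _workAvailable, _poolIsIdle, _active) are my own illustrative assumptions, not MongoDB's actual implementation; frames #0-#4 above correspond to the wait at the top of this loop.
      ```
      #include <condition_variable>
      #include <deque>
      #include <functional>
      #include <mutex>

      // Illustrative sketch only; names and structure are assumptions.
      class SimplePool {
      public:
          void workerLoop();
          void waitForIdle();  // sketched after the next trace

      private:
          std::mutex _mutex;                       // mutex=0x5580fc7dd400 in both traces
          std::condition_variable _workAvailable;  // cond=0x...430: idle workers wait here
          std::condition_variable _poolIsIdle;     // cond=0x...460: waitForIdle() waits here
          std::deque<std::function<void()>> _tasks;
          int _active = 0;                         // tasks currently executing
          bool _shutdown = false;
      };

      void SimplePool::workerLoop() {
          std::unique_lock<std::mutex> lk(_mutex);
          for (;;) {
              // Frames #0-#4: an idle worker sleeps here until a task is
              // queued or shutdown is requested.
              _workAvailable.wait(lk, [&] { return _shutdown || !_tasks.empty(); });
              if (_shutdown && _tasks.empty())
                  return;
              auto task = std::move(_tasks.front());
              _tasks.pop_front();
              ++_active;  // the pool is not idle while a task is in flight
              lk.unlock();
              task();
              lk.lock();
              --_active;
              if (_tasks.empty() && _active == 0)
                  _poolIsIdle.notify_all();  // wake any waitForIdle() caller
          }
      }
      ```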
       
      But the batcher thread is stuck in ThreadPool::waitForIdle(), waiting on the replication writer threads:
      ```
      #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5580fc7dd48c) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
      #1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5580fc7dd400, cond=0x5580fc7dd460) at pthread_cond_wait.c:502
      #2  __pthread_cond_wait (cond=0x5580fc7dd460, mutex=0x5580fc7dd400) at pthread_cond_wait.c:655
      #3  0x00005580f5f7ceec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
      #4  0x00005580f56311bb in mongo::ThreadPool::waitForIdle() ()
      #5  0x00005580f4816d91 in mongo::repl::SyncTail::multiApply(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()
      #6  0x00005580f48186e3 in mongo::repl::SyncTail::_oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*, mongo::repl::SyncTail::OpQueueBatcher*) ()
      #7  0x00005580f48198c3 in mongo::repl::SyncTail::oplogApplication(mongo::repl::OplogBuffer*, mongo::repl::ReplicationCoordinator*) ()
      ```
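       
      The dependency is then explicit: waitForIdle() can return only once the task queue is drained and no task is in flight, so it relies entirely on a notification from a worker. Note that both traces block on the same mutex (0x5580fc7dd400) but on different condition variables (0x...430 vs 0x...460), which is consistent with the shape sketched above. Again, this is an assumption about the shape of the code, not MongoDB's implementation:
      ```
      // Continuation of the illustrative SimplePool sketch above.
      void SimplePool::waitForIdle() {
          std::unique_lock<std::mutex> lk(_mutex);
          // Frame #4 of this trace: returns only when the queue is empty AND
          // no task is still running; otherwise it sleeps until notified.
          _poolIsIdle.wait(lk, [&] { return _tasks.empty() && _active == 0; });
      }
      ```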
       
      So I suspect there is a bug here, but I have not been able to find its root cause.
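       
      For what it's worth, one generic way a pool can reach exactly this state (all workers parked on the task condvar, the waiter parked on the idle condvar, nothing in flight) is a lost wakeup, for example if idle-tracking state were ever mutated outside the mutex. The snippet below is purely a hypothetical illustration of that failure shape, not a claim about where the bug is in MongoDB's code:
      ```
      #include <condition_variable>
      #include <mutex>

      // HYPOTHETICAL lost-wakeup shape; not MongoDB's code.
      std::mutex m;
      std::condition_variable idleCv;
      int active = 1;  // one task still "in flight"

      void waiter() {
          std::unique_lock<std::mutex> lk(m);
          // Checks the predicate under m, then atomically releases m and sleeps.
          idleCv.wait(lk, [] { return active == 0; });
      }

      void buggyTaskFinish() {
          active = 0;           // BUG: state changed without holding m, so this
          idleCv.notify_all();  // notify can fire after the waiter saw "not idle"
                                // but before it blocked; the wakeup is lost and
                                // nothing ever notifies again.
      }
      ```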

  Attachments:
    1. ps1 (5.21 MB), uploaded by lipengchong
    2. ps2 (5.23 MB), uploaded by lipengchong

            Assignee: Dmitry Agranat
            Reporter: lipengchong (lpc)
            Votes: 0
            Watchers: 5
