Kudu / KUDU-1564

Deadlock on raft notification ThreadPool

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.10.0
    • Fix Version/s: 1.3.0
    • Component/s: consensus
    • Labels: None

    Description

      In a stress test on a cluster, one of the tablet servers got stuck in a deadlock. It appears that:

      • The Raft notification ThreadPool for a tablet has a maximum of 24 threads (matching the number of cores).
      • One of the threads is in:
        #1  0x00000000019019b2 in kudu::Semaphore::Acquire() ()
        #2  0x0000000000985159 in kudu::consensus::Peer::Close() ()
        #3  0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
        #4  0x00000000009684bd in kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
        #5  0x000000000096eced in kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void ()(kudu::Status const&)> const&) ()
        #6  0x00000000009795be in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
        #7  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
        

      The remaining threads are all in:

      #1  0x0000000001924e87 in base::SpinLock::SlowLock() ()
      #2  0x000000000097df28 in kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*) const ()
      #3  0x00000000009791dd in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
      #4  0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
      

      It appears that the thread holding the lock is waiting on a peer response (in order to close the peer), but the task that would deliver that response is sitting in the ThreadPool's queue. It will never run, because every thread in the pool is blocked: one on the semaphore in Peer::Close(), and the rest on the lock that this thread holds.
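
The pattern can be reproduced in miniature. The following is only a sketch under stated assumptions, not Kudu code: `TinyPool` and `ResponseArrivesInTime` are hypothetical names, a `std::mutex` stands in for the config-change lock, and a `std::promise` stands in for the outstanding peer response. The timeout exists only so the sketch terminates; in the reported stacks the wait in `Semaphore::Acquire()` has no timeout, so the threads stay blocked forever.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical minimal fixed-size FIFO pool (NOT Kudu's ThreadPool).
class TinyPool {
 public:
  explicit TinyPool(int n) {
    for (int i = 0; i < n; ++i) workers_.emplace_back([this] { Run(); });
  }
  ~TinyPool() {
    { std::lock_guard<std::mutex> l(mu_); stop_ = true; }
    cv_.notify_all();
    for (auto& t : workers_) t.join();  // workers drain the queue, then exit
  }
  void Submit(std::function<void()> f) {
    { std::lock_guard<std::mutex> l(mu_); q_.push(std::move(f)); }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return stop_ || !q_.empty(); });
        if (q_.empty()) return;  // stopping and nothing left to run
        f = std::move(q_.front());
        q_.pop();
      }
      f();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> q_;
  bool stop_ = false;
  std::vector<std::thread> workers_;
};

// Returns true if the "peer response" task ran before the lock holder gave up.
bool ResponseArrivesInTime(int pool_size, int num_lock_waiters,
                           std::chrono::milliseconds timeout) {
  assert(pool_size > 0);
  std::mutex config_lock;       // stand-in for the config-change lock
  std::promise<void> response;  // stand-in for the pending peer response
  std::shared_future<void> response_done = response.get_future().share();
  std::promise<void> holder_started;
  std::promise<bool> result;
  std::future<bool> result_future = result.get_future();
  TinyPool pool(pool_size);  // declared last, so it is joined first

  // Task 1: takes the lock, then waits for the response while holding it
  // (analogous to ChangeConfig -> ... -> Peer::Close()).
  pool.Submit([&] {
    std::lock_guard<std::mutex> l(config_lock);
    holder_started.set_value();
    result.set_value(response_done.wait_for(timeout) ==
                     std::future_status::ready);
  });
  holder_started.get_future().wait();

  // Further tasks that each block on the same lock
  // (analogous to the other TryRemoveFollowerTask calls).
  for (int i = 0; i < num_lock_waiters; ++i) {
    pool.Submit([&] { std::lock_guard<std::mutex> l(config_lock); });
  }

  // The response callback is queued behind everything else.
  pool.Submit([&] { response.set_value(); });
  return result_future.get();
}
```

With a pool of 4 threads and 3 lock waiters, every worker is occupied and the response task cannot be dequeued until the holder gives up, so `ResponseArrivesInTime(4, 3, std::chrono::milliseconds(200))` returns false. With only 2 lock waiters, one worker stays free to run the response task and the same call returns true.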

      Attachments

        1. stacks.txt
          150 kB
          Todd Lipcon

            People

              Assignee: Todd Lipcon (tlipcon)
              Reporter: Todd Lipcon (tlipcon)