Description
In a stress test on a cluster, one of the tablet servers got stuck in a deadlock. It appears that:
- The Raft notification threadpool for a tablet has a maximum of 24 threads (corresponding to the number of cores)
- One of the threads is in:
#1 0x00000000019019b2 in kudu::Semaphore::Acquire() ()
#2 0x0000000000985159 in kudu::consensus::Peer::Close() ()
#3 0x000000000099d909 in kudu::consensus::PeerManager::Close() ()
#4 0x00000000009684bd in kudu::consensus::RaftConsensus::RefreshConsensusQueueAndPeersUnlocked() ()
#5 0x000000000096eced in kudu::consensus::RaftConsensus::ReplicateConfigChangeUnlocked(kudu::consensus::RaftConfigPB const&, kudu::consensus::RaftConfigPB const&, kudu::Callback<void ()(kudu::Status const&)> const&) ()
#6 0x00000000009795be in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
#7 0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
- The rest are in:
#1 0x0000000001924e87 in base::SpinLock::SlowLock() ()
#2 0x000000000097df28 in kudu::consensus::ReplicaState::LockForConfigChange(std::unique_lock<kudu::simple_spinlock>*) const ()
#3 0x00000000009791dd in kudu::consensus::RaftConsensus::ChangeConfig(kudu::consensus::ChangeConfigRequestPB const&, kudu::Callback<void ()(kudu::Status const&)> const&, boost::optional<kudu::tserver::TabletServerErrorPB_Code>*) ()
#4 0x0000000000978cd0 in kudu::consensus::RaftConsensus::TryRemoveFollowerTask(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kudu::consensus::RaftConfigPB const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
It appears that the thread holding the lock is blocked in Peer::Close(), waiting for an outstanding peer response so that it can close the peer. The callback for that response is sitting in the same ThreadPool's queue and will never run, because every thread in the pool is either the one waiting for the response or is blocked on the lock that thread holds.
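The pattern can be reproduced in miniature. The sketch below is not Kudu code: `TinyPool` and `CloseSeesResponse` are hypothetical names, and a one-thread pool stands in for a 24-thread pool whose other 23 threads are all blocked on the config-change lock. A task running on the pool queues a "response" callback on the same pool and then waits for it, as Peer::Close() appears to do. The wait uses a timeout so the example terminates; in the reported case the wait is unbounded, so the pool never recovers.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal fixed-size pool (an assumption for illustration,
// not Kudu's ThreadPool).
class TinyPool {
 public:
  explicit TinyPool(int n) {
    for (int i = 0; i < n; ++i) workers_.emplace_back([this] { Run(); });
  }
  // Drains any queued tasks, then joins all workers.
  ~TinyPool() {
    { std::lock_guard<std::mutex> l(mu_); done_ = true; }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  void Submit(std::function<void()> f) {
    { std::lock_guard<std::mutex> l(mu_); queue_.push_back(std::move(f)); }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutting down and nothing left to run
        f = std::move(queue_.front());
        queue_.pop_front();
      }
      f();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::vector<std::thread> workers_;
};

// Returns true if a "close" task running on the pool observes its
// "response" callback (queued on the same pool) within the timeout.
bool CloseSeesResponse(int pool_threads) {
  auto response = std::make_shared<std::promise<void>>();
  std::future<void> response_done = response->get_future();
  std::promise<bool> result;
  std::future<bool> result_f = result.get_future();
  TinyPool pool(pool_threads);  // declared last, so it is joined first

  pool.Submit([&] {
    // Stand-in for the peer response callback waiting in the pool's queue.
    pool.Submit([response] { response->set_value(); });
    // Stand-in for Peer::Close() blocking until the response is processed.
    bool ok = response_done.wait_for(std::chrono::milliseconds(300)) ==
              std::future_status::ready;
    result.set_value(ok);
  });
  return result_f.get();
}
```

Under these assumptions, `CloseSeesResponse(1)` should return false, because the only worker is the one doing the waiting, while `CloseSeesResponse(2)` should return true, because a free worker can run the callback. This is why not blocking on outstanding requests during close, as the linked KUDU-699 suggests, removes the hazard regardless of pool size.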
Issue Links
- relates to KUDU-699 PeerManager::Close shouldn't block on requests (Resolved)