IMPALA-3875

Thrift threaded server hang in some cases

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Distributed Exec
    • Labels:
      None

      Description

      The hang looks like this:

      #0  0x000000398340e82d in read () from /lib64/libpthread.so.0
      #1  0x00000039870dea71 in ?? () from /usr/lib64/libcrypto.so.10
      #2  0x00000039870dcdc9 in BIO_read () from /usr/lib64/libcrypto.so.10
      #3  0x0000003989431873 in ssl23_read_bytes () from /usr/lib64/libssl.so.10
      #4  0x000000398942fe63 in ssl23_get_client_hello () from /usr/lib64/libssl.so.10
      #5  0x00000039894302f3 in ssl23_accept () from /usr/lib64/libssl.so.10
      #6  0x00000000015ee4bc in apache::thrift::transport::TSSLSocket::checkHandshake (this=0xf317b00) at src/thrift/transport/TSSLSocket.cpp:228
      #7  0x00000000015ee820 in apache::thrift::transport::TSSLSocket::read (this=0xf317b00, buf=0x7f8a9ea750a0 "@S\247\236\212\177", len=5) at src/thrift/transport/TSSLSocket.cpp:164
      #8  0x00000000015ebc4f in apache::thrift::transport::readAll<apache::thrift::transport::TSocket> (trans=..., buf=0x7f8a9ea750a0 "@S\247\236\212\177", len=5) at src/thrift/transport/TTransport.h:39
      #9  0x0000000000a80228 in apache::thrift::transport::TTransport::readAll (len=5, buf=0x7f8a9ea750a0 "@S\247\236\212\177", this=<optimized out>) at /usr/src/debug/impala-2.3.0-cdh5.5.2/thirdparty/thrift-0.9.0/build/include/thrift/transport/TTransport.h:126
      #10 apache::thrift::transport::TSaslTransport::receiveSaslMessage (this=0xb6a0770, status=0x7f8a9ea752e4, length=0x7f8a9ea752e8) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/transport/TSaslTransport.cpp:237
      #11 0x0000000000a7dc84 in apache::thrift::transport::TSaslServerTransport::handleSaslStartMessage (this=0xb6a0770) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/transport/TSaslServerTransport.cpp:80
      #12 0x0000000000a8075e in apache::thrift::transport::TSaslTransport::open (this=0xb6a0770) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/transport/TSaslTransport.cpp:95
      #13 0x0000000000a7e9c1 in apache::thrift::transport::TSaslServerTransport::Factory::getTransport (this=0xd0edcb0, trans=...) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/transport/TSaslServerTransport.cpp:145
      #14 0x00000000015f6f78 in apache::thrift::server::TThreadedServer::serve (this=0xc181420) at src/thrift/server/TThreadedServer.cpp:162
      #15 0x000000000095149c in impala::ThriftServer::ThriftServerEventProcessor::Supervise (this=<optimized out>) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/rpc/thrift-server.cc:173
      #16 0x0000000000ae0faa in boost::function0<void>::operator() (this=<optimized out>) at /opt/toolchain/boost-pic-1.55.0/include/boost/function/function_template.hpp:767
      #17 impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) (name=..., category=..., functor=..., thread_started=0x7fff9af4ca60) at /usr/src/debug/impala-2.3.0-cdh5.5.2/be/src/util/thread.cc:314
      #18 0x0000000000ae3250 in boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> >::operator()<void (*)(const std::string&, const std::string&, impala::Thread::ThreadFunctor, impala::Promise<long int>*), boost::_bi::list0> (a=...,
          f=@0xc3747b8: 0xae0df0 <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*)>, this=0xc3747c0) at /opt/toolchain/boost-pic-1.55.0/include/boost/bind/bind.hpp:457
      #19 boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() (this=0xc3747b8) at /opt/toolchain/boost-pic-1.55.0/include/boost/bind/bind_template.hpp:20
      #20 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() (this=0xc374600) at /opt/toolchain/boost-pic-1.55.0/include/boost/thread/detail/thread.hpp:117
      #21 0x0000000000d28c43 in ?? ()
      #22 0x0000003983407aa1 in start_thread () from /lib64/libpthread.so.0
      #23 0x00000039830e893d in clone () from /lib64/libc.so.6
      

      This is very bad: the whole threaded server accept thread hangs because it never gets a chance to dispatch a new serving thread via thread->start().

      This impalad becomes a zombie.
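
      For illustration, here is a simplified sketch of what TThreadedServer::serve() does per connection, reconstructed from the stack trace above (frames #12-#14). This is not the actual Thrift source; AcceptLoopSketch and its parameters are hypothetical names, and it assumes Thrift 0.9.x (boost::shared_ptr). It only shows why a stuck handshake stalls the accept loop:

      #include <thrift/concurrency/Thread.h>
      #include <thrift/transport/TServerTransport.h>
      #include <thrift/transport/TTransport.h>
      #include <boost/shared_ptr.hpp>

      using namespace apache::thrift::concurrency;
      using namespace apache::thrift::transport;

      void AcceptLoopSketch(TServerTransport& server_transport,
                            TTransportFactory& transport_factory,
                            ThreadFactory& thread_factory,
                            boost::shared_ptr<Runnable> task) {
        while (true) {
          // 1. Accept the raw connection (a TSocket, or TSSLSocket when SSL is enabled).
          boost::shared_ptr<TTransport> client = server_transport.accept();

          // 2. Wrap it. For TSaslServerTransport::Factory this calls
          //    TSaslTransport::open(), which performs the SASL negotiation and,
          //    underneath it, the SSL handshake (frames #0-#12 above). With no
          //    socket timeout, a peer that never completes the handshake blocks
          //    here forever.
          boost::shared_ptr<TTransport> transport = transport_factory.getTransport(client);

          // 3. Only after the handshake succeeds is a thread started to serve the
          //    connection (in the real server, a per-connection Task wrapping the
          //    processor). A hang in step 2 stalls this single loop, so the impalad
          //    stops accepting any new clients.
          boost::shared_ptr<Thread> worker = thread_factory.newThread(task);
          worker->start();
        }
      }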

      From http://github.mtv.cloudera.com/CDH/Impala/blob/cdh5-trunk/be/src/runtime/client-cache.cc#L106-L113,
      we should probably set the socket timeout before OpenWithRetry().
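
      A minimal sketch of that suggestion, using Thrift's standard TSocket timeout setters; MakeClientTransport and the 5-minute values are illustrative only, not the actual patch:

      #include <string>
      #include <thrift/transport/TBufferTransports.h>
      #include <thrift/transport/TSocket.h>
      #include <boost/shared_ptr.hpp>

      using apache::thrift::transport::TBufferedTransport;
      using apache::thrift::transport::TSocket;
      using apache::thrift::transport::TTransport;

      // Create the client socket with bounded connect/send/recv timeouts (in
      // milliseconds) *before* it is opened, so a peer that never completes the
      // handshake produces a timeout error instead of an indefinite hang.
      boost::shared_ptr<TTransport> MakeClientTransport(const std::string& host, int port) {
        boost::shared_ptr<TSocket> socket(new TSocket(host, port));
        socket->setConnTimeout(5 * 60 * 1000);
        socket->setSendTimeout(5 * 60 * 1000);
        socket->setRecvTimeout(5 * 60 * 1000);
        // The wrapped transport is then handed to the SASL/SSL layers and
        // OpenWithRetry() as usual.
        return boost::shared_ptr<TTransport>(new TBufferedTransport(socket));
      }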

        Activity

        Sailesh Mukil added a comment -

        Commit in: https://github.com/apache/incubator-impala/commit/ff629c2deb97b4ef25e80745bf4689dcbe8407fe

        Antoni added a comment -

        @Sailesh

        Hi,

        We've recently encountered similar hangs in Impala. Looking at the stack trace, it's similar to the one posted in this bug.
        So I was wondering, do you know under what circumstances this can happen?

        Thanks

        Sailesh Mukil added a comment -

        Antoni Although rare, there are a few reasons why they may happen:

        • Very slow progress on the peer, causing this node to stall as well.
        • Peer itself is hung due to some other bug, causing it not to respond to the node in question.

        Most of the hangs have been fixed in newer versions of Impala. However, if you wish to ensure progress (vs. a hang) on an older version of Impala, backporting the above patch should at least ensure graceful failures.

        The commit message in the patch briefly explains how it would help.

        Antoni Ivanov added a comment -

        @Sailesh
        Thanks,

        Btw, I failed to mention that it seems only the "frontend" part hangs - basically I cannot submit queries against the node, open the Web UI (port 25000), or start impala-shell (it hangs indefinitely).
        But looking at the logs, there seem to be fragments getting executed. Is this possible with this bug?

        Sailesh Mukil added a comment -

        Antoni Yes, we have seen before that this is possible if all the client connection handlers are hung. This will still allow internal communication since the backend threads are progressing, but the client handlers (threads) can make no progress due to the hang.

        You can have a look at the following numbers from the WebUI (if it's accessible):

        "impala.thrift-server.beeswax-frontend.connections-in-use"
        "impala.thrift-server.hiveserver2-frontend.connections-in-use"

        By default, we allow only 64 threads to service client requests per impalad.

        This limit can be increased or decreased by modifying the impalad startup flag "fe_service_threads".

        Antoni Ivanov added a comment -

        @Sailesh
        Thanks again,

        Looking at our telemetry (we regularly collect everything under node:25000/jsonmetrics?json),
        it seems that impala.thrift-server.beeswax-frontend.connections-in-use was at "3" before it stopped reporting (because the Web UI became unresponsive), but impala.thrift-server.beeswax-frontend.total-connections is 3500 (almost 3 times more than on the next node).

        Also, on that node we execute a lot of impala-shell queries, though usually one after another.

        Is it possible that impala-shell is leaking connections and not closing them properly?

        Sailesh Mukil added a comment -

        Antoni Ivanov I doubt that the shell would leak connections, since we haven't seen that happen before. A possibility is that the number of connections-in-use spiked after the last telemetry report?

        If you execute more queries on that node, then it makes sense that the "..total-connections" counter is higher (total-connections is the total number of connections made since the impalad started).

        Antoni Ivanov added a comment -

        Sailesh Mukil,

        Running lsof to list all open network files of the impalad process:
        lsof -P -n -i -a -p impalad_pid | grep ':25000' | wc -l
        This shows 72 (on all other nodes it's just 1, the 25000 listen port).

        They are all in state CLOSE_WAIT, e.g.:

        impalad 4846 impala 996u IPv4 548599405 0t0 TCP 127.0.0.1:25000->127.0.0.1:36856 (CLOSE_WAIT)
        impalad 4846 impala 997u IPv4 548599466 0t0 TCP 127.0.0.1:25000->127.0.0.1:36862 (CLOSE_WAIT)

        There appears to be no process that belongs to that other port (netstat -tulpn | grep 36862).

        Thanks again!

        Antoni Ivanov added a comment -

        Right, sorry, of course there wouldn't be another process.
        So it appears that although the connection is closed on the other end, Impala is still holding it open, which is perhaps this bug?

        Antoni Ivanov added a comment -

        Hi,

        Upon closer inspection (I compared the stack traces of the good nodes vs. the bad one), it appears that the hanging threads look like the one below, which seems to be https://issues.apache.org/jira/browse/IMPALA-2799

        #0 0x00007ff9d775d264 in __lll_lock_wait () from /lib64/libpthread.so.0
        #1 0x00007ff9d7758508 in _L_lock_854 () from /lib64/libpthread.so.0
        #2 0x00007ff9d77583d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
        #3 0x000000000082e778 in boost::mutex::lock() ()
        #4 0x0000000000af879c in impala::ImpalaHttpHandler::InflightQueryIdsHandler(std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, rapidjson::GenericDocument<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >*) ()
        #5 0x0000000000bfd045 in impala::Webserver::RenderUrlWithTemplate(std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, impala::Webserver::UrlHandler const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >, impala::ContentType) ()
        #6 0x0000000000bfe465 in impala::Webserver::BeginRequestCallback(sq_connection*, sq_request_info*) ()
        #7 0x0000000000c119b0 in ?? ()
        #8 0x0000000000c1412d in ?? ()
        #9 0x0000000000c147bd in ?? ()
        #10 0x00007ff9d77569d1 in start_thread () from /lib64/libpthread.so.0
        #11 0x00007ff9d74a38fd in clone () from /lib64/libc.so.6

        Sailesh Mukil added a comment -

        Antoni Ivanov Yes, that seems to point in the direction of our above diagnosis. However, IMPALA-2799 is a duplicate of this issue. There are some fixes that went in to ensure that this deadlock doesn't happen, which can be found in newer versions of Impala.

        Antoni Ivanov added a comment - edited

        Thanks for the help.
        Upgrading is too costly for us to do quickly at the moment (since we follow CDH upgrades, which upgrade everything).

        For reference, here is what we did to manage the issue:

        • We found the processes that are polling impalad port 25000 for statistics (impalad:25000/jsonmetrics?json). We have a few agents (monitoring and haproxy agents) and we stopped them. It also occurred a few times with the Cloudera Manager agent, but we didn't stop that one; we simply restarted Impala, since we are not sure what depends on it.

        • We suspect that a configuration change made to align better with Hadoop recommendations - sysctl net.core.somaxconn from 128 to 1024 (and changing ifconfig eth0 txqueuelen from 1000 to 4000) - may have worsened things and started causing this issue. But we haven't confirmed this.

          People

          • Assignee: Sailesh Mukil
          • Reporter: Huaisi Xu
          • Votes: 0
          • Watchers: 10
