The impalad webserver sometimes hangs.
The following is one such case: the webserver threads are stuck trying to acquire a QueryExecStatus lock, but I cannot find in the stack traces where that lock was acquired. The web requests are sent by the CDH agent, which uses them to check impalad's activity.
The full gdb log is attached.
Thread 116 (Thread 0x7f288f5e1700 (LWP 31062)):
#0 0x000000378780e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00000037878095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00000037878094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00000000008d6eb8 in pthread_mutex_lock (this=0xcab4f50) at /data/impala/toolchain/boost-1.57.0/include/boost/thread/pthread/mutex.hpp:62
#4 boost::mutex::lock (this=0xcab4f50) at /data/impala/toolchain/boost-1.57.0/include/boost/thread/pthread/mutex.hpp:116
#5 0x0000000000b7903c in lock_guard (this=0xa7b5800, query_id=) at /data/impala/toolchain/boost-1.57.0/include/boost/thread/lock_guard.hpp:38
#6 impala::ImpalaServer::GetRuntimeProfileStr (this=0xa7b5800, query_id=) at /data/impala/be/src/service/impala-server.cc:573
#7 0x0000000000ba6a8c in impala::ImpalaHttpHandler::QueryProfileEncodedHandler (this=0x3f56be0, args=) at /data/impala/be/src/service/impala-http-handler.cc:219
#8 0x0000000000cafe75 in operator() (this=) at /data/impala/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:767
#9 impala::Webserver::RenderUrlWithTemplate (this=) at /data/impala/be/src/util/webserver.cc:443
#10 0x0000000000cb1295 in impala::Webserver::BeginRequestCallback (this=) at /data/impala/be/src/util/webserver.cc:414
#11 0x0000000000cc4850 in handle_request ()
#12 0x0000000000cc6fcd in process_new_connection ()
#13 0x0000000000cc765d in worker_thread ()
#14 0x0000003787807aa1 in start_thread () from /lib64/libpthread.so.0
#15 0x00000037874e8bcd in clone () from /lib64/libc.so.6
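One possible way to find the holder: on Linux with glibc, a locked pthread_mutex_t records the kernel thread id (the LWP number gdb prints in each thread header) of its owner. The block below is a minimal sketch of that idea, not Impala code. The helper names CurrentLwp, MutexOwnerLwp and OwnerFieldMatchesCaller are hypothetical, and the __data.__owner field is a glibc implementation detail rather than a public API, so it may not be populated in every configuration (for example with lock elision).

```cpp
// Sketch only: relies on glibc's internal pthread_mutex_t layout on Linux.
// __data.__owner is an implementation detail, not a documented interface.
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cassert>

// Kernel thread id of the calling thread, i.e. the "LWP" value shown by gdb.
static pid_t CurrentLwp() {
  return static_cast<pid_t>(syscall(SYS_gettid));
}

// LWP of the thread currently holding `m`, or 0 when the mutex is unlocked.
// Hypothetical helper: it only reads the owner field that glibc maintains.
static pid_t MutexOwnerLwp(const pthread_mutex_t& m) {
  return static_cast<pid_t>(m.__data.__owner);
}

// Locks a default mutex, checks that the owner field equals our own LWP,
// then unlocks it and checks that the field has been reset to 0.
static bool OwnerFieldMatchesCaller() {
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  int rc = pthread_mutex_lock(&m);
  assert(rc == 0);
  const bool held_by_us = (MutexOwnerLwp(m) == CurrentLwp());

  rc = pthread_mutex_unlock(&m);
  assert(rc == 0);
  (void)rc;  // Silences the unused-variable warning when NDEBUG is defined.
  const bool released = (MutexOwnerLwp(m) == 0);

  pthread_mutex_destroy(&m);
  return held_by_us && released;
}
```

The same field can be read from the hung process in gdb. Assuming the address in frame #3 (this=0xcab4f50) is the underlying pthread_mutex_t, which holds only if boost::mutex stores it as its first member, something like `print ((pthread_mutex_t *) 0xcab4f50)->__data.__owner` should print an LWP that can be matched against the thread headers in the full log. If no live thread has that LWP, the owner may have exited without releasing the lock.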
The hang occurs on Impala 2.8.0, but the be/service code has not changed much between 2.8.0 and 2.11.0, so the problem may still exist.
I would appreciate any guidance on finding the root cause, or a workaround for these hangs.