Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: Impala 2.8.0
Fix Version/s: None
Description
Testing the latest Kudu master exposes an Impala issue where a query appears to complete successfully even though a fragment continues to run. This can be observed in TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan when using the latest version of Kudu in the minicluster (IMPALA_TOOLCHAIN_BUILD_ID=308-96a4cc516e and IMPALA_KUDU_VERSION=e018a83). The test completes and the test table is deleted, yet a fragment remains afterwards. This then breaks the next test that runs (TestFragmentLifecycle), which expects no fragments to be 'in flight'. See IMPALA-4642 for that test failure.
This JIRA is not about Kudu changes or the test failure (tracked by IMPALA-4642), but about the fact that fragment instances are closed only after sending the 'done' report, which means the coordinator may unregister the query before its fragment instances have finished closing.
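A minimal sketch of that ordering, assuming a simplified fragment-instance loop with hypothetical OpenAndExec()/SendReport() helpers (illustrative names, not Impala's actual signatures):

// Hypothetical sketch of the ordering described above; names are illustrative.
void FragmentInstance::Exec() {
  Status status = OpenAndExec();      // run the plan tree to completion
  SendReport(status, /*done=*/true);  // coordinator now treats this instance as
                                      // finished and may unregister the whole query
  Close();                            // can still block for a long time, e.g.
                                      // joining scanner threads (see stack below)
}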
In the repro, after the test completed, no queries were reported running but 2 fragments were still in flight; the impalad web UI showed "impala-server.num-fragments-in-flight: 2".
I captured the stacks on the coordinator using gdb and saw 2 fragment execution threads, e.g.:
Thread 10 (Thread 0x7f27af541700 (LWP 14882)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x0000000001185b9b in boost::condition_variable::wait (this=0x12fbee58, m=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/pthread/condition_variable.hpp:73
#2 0x0000000001a55a0c in boost::thread::join_noexcept() ()
#3 0x00000000013316c1 in boost::thread::join (this=0x131214f0) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:767
#4 0x0000000001331d2e in impala::Thread::Join (this=0x11a6bd80) at /home/mj/dev/Impala/be/src/util/thread.h:104
#5 0x00000000015eecc4 in impala::ThreadGroup::JoinAll (this=0xc2516a0) at /home/mj/dev/Impala/be/src/util/thread.cc:327
#6 0x000000000178df9c in impala::KuduScanNode::Close (this=0xc251440, state=0xcdd8e00) at /home/mj/dev/Impala/be/src/exec/kudu-scan-node.cc:209
#7 0x00000000019e8e3d in impala::PlanFragmentExecutor::Close (this=0xa76c2b0) at /home/mj/dev/Impala/be/src/runtime/plan-fragment-executor.cc:502
#8 0x00000000019e08fd in impala::FragmentInstanceState::Exec (this=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/fragment-instance-state.cc:69
#9 0x00000000019ebfd3 in impala::QueryExecMgr::ExecFInstance (this=0xad09140, fis=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/query-exec-mgr.cc:100
#10 0x00000000019ef51e in boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>::operator() (this=0xb71c5e0, p=0xad09140, a1=0xa76c000) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
#11 0x00000000019ef3c1 in boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> >::operator()<boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list0> (this=0xb71c5f0, f=..., a=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313
...
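The interesting frames are #5 and #6: KuduScanNode::Close() joins all of its scanner threads, so Close() blocks until the last scanner thread exits. A minimal sketch of that pattern, with hypothetical member names (the real code is at be/src/exec/kudu-scan-node.cc:209):

// Hypothetical sketch of the blocking pattern in frames #5-#6 above.
void KuduScanNode::Close(RuntimeState* state) {
  done_ = true;                  // ask scanner threads to wind down
  scanner_threads_.JoinAll();    // frame #5: waits in pthread_cond_wait until
                                 // every scanner thread has exited
  ScanNode::Close(state);
}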
There are no coordinator threads because the query has completed. The log for query id e4100828b01f01c:5a7c79e200000000 shows this as well:
I1212 11:10:26.125466 2681 impala-server.cc:907] UnregisterQuery(): query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125478 2681 impala-server.cc:993] Cancel(): query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125489 2681 coordinator.cc:1362] Cancel() query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125504 2681 coordinator.cc:1428] CancelFragmentInstances() query_id=e4100828b01f01c:5a7c79e200000000, tried to cancel 0 fragment instances
I1212 11:10:26.126605 2681 admission-controller.cc:538] Released query id=e4100828b01f01c:5a7c79e200000000 agg_num_running=1, agg_num_queued=0, agg_mem_reserved=32.63 MB, local_host(local_mem_admitted=3.75 GB, num_admitted_running=1, num_queued=0, backend_mem_reserved=10.21 MB)
I1212 11:10:26.126647 2681 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=2
I1212 11:10:26.131104 2681 impala-hs2-server.cc:373] CloseSession(): request=TCloseSessionReq {...}
I1212 11:10:26.185884 2681 status.cc:114] Session closed
Note that CancelFragmentInstances() reports it "tried to cancel 0 fragment instances": the instances had already sent their 'done' reports, so the coordinator no longer considered them cancellable. It's not until almost a minute later that a fragment instance from this query completes:
I1212 11:11:03.275807 14882 query-exec-mgr.cc:103] Instance completed. instance_id=e4100828b01f01c:5a7c79e200000003
I1212 11:11:03.275813 14882 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=1
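Note the reference counting visible in the log: refcnt dropped to 2 when UnregisterQuery() released the coordinator's reference, and to 1 when this instance (thread 14882, the same LWP as in the stack above) finally completed, leaving the other in-flight instance holding the last reference. A minimal sketch of the release path the log suggests, with hypothetical names (not Impala's exact implementation):

// Hypothetical sketch of the refcounted teardown the log suggests: the
// QueryState outlives UnregisterQuery() as long as any fragment instance
// still holds a reference, e.g. while blocked in Close().
void QueryExecMgr::ReleaseQueryState(QueryState* qs) {
  int remaining = qs->refcnt.fetch_sub(1) - 1;  // logged as "refcnt=N"
  LOG(INFO) << "ReleaseQueryState(): query_id=" << qs->query_id()
            << " refcnt=" << remaining;
  if (remaining == 0) delete qs;  // only the last holder tears the state down
}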
Issue Links
- relates to IMPALA-4925 "Coordinator does not cancel fragments if query completes w/limit" (Resolved)