Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: Impala 2.8.0
Fix Version/s: None
Description
Testing the latest Kudu master exposes an Impala issue where a query appears to complete successfully even though a fragment continues to run. This can be observed in TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan when using the latest version of Kudu in the minicluster (IMPALA_TOOLCHAIN_BUILD_ID=308-96a4cc516e and IMPALA_KUDU_VERSION=e018a83). The test completes and the test table is deleted, yet a fragment remains afterwards. This then breaks the next test that runs (TestFragmentLifecycle), which expects no fragments to be 'in flight'. See IMPALA-4642 for that test failure.
This JIRA is not about Kudu changes or the test failure (tracked by IMPALA-4642), but about the fact that fragment instances are closed only after sending the 'done' report, which means the coordinator may unregister the query before its fragment instances have finished closing.
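A minimal sketch of that ordering, assuming a simplified fragment-instance loop with hypothetical OpenAndExec()/SendReport() helpers (illustrative names, not Impala's actual signatures):

// Hypothetical sketch of the ordering described above; names are illustrative.
void FragmentInstance::Exec() {
  Status status = OpenAndExec();      // run the plan tree to completion
  SendReport(status, /*done=*/true);  // coordinator now treats this instance as
                                      // finished and may unregister the whole query
  Close();                            // can still block for a long time, e.g.
                                      // joining scanner threads (see stack below)
}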
In the repro, after the test completed, no queries were reported running but 2 fragments were still in flight; the impalad web UI showed "impala-server.num-fragments-in-flight: 2".
I captured the stacks on the coordinator using gdb and saw 2 fragment execution threads, e.g.:
Thread 10 (Thread 0x7f27af541700 (LWP 14882)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x0000000001185b9b in boost::condition_variable::wait (this=0x12fbee58, m=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/pthread/condition_variable.hpp:73
#2 0x0000000001a55a0c in boost::thread::join_noexcept() ()
#3 0x00000000013316c1 in boost::thread::join (this=0x131214f0) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:767
#4 0x0000000001331d2e in impala::Thread::Join (this=0x11a6bd80) at /home/mj/dev/Impala/be/src/util/thread.h:104
#5 0x00000000015eecc4 in impala::ThreadGroup::JoinAll (this=0xc2516a0) at /home/mj/dev/Impala/be/src/util/thread.cc:327
#6 0x000000000178df9c in impala::KuduScanNode::Close (this=0xc251440, state=0xcdd8e00) at /home/mj/dev/Impala/be/src/exec/kudu-scan-node.cc:209
#7 0x00000000019e8e3d in impala::PlanFragmentExecutor::Close (this=0xa76c2b0) at /home/mj/dev/Impala/be/src/runtime/plan-fragment-executor.cc:502
#8 0x00000000019e08fd in impala::FragmentInstanceState::Exec (this=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/fragment-instance-state.cc:69
#9 0x00000000019ebfd3 in impala::QueryExecMgr::ExecFInstance (this=0xad09140, fis=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/query-exec-mgr.cc:100
#10 0x00000000019ef51e in boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>::operator() (this=0xb71c5e0, p=0xad09140, a1=0xa76c000) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
#11 0x00000000019ef3c1 in boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> >::operator()<boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list0> (this=0xb71c5f0, f=..., a=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313
...
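The interesting frames are #5 and #6: KuduScanNode::Close() joins all of its scanner threads, so Close() blocks until the last scanner thread exits. A minimal sketch of that pattern, with hypothetical member names (the real code is at be/src/exec/kudu-scan-node.cc:209):

// Hypothetical sketch of the blocking pattern in frames #5-#6 above.
void KuduScanNode::Close(RuntimeState* state) {
  done_ = true;                  // ask scanner threads to wind down
  scanner_threads_.JoinAll();    // frame #5: waits in pthread_cond_wait until
                                 // every scanner thread has exited
  ScanNode::Close(state);
}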
There are no coordinator threads because the query has completed. The log for query id e4100828b01f01c:5a7c79e200000000 shows this as well:
I1212 11:10:26.125466 2681 impala-server.cc:907] UnregisterQuery(): query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125478 2681 impala-server.cc:993] Cancel(): query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125489 2681 coordinator.cc:1362] Cancel() query_id=e4100828b01f01c:5a7c79e200000000
I1212 11:10:26.125504 2681 coordinator.cc:1428] CancelFragmentInstances() query_id=e4100828b01f01c:5a7c79e200000000, tried to cancel 0 fragment instances
I1212 11:10:26.126605 2681 admission-controller.cc:538] Released query id=e4100828b01f01c:5a7c79e200000000 agg_num_running=1, agg_num_queued=0, agg_mem_reserved=32.63 MB, local_host(local_mem_admitted=3.75 GB, num_admitted_running=1, num_queued=0, backend_mem_reserved=10.21 MB)
I1212 11:10:26.126647 2681 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=2
I1212 11:10:26.131104 2681 impala-hs2-server.cc:373] CloseSession(): request=TCloseSessionReq {...}
I1212 11:10:26.185884 2681 status.cc:114] Session closed
Note that CancelFragmentInstances() reports it "tried to cancel 0 fragment instances": the instances had already sent their 'done' reports, so the coordinator no longer considered them cancellable. It's not until almost a minute later that a fragment instance from this query completes:
I1212 11:11:03.275807 14882 query-exec-mgr.cc:103] Instance completed. instance_id=e4100828b01f01c:5a7c79e200000003
I1212 11:11:03.275813 14882 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=1
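Note the reference counting visible in the log: refcnt dropped to 2 when UnregisterQuery() released the coordinator's reference, and to 1 when this instance (thread 14882, the same LWP as in the stack above) finally completed, leaving the other in-flight instance holding the last reference. A minimal sketch of the release path the log suggests, with hypothetical names (not Impala's exact implementation):

// Hypothetical sketch of the refcounted teardown the log suggests: the
// QueryState outlives UnregisterQuery() as long as any fragment instance
// still holds a reference, e.g. while blocked in Close().
void QueryExecMgr::ReleaseQueryState(QueryState* qs) {
  int remaining = qs->refcnt.fetch_sub(1) - 1;  // logged as "refcnt=N"
  LOG(INFO) << "ReleaseQueryState(): query_id=" << qs->query_id()
            << " refcnt=" << remaining;
  if (remaining == 0) delete qs;  // only the last holder tears the state down
}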
Issue Links
- relates to IMPALA-4925 "Coordinator does not cancel fragments if query completes w/limit" (Resolved)