IMPALA-4645

Fragments close after query is unregistered


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version: Impala 2.8.0
    • Fix Version: None
    • Component: Backend

    Description

      Testing the latest Kudu master exposes an Impala issue where a query appears to complete successfully even though a fragment continues to run. This can be observed in TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan when using the latest version of Kudu in the minicluster (IMPALA_TOOLCHAIN_BUILD_ID=308-96a4cc516e and IMPALA_KUDU_VERSION=e018a83). The test completes and the test table is deleted, yet a fragment remains running afterwards. This then breaks the next test that runs (TestFragmentLifecycle), which expects no fragments to be 'in flight'. See IMPALA-4642 for that test failure.

      This JIRA is not about Kudu changes or the test failure (tracked by IMPALA-4642); it is about the fact that fragment instances are closed after sending the 'done' report, which means the query may be unregistered before the fragment finishes closing.

      In the repro, after the test completed, no queries were reported running but 2 fragments were still in flight. The impalad web UI showed "impala-server.num-fragments-in-flight: 2".

      I captured the stacks on the coordinator using gdb and saw 2 fragment execution threads, e.g.:

      Thread 10 (Thread 0x7f27af541700 (LWP 14882)):
      #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
      #1  0x0000000001185b9b in boost::condition_variable::wait (this=0x12fbee58, m=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/pthread/condition_variable.hpp:73
      #2  0x0000000001a55a0c in boost::thread::join_noexcept() ()
      #3  0x00000000013316c1 in boost::thread::join (this=0x131214f0) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:767
      #4  0x0000000001331d2e in impala::Thread::Join (this=0x11a6bd80) at /home/mj/dev/Impala/be/src/util/thread.h:104
      #5  0x00000000015eecc4 in impala::ThreadGroup::JoinAll (this=0xc2516a0) at /home/mj/dev/Impala/be/src/util/thread.cc:327
      #6  0x000000000178df9c in impala::KuduScanNode::Close (this=0xc251440, state=0xcdd8e00) at /home/mj/dev/Impala/be/src/exec/kudu-scan-node.cc:209
      #7  0x00000000019e8e3d in impala::PlanFragmentExecutor::Close (this=0xa76c2b0) at /home/mj/dev/Impala/be/src/runtime/plan-fragment-executor.cc:502
      #8  0x00000000019e08fd in impala::FragmentInstanceState::Exec (this=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/fragment-instance-state.cc:69
      #9  0x00000000019ebfd3 in impala::QueryExecMgr::ExecFInstance (this=0xad09140, fis=0xa76c000) at /home/mj/dev/Impala/be/src/runtime/query-exec-mgr.cc:100
      #10 0x00000000019ef51e in boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>::operator() (this=0xb71c5e0, p=0xad09140, a1=0xa76c000) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
      #11 0x00000000019ef3c1 in boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> >::operator()<boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list0> (this=0xb71c5f0, f=..., a=...) at /home/mj/dev/Impala/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313
      ...
      

      There are no coordinator threads, as the query has completed. The log for query id e4100828b01f01c:5a7c79e200000000 shows this as well.

      I1212 11:10:26.125466  2681 impala-server.cc:907] UnregisterQuery(): query_id=e4100828b01f01c:5a7c79e200000000
      I1212 11:10:26.125478  2681 impala-server.cc:993] Cancel(): query_id=e4100828b01f01c:5a7c79e200000000
      I1212 11:10:26.125489  2681 coordinator.cc:1362] Cancel() query_id=e4100828b01f01c:5a7c79e200000000
      I1212 11:10:26.125504  2681 coordinator.cc:1428] CancelFragmentInstances() query_id=e4100828b01f01c:5a7c79e200000000, tried to cancel 0 fragment instances
      I1212 11:10:26.126605  2681 admission-controller.cc:538] Released query id=e4100828b01f01c:5a7c79e200000000 agg_num_running=1, agg_num_queued=0, agg_mem_reserved=32.63 MB,  local_host(local_mem_admitted=3.75 GB, num_admitted_running=1, num_queued=0, backend_mem_reserved=10.21 MB)
      I1212 11:10:26.126647  2681 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=2
      I1212 11:10:26.131104  2681 impala-hs2-server.cc:373] CloseSession(): request=TCloseSessionReq {...}
      I1212 11:10:26.185884  2681 status.cc:114] Session closed
      

      It's not until almost a minute later that a fragment instance from this query completes:

      I1212 11:11:03.275807 14882 query-exec-mgr.cc:103] Instance completed. instance_id=e4100828b01f01c:5a7c79e200000003
      I1212 11:11:03.275813 14882 query-exec-mgr.cc:138] ReleaseQueryState(): query_id=e4100828b01f01c:5a7c79e200000000 refcnt=1
      

      Attachments

        1. impalad-fragment-not-close.log
          2.35 MB
          Matthew Jacobs
        2. stacks.out
          1.06 MB
          Matthew Jacobs


        People

          Assignee: Unassigned
          Reporter: Matthew Jacobs (mjacobs)
          Votes: 0
          Watchers: 6
