Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-7464

DCHECK(!released_exec_resources_) is triggered when ExecFInstance() RPC times out

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 3.1.0
    • Fix Version/s: Impala 3.1.0
    • Component/s: Distributed Exec
    • Labels:
      None

      Description

      Hit the following when running custom_cluster_test/test_rpc_timeout.py

      F0817 20:04:25.451669   311 query-state.cc:469] Check failed: !released_exec_resources_
      
      void QueryState::AcquireExecResourceRefcount() {
        DCHECK(!released_exec_resources_); <<----
        exec_resource_refcnt_.Add(1);
      }
      
      #0  0x00007ff07d59a1f7 in raise () from /lib64/libc.so.6
      #1  0x00007ff07d59b8e8 in abort () from /lib64/libc.so.6
      #2  0x00000000043dd1b4 in google::DumpStackTraceAndExit() ()
      #3  0x00000000043d3c0d in google::LogMessage::Fail() ()
      #4  0x00000000043d54b2 in google::LogMessage::SendToLog() ()
      #5  0x00000000043d35e7 in google::LogMessage::Flush() ()
      #6  0x00000000043d6bae in google::LogMessageFatal::~LogMessageFatal() ()
      #7  0x0000000001e2f623 in impala::QueryState::AcquireExecResourceRefcount (this=0xc5c2800) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:469
      #8  0x0000000001e2b734 in impala::QueryState::Init (this=0xc5c2800, rpc_params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-state.cc:125
      #9  0x0000000001e516e1 in impala::QueryExecMgr::StartQuery (this=0xc7c55c0, params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/runtime/query-exec-mgr.cc:52
      #10 0x0000000001f6bd12 in impala::ImpalaInternalService::ExecQueryFInstances (this=0xd4421e0, return_val=..., params=...) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/service/impala-internal-service.cc:51
      #11 0x000000000308615c in impala::ImpalaInternalServiceProcessor::process_ExecQueryFInstances (this=0xcc38580, seqid=0, iprot=0xd3a4900, oprot=0xd3a4980, callContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1278
      #12 0x0000000003085eaa in impala::ImpalaInternalServiceProcessor::dispatchCall (this=0xcc38580, iprot=0xd3a4900, oprot=0xd3a4980, fname=..., seqid=0, callContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/generated-sources/gen-cpp/ImpalaInternalService.cpp:1251
      #13 0x0000000001913fd6 in apache::thrift::TDispatchProcessor::process (this=0xcc38580, in=..., out=..., connectionContext=0xc2475c0) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/thrift-0.9.3-p4/include/thrift/TDispatchProcessor.h:121
      #14 0x0000000001d2fbe9 in apache::thrift::server::TAcceptQueueServer::Task::run (this=0xce010c0) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/TAcceptQueueServer.cpp:68
      #15 0x0000000001d27ca5 in impala::ThriftThread::RunRunnable (this=0xc246500, runnable=..., promise=0x7fefee7b1260) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/rpc/thrift-thread.cc:74
      #16 0x0000000001d293cb in boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>::operator() (this=0xcdf1650, p=0xc246500, a1=..., a2=0x7fefee7b1260) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/mem_fn_template.hpp:280
      #17 0x0000000001d29261 in boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> >::operator()<boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list0> (this=0xcdf1660, f=..., a=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:392
      #18 0x0000000001d28fad in boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > >::operator() (this=0xcdf1650) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
      #19 0x0000000001d28ec0 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long, (impala::PromiseMode)0>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long, (impala::PromiseMode)0>*> > >, void>::invoke (function_obj_ptr=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:153
      #20 0x0000000001c4cbd0 in boost::function0<void>::operator() (this=0x7fefe719cba0) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/function/function_template.hpp:767
      #21 0x00000000020754db in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*) (name=..., category=..., functor=..., parent_thread_info=0x7fefee7b1850, thread_started=0x7fefee7b0f30) at /data/jenkins/workspace/impala-private-parameterized/repos/Impala/be/src/util/thread.cc:359
      #22 0x000000000207d7fb in boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list0&, int) (this=0xd016fc0, f=@0xd016fb8: 0x2075174 <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*)>, a=...) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind.hpp:525
      #23 0x000000000207d71f in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > >::operator()() (this=0xd016fb8) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/bind/bind_template.hpp:20
      #24 0x000000000207d6e2 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long, (impala::PromiseMode)0>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long, (impala::PromiseMode)0>*> > > >::run() (this=0xd016e00) at /data/jenkins/workspace/impala-private-parameterized/Impala-Toolchain/boost-1.57.0-p3/include/boost/thread/detail/thread.hpp:116
      #25 0x00000000032ff25a in thread_proxy ()
      #26 0x00007ff07d92fe25 in start_thread () from /lib64/libpthread.so.0
      #27 0x00007ff07d65d34d in clone () from /lib64/libc.so.6
      

      It appears that there is a race on the coordinator which occurs only when ExecFInstance() RPC times out due to really slow processing of the RPC on the coordinator itself. The assumption of the existing code is that by the time exec_rpcs_complete_barrier_ is unblocked, all executors should have processed their RPCs and called QueryState::Init(). This assumption is violated when the executor on coordinator is slow enough to cause ExecFInstance() RPC to time out.

      In which case, the coordinator proceeds to mark the query as failed in Coordinator::FinishBackendStartup() which calls UpdateExecState() which eventually calls ReleaseExecResources() in HandleExecStateTransition(). That's why the DCHECK above is triggered.

      I0817 20:04:23.451918   309 coordinator.cc:508] ExecState: query id=d740459b753125a4:f74e0fd700000000 finstance=N/A on host=impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com (EXECUTING -> ERROR) status=ExecQueryFInstances rpc query_id=d740459b753125a4:f74e0fd700000000 failed: RPC recv timed out: dest address: impala-ec2-centos74-m5-4xlarge-ondemand-1ebb.vpc.cloudera.com:22000, rpc: N6impala26TExecQueryFInstancesResultE
      

      In release builds, the behavior would be that the executor on coordinator will move ahead to allocate a bunch of resources after ReleaseExecResources() has been called on the coordinator. It will also start a bunch of fragment instances only to realize latter from ReportExecStatus() RPC that the query should be cancelled, in which case, it calls ReleaseExecResource() again after the ref count reaches 0 in ReleaseExecResourceRefCount() so there should be no resource leak. That said, if any logic relies on the flag released_exec_resources_, it may get confused.

        Attachments

          Activity

            People

            • Assignee:
              dhecht Daniel Hecht
              Reporter:
              kwho Michael Ho
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: