Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.12.0
    • Fix Version/s: Impala 4.0.0
    • Component/s: Distributed Exec

    Description

      Logs from a large cluster show that query startup can take several minutes, and that the query is cancelled as soon as startup completes because one of the intermediate RPCs failed.

      It is not clear what the right fix is, since fragments are started asynchronously. Possibly a timeout?

      I0401 21:25:30.776803 1830900 coordinator.cc:99] Exec() query_id=334cc7dd9758c36c:ec38aeb400000000 stmt=with customer_total_return as
      I0401 21:25:30.813993 1830900 coordinator.cc:357] starting execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
      I0401 21:29:58.406466 1830900 coordinator.cc:370] started execution on 644 backends for query_id=334cc7dd9758c36c:ec38aeb400000000
      I0401 21:29:58.412132 1830900 coordinator.cc:896] Cancel() query_id=334cc7dd9758c36c:ec38aeb400000000
      I0401 21:29:59.188817 1830900 coordinator.cc:906] CancelBackends() query_id=334cc7dd9758c36c:ec38aeb400000000, tried to cancel 643 backends
      I0401 21:29:59.189177 1830900 coordinator.cc:1092] Release admission control resources for query_id=334cc7dd9758c36c:ec38aeb400000000
      
      I0401 21:23:48.218379 1830386 coordinator.cc:99] Exec() query_id=e44d553b04d47cfb:28f06bb800000000 stmt=with customer_total_return as
      I0401 21:23:48.270226 1830386 coordinator.cc:357] starting execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
      I0401 21:29:58.402195 1830386 coordinator.cc:370] started execution on 640 backends for query_id=e44d553b04d47cfb:28f06bb800000000
      I0401 21:29:58.403818 1830386 coordinator.cc:896] Cancel() query_id=e44d553b04d47cfb:28f06bb800000000
      I0401 21:29:59.255903 1830386 coordinator.cc:906] CancelBackends() query_id=e44d553b04d47cfb:28f06bb800000000, tried to cancel 639 backends
      I0401 21:29:59.256251 1830386 coordinator.cc:1092] Release admission control resources for query_id=e44d553b04d47cfb:28f06bb800000000
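
      To quantify the delay, the gap between the "starting execution" and "started execution" lines can be computed from the glog timestamps. The helper names below are hypothetical and exist only for this illustration; the sketch assumes both timestamps fall on the same day and use the six-digit microsecond field shown in the logs above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: converts the "HH:MM:SS.uuuuuu" field of a glog line
// prefix into microseconds since midnight. Returns -1 if parsing fails.
long long GlogTimeToMicros(const std::string& hms) {
  int h = 0, m = 0, s = 0, us = 0;
  if (std::sscanf(hms.c_str(), "%2d:%2d:%2d.%6d", &h, &m, &s, &us) != 4) {
    return -1;
  }
  return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

// Elapsed seconds between two timestamps taken on the same day.
double ElapsedSeconds(const std::string& start, const std::string& end) {
  return (GlogTimeToMicros(end) - GlogTimeToMicros(start)) / 1e6;
}
```

      Applied to the two queries above, ElapsedSeconds("21:25:30.813993", "21:29:58.406466") gives about 267.6 seconds for the 644-backend query, and ElapsedSeconds("21:23:48.270226", "21:29:58.402195") gives about 370.1 seconds for the 640-backend query. In both cases Cancel() is logged within a few milliseconds of startup completing.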
      

      On the coordinator, threads appear to spend a lot of time waiting on exec_complete_barrier_:

      #0  0x00007fd928c816d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1  0x0000000001222944 in impala::Promise<bool>::Get() ()
      #2  0x0000000001220d7b in impala::Coordinator::StartBackendExec() ()
      #3  0x0000000001221c87 in impala::Coordinator::Exec() ()
      #4  0x0000000000c3a925 in impala::ClientRequestState::ExecQueryOrDmlRequest(impala::TQueryExecRequest const&) ()
      #5  0x0000000000c41f7e in impala::ClientRequestState::Exec(impala::TExecRequest*) ()
      #6  0x0000000000bff597 in impala::ImpalaServer::ExecuteInternal(impala::TQueryCtx const&, std::shared_ptr<impala::ImpalaServer::SessionState>, bool*, std::shared_ptr<impala::ClientRequestState>*) ()
      #7  0x0000000000c061d9 in impala::ImpalaServer::Execute(impala::TQueryCtx*, std::shared_ptr<impala::ImpalaServer::SessionState>, std::shared_ptr<impala::ClientRequestState>*) ()
      #8  0x0000000000c561c5 in impala::ImpalaServer::query(beeswax::QueryHandle&, beeswax::Query const&) ()
      /StartBackendExec
      #11 0x0000000000d60c9a in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::ThreadDebugInfo const*, impala::Promise<long>*), boost::_bi::list5<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::ThreadDebugInfo*>, boost::_bi::value<impala::Promise<long>*> > > >::run() ()
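
      One way to avoid blocking until every backend has reported is to let the waiter return as soon as any startup RPC fails, or after a deadline. The following is a minimal sketch of that idea using only the C++ standard library. StartupBarrier and its methods are hypothetical names for illustration; this is not Impala's barrier class and not necessarily how the issue was resolved.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical barrier for illustration only: waits for N backend startup
// reports, but wakes the waiter early on the first failure or on timeout.
class StartupBarrier {
 public:
  enum class Result { kAllStarted, kFailed, kTimedOut };

  explicit StartupBarrier(int num_backends) : pending_(num_backends) {}

  // Called once per backend when its startup RPC completes.
  void NotifyDone(bool ok) {
    std::lock_guard<std::mutex> l(mu_);
    if (!ok) failed_ = true;
    if (pending_ > 0) --pending_;
    cv_.notify_all();
  }

  // Blocks until all backends have reported, one has failed, or the timeout
  // elapses, whichever comes first.
  Result Wait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(mu_);
    bool done = cv_.wait_for(
        l, timeout, [this] { return failed_ || pending_ == 0; });
    if (failed_) return Result::kFailed;
    return done ? Result::kAllStarted : Result::kTimedOut;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int pending_;          // Backends that have not reported yet.
  bool failed_ = false;  // Set once any backend reports a failure.
};
```

      With a barrier like this, a coordinator thread would call Wait() with a deadline and start cancellation on kFailed or kTimedOut, rather than waiting for the slowest backend before noticing an earlier RPC failure. Choosing the timeout value and deciding how to handle RPCs still in flight are left open here.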
      

            People

              Assignee: Wenzhe Zhou (wzhou)
              Reporter: Mostafa Mokhtar (mmokhtar)