IMPALA-4180

Crash: impala::DiskIoRequestContext::Cancel

    Details

    • Docs Text:
      This change fixes an issue which may crash Impalad for queries with plan fragments containing multiple HDFS scan nodes. This is more likely to happen if the query option num_nodes is set to 1, or if exec_single_node_rows_threshold is set to a large value, which makes it more likely to generate plan fragments containing multiple HDFS scan nodes.

      The workaround is to set num_nodes to 0 or a number greater than 1, and to set exec_single_node_rows_threshold to 0.

      Description

      A crash occurs if the following query is run in a loop:

      for i in {1..1000}; do impala-shell.sh -q "set num_nodes=1; set DISABLE_CODEGEN=1; set NUM_SCANNER_THREADS=1; set RUNTIME_FILTER_MODE=0; with t as (select int_col x, bigint_col y from functional.alltypestiny order by id limit 2) select * from t t1 left outer join t t2 on t1.y = t2.x full outer join t t3 on t2.y = t3.x order by t1.x limit 10"; done
      

      The query is run in single-node optimization mode (1 node, 1 scanner thread, no codegen, no runtime filters). The crash is non-deterministic, so it can take a few minutes to occur.

      Stack Trace:

      #0  0x00007f792b3edc37 in __GI_raise (sig=sig@entry=6)
          at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
      #1  0x00007f792b3f1028 in __GI_abort () at abort.c:89
      #2  0x00007f792d613c55 in os::abort(bool) ()
         from /usr/lib/jvm/java-7-oracle-amd64/jre/lib/amd64/server/libjvm.so
      #3  0x00007f792d795cd7 in VMError::report_and_die() ()
         from /usr/lib/jvm/java-7-oracle-amd64/jre/lib/amd64/server/libjvm.so
      #4  0x00007f792d618b6f in JVM_handle_linux_signal ()
         from /usr/lib/jvm/java-7-oracle-amd64/jre/lib/amd64/server/libjvm.so
      #5  <signal handler called>
      #6  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
      #7  0x00000000010fb23d in pthread_mutex_lock (m=0x71)
          at /tmp/toolchain/boost-1.57.0/include/boost/thread/pthread/mutex.hpp:62
      #8  boost::mutex::lock (this=0x71)
          at /tmp/toolchain/boost-1.57.0/include/boost/thread/pthread/mutex.hpp:116
      #9  0x0000000001104b4c in boost::lock_guard<boost::mutex>::lock_guard (
          this=0x7f78e7531b50, m_=...)
          at /tmp/toolchain/boost-1.57.0/include/boost/thread/lock_guard.hpp:38
      #10 0x000000000133a913 in impala::DiskIoRequestContext::Cancel (this=0x1, status=...)
          at /home/tbobrovytsky/Impala/be/src/runtime/disk-io-mgr-reader-context.cc:30
      #11 0x0000000001328100 in impala::DiskIoMgr::CancelContext (this=0x9046a00, context=0x1, 
          wait_for_disks_completion=true)
          at /home/tbobrovytsky/Impala/be/src/runtime/disk-io-mgr.cc:457
      #12 0x0000000001327d0e in impala::DiskIoMgr::UnregisterContext (this=0x9046a00, 
          reader=0x1) at /home/tbobrovytsky/Impala/be/src/runtime/disk-io-mgr.cc:425
      #13 0x000000000191bcfd in impala::PlanFragmentExecutor::Close (this=0xb231180)
          at /home/tbobrovytsky/Impala/be/src/runtime/plan-fragment-executor.cc:512
      #14 0x000000000191618d in impala::PlanFragmentExecutor::~PlanFragmentExecutor (
          this=0xb231180, __in_chrg=<optimized out>)
          at /home/tbobrovytsky/Impala/be/src/runtime/plan-fragment-executor.cc:73
      #15 0x00000000018f4027 in boost::checked_delete<impala::PlanFragmentExecutor> (
          x=0xb231180) at /tmp/toolchain/boost-1.57.0/include/boost/core/checked_delete.hpp:34
      #16 0x00000000018efd81 in boost::scoped_ptr<impala::PlanFragmentExecutor>::~scoped_ptr (
          this=0x95eac28, __in_chrg=<optimized out>)
          at /tmp/toolchain/boost-1.57.0/include/boost/smart_ptr/scoped_ptr.hpp:82
      #17 0x00000000018db887 in impala::Coordinator::~Coordinator (this=0x95ea800, 
          __in_chrg=<optimized out>)
          at /home/tbobrovytsky/Impala/be/src/runtime/coordinator.cc:365
      #18 0x000000000147f5ff in boost::checked_delete<impala::Coordinator> (x=0x95ea800)
          at /tmp/toolchain/boost-1.57.0/include/boost/core/checked_delete.hpp:34
      #19 0x000000000147b121 in boost::scoped_ptr<impala::Coordinator>::~scoped_ptr (
          this=0xaf3e3c0, __in_chrg=<optimized out>)
          at /tmp/toolchain/boost-1.57.0/include/boost/smart_ptr/scoped_ptr.hpp:82
      #20 0x0000000001469330 in impala::ImpalaServer::QueryExecState::~QueryExecState (
          this=0xaf3e000, __in_chrg=<optimized out>)
          at /home/tbobrovytsky/Impala/be/src/service/query-exec-state.cc:115
      #21 0x0000000001433ed6 in std::_Sp_counted_ptr<impala::ImpalaServer::QueryExecState*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0xb0ac4e0)
          at /tmp/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/shared_ptr_base.h:373
      #22 0x00000000010efb1a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
          this=0xb0ac4e0)
          at /tmp/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/shared_ptr_base.h:149
      #23 0x00000000010ee559 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count
          (this=0x7f78e75321f8, __in_chrg=<optimized out>)
          at /tmp/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/shared_ptr_base.h:666
      #24 0x0000000001402b94 in std::__shared_ptr<impala::ImpalaServer::QueryExecState, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f78e75321f0, __in_chrg=<optimized out>)
          at /tmp/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/shared_ptr_base.h:914
      #25 0x0000000001402bae in std::shared_ptr<impala::ImpalaServer::QueryExecState>::~shared_ptr (this=0x7f78e75321f0, __in_chrg=<optimized out>)
          at /tmp/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/shared_ptr.h:93
      #26 0x00000000013f0927 in impala::ImpalaServer::UnregisterQuery (this=0xa633e00, 
          query_id=..., check_inflight=true, cause=0x0)
          at /home/tbobrovytsky/Impala/be/src/service/impala-server.cc:977
      #27 0x0000000001460640 in impala::ImpalaServer::close (this=0xa633e00, handle=...)
          at /home/tbobrovytsky/Impala/be/src/service/impala-beeswax-server.cc:355
      #28 0x000000000188e5d2 in beeswax::BeeswaxServiceProcessor::process_close (
          this=0xaec10e0, seqid=0, iprot=0x9da0420, oprot=0x9da18c0, callContext=0xb0e3a00)
          at /home/tbobrovytsky/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:3543
      #29 0x00000000018896a2 in beeswax::BeeswaxServiceProcessor::dispatchCall (this=0xaec10e0, 
          iprot=0x9da0420, oprot=0x9da18c0, fname=..., seqid=0, callContext=0xb0e3a00)
          at /home/tbobrovytsky/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:2952
      #30 0x000000000187349f in impala::ImpalaServiceProcessor::dispatchCall (this=0xaec10e0, 
          iprot=0x9da0420, oprot=0x9da18c0, fname=..., seqid=0, callContext=0xb0e3a00)
          at /home/tbobrovytsky/Impala/be/generated-sources/gen-cpp/ImpalaService.cpp:1673
      #31 0x00000000010ecdf4 in apache::thrift::TDispatchProcessor::process (this=0xaec10e0, 
          in=..., out=..., connectionContext=0xb0e3a00)
          at /tmp/toolchain/thrift-0.9.0-p8/include/thrift/TDispatchProcessor.h:121
      #32 0x000000000270127b in apache::thrift::server::TThreadPoolServer::Task::run() ()
      #33 0x00000000026e8e59 in apache::thrift::concurrency::ThreadManager::Worker::run() ()
      #34 0x00000000012a7f5f in impala::ThriftThread::RunRunnable (this=0xb0d6bc0, 
          runnable=..., promise=0x7ffc1d58c580)
          at /home/tbobrovytsky/Impala/be/src/rpc/thrift-thread.cc:64
      #35 0x00000000012a96af in boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>::operator() (
          this=0xb0e0660, p=0xb0d6bc0, a1=..., a2=0x7ffc1d58c580)
          at /tmp/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:280
      #36 0x00000000012a9545 in boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long>*> >::operator()<boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list0> (this=0xb0e0670, f=..., a=...)
          at /tmp/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:392
      #37 0x00000000012a9291 in boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long>*> > >::operator() (this=0xb0e0660)
          at /tmp/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #38 0x00000000012a91a4 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, impala::ThriftThread, boost::shared_ptr<apache::thrift::concurrency::Runnable>, impala::Promise<unsigned long>*>, boost::_bi::list3<boost::_bi::value<impala::ThriftThread*>, boost::_bi::value<boost::shared_ptr<apache::thrift::concurrency::Runnable> >, boost::_bi::value<impala::Promise<unsigned long>*> > >, void>::invoke (
          function_obj_ptr=...)
          at /tmp/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:153
      #39 0x00000000012ae736 in boost::function0<void>::operator() (this=0x7f78e7532d60)
          at /tmp/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:767
      #40 0x000000000155da85 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) (name=..., category=..., 
          functor=..., thread_started=0x7ffc1d58c370)
          at /home/tbobrovytsky/Impala/be/src/util/thread.cc:317
      #41 0x00000000015640dc in boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) (this=0xb0ecdc0, 
          f=@0xb0ecdb8: 0x155d7c0 <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*)>, a=...)
          at /tmp/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457
      #42 0x000000000156401f in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() (this=0xb0ecdb8)
          at /tmp/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #43 0x0000000001563f7a in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() (
          this=0xb0ecc00)
          at /tmp/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:116
      #44 0x00000000019801aa in thread_proxy ()
      #45 0x00007f792b784184 in start_thread (arg=0x7f78e7533700) at pthread_create.c:312
      #46 0x00007f792b4b137d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
      

        Activity

        kwho Michael Ho added a comment -

        It appears that runtime_state_->reader_contexts_ is corrupted:

        (gdb) pvector ((RuntimeState*)0x9fe3b00)->reader_contexts_
        elem[0]: $21 = (impala::DiskIoRequestContext *) 0xb6b4ee0
        elem[1]: $22 = (impala::DiskIoRequestContext *) 0x1
        elem[2]: $23 = (impala::DiskIoRequestContext *) 0x87a0ee0
        Vector size = 3
        Vector capacity = 1
        Element type = std::_Vector_base<impala::DiskIoRequestContext*, std::allocator<impala::DiskIoRequestContext*> >::pointer

        tarasbob Taras Bobrovytsky added a comment -

        I put the core file and the other files required for gdb into impala-desktop.ca.cloudera.com:/home/dev/IMPALA-4180
        To get the stack trace, cd into that directory on the machine and run:

        gdb impalad core.impalad.31114.1474504660
        

        Then run these gdb commands:

        set solib-search-path .
        set sysroot .
        bt
        
        dhecht Dan Hecht added a comment -

        I ran this on my system and got a different crash, so it looks like general memory corruption, maybe:

        #6  0x00000000027d782f in tc_newarray ()
        #7  0x000000000112d30c in __gnu_cxx::new_allocator<int>::allocate (this=0x9cddd60, __n=1) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/ext/new_allocator.h:104
        #8  0x000000000112b48b in std::allocator_traits<std::allocator<int> >::allocate (__a=..., __n=1) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/alloc_traits.h:357
        #9  0x0000000001128f38 in std::_Vector_base<int, std::allocator<int> >::_M_allocate (this=0x9cddd60, __n=1) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:170
        #10 0x00000000012f2c20 in std::vector<int, std::allocator<int> >::_M_emplace_back_aux<int const&> (this=0x9cddd60) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/vector.tcc:412
        #11 0x00000000012f2359 in std::vector<int, std::allocator<int> >::push_back (this=0x9cddd60, __x=@0x7f034a292fc8: 24) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:923
        #12 0x000000000136533b in impala::BufferedTupleStream::BufferedTupleStream (this=0x9cddd40, state=0x89df600, row_desc=..., block_mgr=0xabf2240, client=0x9ee6e80, use_initial_small_buffers=true,
            read_write=false, ext_varlen_slots=Python Exception <class 'ValueError'> Cannot find type const std::set<int, std::less<int>, std::allocator<int> >::_Rep_type:
        std::set with 0 elements) at /home/dhecht/src/Impala/be/src/runtime/buffered-tuple-stream.cc:88
        #13 0x00000000016ddb34 in impala::PartitionedHashJoinNode::Partition::Partition (this=0x9393bf0, state=0x89df600, parent=0x91d7a80, level=0)
            at /home/dhecht/src/Impala/be/src/exec/partitioned-hash-join-node.cc:359
        #14 0x00000000016e0c2b in impala::PartitionedHashJoinNode::ProcessBuildInput (this=0x91d7a80, state=0x89df600, level=0) at /home/dhecht/src/Impala/be/src/exec/partitioned-hash-join-node.cc:670
        #15 0x00000000016e0657 in impala::PartitionedHashJoinNode::ProcessBuildInput (this=0x91d7a80, state=0x89df600) at /home/dhecht/src/Impala/be/src/exec/partitioned-hash-join-node.cc:644
        #16 0x0000000001742a64 in impala::BlockingJoinNode::ProcessBuildInputAsync (this=0x91d7a80, state=0x89df600, build_sink=0x0, status=0x7f034b294da0)
            at /home/dhecht/src/Impala/be/src/exec/blocking-join-node.cc:152
        #17 0x000000000174659d in boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>::operator() (this=0x9e51c50, p=0x91d7a80,
            a1=0x89df600, a2=0x0, a3=0x7f034b294da0) at /home/dhecht/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:393
        #18 0x0000000001746449 in boost::_bi::list4<boost::_bi::value<impala::BlockingJoinNode*>, boost::_bi::value<impala::RuntimeState*>, boost::_bi::value<impala::DataSink*>, boost::_bi::value<impala::Promise<impala::Status>*> >::operator()<boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>, boost::_bi::list0> (this=0x9e51c60, f=...,
            a=...) at /home/dhecht/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457
        #19 0x0000000001746195 in boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>, boost::_bi::list4<boost::_bi::value<impala::BlockingJoinNode*>, boost::_bi::value<impala::RuntimeState*>, boost::_bi::value<impala::DataSink*>, boost::_bi::value<impala::Promise<impala::Status>*> > >::operator() (this=0x9e51c50)
            at /home/dhecht/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
        
        dhecht Dan Hecht added a comment -

        Another bt:

        #58 <signal handler called>
        #59 0x00000000027c8aa3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
        #60 0x00000000027c8e7f in tcmalloc::ThreadCache::Scavenge() ()
        #61 0x00000000027d5c3a in tc_free ()
        #62 0x00000000014e8080 in apache::thrift::transport::TMemoryBuffer::~TMemoryBuffer (this=0xadb8c80, __in_chrg=<optimized out>) at /home/dhecht/toolchain/thrift-0.9.0-p8/include/thrift/transport/TBufferTransports.h:556
        #63 0x00000000014e80bc in apache::thrift::transport::TMemoryBuffer::~TMemoryBuffer (this=0xadb8c80, __in_chrg=<optimized out>) at /home/dhecht/toolchain/thrift-0.9.0-p8/include/thrift/transport/TBufferTransports.h:558
        #64 0x000000000111467a in boost::checked_delete<apache::thrift::transport::TMemoryBuffer> (x=0xadb8c80) at /home/dhecht/toolchain/boost-1.57.0/include/boost/core/checked_delete.hpp:34
        #65 0x000000000111c1de in boost::detail::sp_counted_impl_p<apache::thrift::transport::TMemoryBuffer>::dispose (this=0xa63cd80) at /home/dhecht/toolchain/boost-1.57.0/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
        #66 0x00000000010f07b2 in boost::detail::sp_counted_base::release (this=0xa63cd80) at /home/dhecht/toolchain/boost-1.57.0/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:146
        #67 0x00000000010f0841 in boost::detail::shared_count::~shared_count (this=0x7f4ba356f718, __in_chrg=<optimized out>) at /home/dhecht/toolchain/boost-1.57.0/include/boost/smart_ptr/detail/shared_count.hpp:443
        #68 0x00000000010f1b9e in boost::shared_ptr<apache::thrift::transport::TMemoryBuffer>::~shared_ptr (this=0x7f4ba356f710, __in_chrg=<optimized out>) at /home/dhecht/toolchain/boost-1.57.0/include/boost/smart_ptr/shared_ptr.hpp:323
        #69 0x00000000010f1bc8 in impala::ThriftSerializer::~ThriftSerializer (this=0x7f4ba356f710, __in_chrg=<optimized out>) at /home/dhecht/src/Impala/be/src/rpc/thrift-util.h:42
        #70 0x00000000013dd4b0 in impala::SerializeThriftMsg<impala::TQueryCtx const> (env=0xbcf29e8, msg=0x7f4ba3571000, serialized_msg=0x7f4ba356f840) at /home/dhecht/src/Impala/be/src/rpc/jni-thrift-util.h:44
        #71 0x00000000013d90b6 in impala::JniUtil::CallJniMethod<impala::TQueryCtx, impala::TExecRequest> (obj=@0xa107a08: 0x9f32378, method=@0xa107a10: 0x94ccb40, arg=..., response=0x7f4ba356f910)
            at /home/dhecht/src/Impala/be/src/util/jni-util.h:263
        #72 0x00000000013d55a9 in impala::Frontend::GetExecRequest (this=0xa107a00, query_ctx=..., result=0x7f4ba356f910) at /home/dhecht/src/Impala/be/src/service/frontend.cc:202
        #73 0x00000000013f27d6 in impala::ImpalaServer::ExecuteInternal (this=0xa5ad000, query_ctx=..., session_state=std::shared_ptr (count 6, weak 0) 0xa06ee00, registered_exec_state=0x7f4ba3570f6f, exec_state=0x7f4ba35712f0)
            at /home/dhecht/src/Impala/be/src/service/impala-server.cc:806
        #74 0x00000000013f2360 in impala::ImpalaServer::Execute (this=0xa5ad000, query_ctx=0x7f4ba3571000, session_state=std::shared_ptr (count 6, weak 0) 0xa06ee00, exec_state=0x7f4ba35712f0)
            at /home/dhecht/src/Impala/be/src/service/impala-server.cc:765
        #75 0x0000000001461a4b in impala::ImpalaServer::query (this=0xa5ad000, query_handle=..., query=...) at /home/dhecht/src/Impala/be/src/service/impala-beeswax-server.cc:188
        #76 0x000000000188f488 in beeswax::BeeswaxServiceProcessor::process_query (this=0xbd757c0, seqid=0, iprot=0xbde32c0, oprot=0xb38c720, callContext=0xadb9d80)
            at /home/dhecht/src/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:2979
        #77 0x000000000188f1d6 in beeswax::BeeswaxServiceProcessor::dispatchCall (this=0xbd757c0, iprot=0xbde32c0, oprot=0xb38c720, fname="query", seqid=0, callContext=0xadb9d80)
            at /home/dhecht/src/Impala/be/generated-sources/gen-cpp/BeeswaxService.cpp:2952
        
        kwho Michael Ho added a comment -

        FWIW, I still cannot reproduce it on my own box. Han Xu can reproduce it consistently on his machine, and it's confirmed that the above query without "order by id limit 2" in the view (i.e. the plan ends up with only one top-n node) doesn't crash.

        dhecht Dan Hecht added a comment -

        With --disable_mem_pools=true, I seem to consistently get the crash here:

        #4  0x00007f681fb1b8af in JVM_handle_linux_signal () from /usr/lib/jvm/java-7-oracle/jre/lib/amd64/server/libjvm.so
        #5  <signal handler called>
        #6  0x00000000027c8aa3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
        #7  0x00000000027c8b3c in tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long) ()
        #8  0x00000000027d5cd0 in tc_free ()
        #9  0x00000000012e2334 in __gnu_cxx::new_allocator<impala::DiskIoRequestContext*>::deallocate (this=0xa5e9540, __p=0x58e3ee0) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/ext/new_allocator.h:110
        #10 0x00000000012e11e5 in std::allocator_traits<std::allocator<impala::DiskIoRequestContext*> >::deallocate (__a=..., __p=0x58e3ee0, __n=1) at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/alloc_traits.h:383
        #11 0x00000000012dfbc2 in std::_Vector_base<impala::DiskIoRequestContext*, std::allocator<impala::DiskIoRequestContext*> >::_M_deallocate (this=0xa5e9540, __p=0x58e3ee0, __n=1)
            at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:178
        #12 0x000000000160c5ce in std::vector<impala::DiskIoRequestContext*, std::allocator<impala::DiskIoRequestContext*> >::_M_emplace_back_aux<impala::DiskIoRequestContext* const&> (this=0xa5e9540)
            at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/vector.tcc:438
        #13 0x00000000016099c3 in std::vector<impala::DiskIoRequestContext*, std::allocator<impala::DiskIoRequestContext*> >::push_back (this=0xa5e9540, __x=@0x986ebd8: 0x95321e0)
            at /home/dhecht/toolchain/gcc-4.9.2/include/c++/4.9.2/bits/stl_vector.h:923
        #14 0x00000000016034e9 in impala::HdfsScanNodeBase::Close (this=0x986ea00, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/hdfs-scan-node-base.cc:449
        #15 0x00000000015f67ae in impala::HdfsScanNode::Close (this=0x986ea00, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/hdfs-scan-node.cc:224
        #16 0x000000000171cb6e in impala::TopNNode::Open (this=0xad37d40, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/topn-node.cc:163
        #17 0x0000000001742eae in impala::BlockingJoinNode::ConstructBuildAndOpenProbe (this=0x8dcd600, state=0xa5e8d00, build_sink=0x0) at /home/dhecht/src/Impala/be/src/exec/blocking-join-node.cc:201
        #18 0x00000000016dd13e in impala::PartitionedHashJoinNode::Open (this=0x8dcd600, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/partitioned-hash-join-node.cc:275
        #19 0x0000000001742eae in impala::BlockingJoinNode::ConstructBuildAndOpenProbe (this=0xace5b00, state=0xa5e8d00, build_sink=0x0) at /home/dhecht/src/Impala/be/src/exec/blocking-join-node.cc:201
        #20 0x00000000016dd13e in impala::PartitionedHashJoinNode::Open (this=0xace5b00, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/partitioned-hash-join-node.cc:275
        #21 0x000000000171c66e in impala::TopNNode::Open (this=0xad378c0, state=0xa5e8d00) at /home/dhecht/src/Impala/be/src/exec/topn-node.cc:137
        #22 0x000000000191f37c in impala::PlanFragmentExecutor::OpenInternal (this=0x7d62d00) at /home/dhecht/src/Impala/be/src/runtime/plan-fragment-executor.cc:301
        #23 0x000000000191f070 in impala::PlanFragmentExecutor::Open (this=0x7d62d00) at /home/dhecht/src/Impala/be/src/runtime/plan-fragment-executor.cc:274
        #24 0x00000000018e93d9 in impala::Coordinator::Wait (this=0xad65000) at /home/dhecht/src/Impala/be/src/runtime/coordinator.cc:1093
        #25 0x00000000014724ff in impala::ImpalaServer::QueryExecState::WaitInternal (this=0x9574000) at /home/dhecht/src/Impala/be/src/service/query-exec-state.cc:622
        #26 0x0000000001471ff6 in impala::ImpalaServer::QueryExecState::Wait (this=0x9574000) at /home/dhecht/src/Impala/be/src/service/query-exec-state.cc:590
        #27 0x000000000148ba55 in boost::_mfi::mf0<void, impala::ImpalaServer::QueryExecState>::operator() (this=0x7f6795f86d68, p=0x9574000) at /home/dhecht/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:49
        #28 0x000000000148b564 in boost::_bi::list1<boost::_bi::value<impala::ImpalaServer::QueryExecState*> >::operator()<boost::_mfi::mf0<void, impala::ImpalaServer::QueryExecState>, boost::_bi::list0> (this=0x7f6795f86d78,
            f=..., a=...) at /home/dhecht/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:253
        
        kwho Michael Ho added a comment -

        Thanks. The stacktrace is very useful. It appears that TopNNode::Open() is the culprit:

        Status TopNNode::Open(RuntimeState* state) {
        ...
        ....
          // Unless we are inside a subplan expecting to call Open()/GetNext() on the child
          // again, the child can be closed at this point.
          if (!IsInSubplan()) child(0)->Close(state); <<-----
        }
        

        Given that runtime_state_ is shared, the following code is not thread-safe, but we do spin off a separate thread for the build side and open the probe side concurrently:

        void HdfsScanNodeBase::Close(RuntimeState* state) {
          if (is_closed()) return;
        
          if (reader_context_ != NULL) {
            // There may still be io buffers used by parent nodes so we can't unregister the
            // reader context yet. The runtime state keeps a list of all the reader contexts and
            // they are unregistered when the fragment is closed.
            state->reader_contexts()->push_back(reader_context_); <<<------
            // Need to wait for all the active scanner threads to finish to ensure there is no
            // more memory tracked by this scan node's mem tracker.
            state->io_mgr()->CancelContext(reader_context_, true);
          }
        

        This is an artifact of set num_nodes=1;
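
        For illustration only, here is a minimal standalone C++ sketch (not Impala code; the names are invented for this example) of the race described above and of the locking that avoids it. Unsynchronized concurrent push_back() calls on a shared std::vector are a data race and can corrupt the vector, which is consistent with the corrupted reader_contexts_ shown in the first comment; guarding the push_back() with a mutex removes the race.

        #include <iostream>
        #include <mutex>
        #include <thread>
        #include <vector>

        int main() {
          std::vector<int> reader_contexts;   // stands in for RuntimeState::reader_contexts_
          std::mutex reader_contexts_lock;    // the synchronization that is currently missing

          // Stands in for HdfsScanNodeBase::Close() registering its reader context.
          auto close_scan_node = [&](int context_id) {
            std::lock_guard<std::mutex> l(reader_contexts_lock);
            reader_contexts.push_back(context_id);
          };

          // The build-side and probe-side scan nodes closing concurrently.
          std::thread build(close_scan_node, 1);
          std::thread probe(close_scan_node, 2);
          build.join();
          probe.join();

          std::cout << "contexts registered: " << reader_contexts.size() << std::endl;
          return 0;
        }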

        dhecht Dan Hecht added a comment -

        Thanks Michael Ho. But what is racing? Is the async build thread hitting an error and closing its input simultaneously?

        ProcessBuildInputAsync()
          if (!s.ok()) child(1)->Close(state);
        

        reader_contexts() is only used while closing, right?

        kwho Michael Ho added a comment -

        Dan Hecht, there is no error involved. If you look at the single-node plan, plan node 06 has two top-n nodes
        as its children. While the async build thread is in progress, we also issue Open() for the probe side.
        TopNNode::Open() calls Close() on its child, which means HdfsScanNodeBase::Close() can be called
        simultaneously on both the build side and the probe side, leading to corruption of the reader_contexts() vector,
        since the same RuntimeState is shared by both scan nodes.

          08:TOP-N [LIMIT=10]                            
          |  order by: x ASC                             
          |                                               
          07:HASH JOIN [FULL OUTER JOIN]                 
          |  hash predicates: bigint_col = int_col       
          |                                              
          |--05:TOP-N [LIMIT=2]                          
          |  |  order by: id ASC                         
          |  |                                           
          |  04:SCAN HDFS [functional.alltypestiny]      
          |     partitions=4/4 files=4 size=460B        
          |                                             
          06:HASH JOIN [LEFT OUTER JOIN]         
          |  hash predicates: bigint_col = int_col
          |                                       
          |--03:TOP-N [LIMIT=2]                   
          |  |  order by: id ASC                  
          |  |                                    
          |  02:SCAN HDFS [functional.alltypestiny]
          |     partitions=4/4 files=4 size=460B   
          |                                        
          01:TOP-N [LIMIT=2]                       
          |  order by: id ASC                      
          |                                        
          00:SCAN HDFS [functional.alltypestiny]   
             partitions=4/4 files=4 size=460B      
        
        dhecht Dan Hecht added a comment -

        I see what you mean, thanks. I guess this will go away with MT, so it's probably not worth doing much about it now since it's specific to num_nodes=1, right?

        kwho Michael Ho added a comment -

        I suppose it's more likely to happen with num_nodes=1, but is there any chance our planner will generate a fragment with multiple top-n nodes in it even with num_nodes > 1?

        I don't know what the actual impact would be, but it seems nicer for the TopN node not to close its child so early. If it followed the convention of other nodes, this problem would not exist. Maybe it should call Reset() on its child instead.

        dhecht Dan Hecht added a comment -

        The ProcessBuildInputAsync() call to Close() can result in the same issue though, right? So fixing top-n isn't sufficient. Why not introduce a reader_contexts_lock_ for now?

        dhecht Dan Hecht added a comment -

        Re-upgrading given that this isn't particular to num_nodes=1 after all. Michael Ho, are you able to take this one too, or should we find someone else to do the fix?

        alex.behm Alexander Behm added a comment -

        Michael Ho, Dan Hecht, just for clarification, can you provide an example where this issue can occur with num_nodes=0? I don't see how that's possible because you can only have two scan nodes in the same fragment with num_nodes=1. Of course, this is still a blocker because we can automatically set num_nodes=1 without user intervention based on EXEC_SINGLE_NODE_ROWS_THRESHOLD.

        kwho Michael Ho added a comment -

        Alexander Behm, it looks like you answered the question already. The single-node optimization may kick in even with num_nodes=0.

        tarasbob Taras Bobrovytsky added a comment -

        The following query causes a crash as well:

        for i in {1..1000}; do impala-shell.sh -q "set num_nodes=1; set DISABLE_CODEGEN=1; set NUM_SCANNER_THREADS=1; set RUNTIME_FILTER_MODE=0; with t as (select int_col x from functional.alltypestiny order by id limit 2) select * from t t1 left join t t2 on t1.x > 0"; done
        

        I verified that both the original query and the query above cause a crash at commit a5e84ac.

        kwho Michael Ho added a comment -

        https://github.com/apache/incubator-impala/commit/2a31fbdbfac9a7092c96e4ab9894e0db0e4ce9ca

        IMPALA-4180: Synchronize accesses to RuntimeState::reader_contexts_

        HdfsScanNodeBase::Close() may add its outstanding DiskIO context to
        RuntimeState::reader_contexts_ to be unregistered later when the
        fragment is closed. In a plan fragment with multiple HDFS scan nodes,
        it's possible for HdfsScanNodeBase::Close() to be called concurrently.
        To allow safe concurrent accesses, this change adds a SpinLock to
        synchronize accesses to 'reader_contexts_' in RuntimeState.

        Change-Id: I911fda526a99514b12f88a3e9fb5952ea4fe1973
        Reviewed-on: http://gerrit.cloudera.org:8080/4558
        Reviewed-by: Dan Hecht <dhecht@cloudera.com>
        Tested-by: Internal Jenkins
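
        A simplified sketch of the shape of that change is below. The member and method names here are illustrative assumptions, and impala::SpinLock is approximated with std::mutex; see the actual commit linked above for the real code. The point is that every access to the vector now takes the lock, so concurrent HdfsScanNodeBase::Close() calls no longer race.

        #include <mutex>
        #include <vector>

        namespace impala {

        class DiskIoRequestContext;

        class RuntimeState {
         public:
          // Scan nodes hand off their reader context through a lock-guarded
          // method instead of pushing into the vector directly.
          void AcquireReaderContext(DiskIoRequestContext* context) {
            std::lock_guard<std::mutex> l(reader_contexts_lock_);
            reader_contexts_.push_back(context);
          }

         private:
          std::mutex reader_contexts_lock_;  // a SpinLock in the actual change
          std::vector<DiskIoRequestContext*> reader_contexts_;
        };

        }  // namespace impala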


          People

          • Assignee:
            kwho Michael Ho
            Reporter:
            tarasbob Taras Bobrovytsky
          • Votes:
            0
            Watchers:
            5
