IMPALA / IMPALA-5186

stress test caused crash in HdfsParquetScanner::Close()

    Details

      Description

A stress test at commit "IMPALA-5154: Handle 'unpartitioned' Kudu tables" caused a crash. Here is the backtrace:

      #0  0x000000323f632625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
      #1  0x000000323f633e05 in abort () at abort.c:92
      #2  0x00007fe3aa44ca55 in os::abort(bool) () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #3  0x00007fe3aa5ccf87 in VMError::report_and_die() () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #4  0x00007fe3aa45196f in JVM_handle_linux_signal () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #5  <signal handler called>
      #6  0x0000000001723845 in impala::HdfsParquetScanner::Close (this=0x7fdc12b96680, row_batch=0x0)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/exec/hdfs-parquet-scanner.cc:280
      #7  0x00000000016f8ceb in impala::HdfsScanNodeMt::Close (this=0x7fe0b8c1f600, state=0x7fe0c724f100)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/exec/hdfs-scan-node-mt.cc:123
      #8  0x00000000016a981a in impala::ExecNode::Close (this=0x1149a83c0, state=0x7fe0c724f100) at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/exec/exec-node.cc:198
      #9  0x00000000017a5233 in impala::PartitionedAggregationNode::Close (this=0x1149a83c0, state=0x7fe0c724f100)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/exec/partitioned-aggregation-node.cc:726
      #10 0x0000000001a752e9 in impala::PlanFragmentExecutor::Close (this=0x7fe235a95ad0)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:501
      #11 0x0000000001a6cdb5 in impala::FragmentInstanceState::Exec (this=0x7fe235a95800)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/runtime/fragment-instance-state.cc:71
      #12 0x0000000001a783e9 in impala::QueryExecMgr::ExecFInstance (this=0xe138100, fis=0x7fe235a95800)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/runtime/query-exec-mgr.cc:110
      #13 0x0000000001a7b212 in boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>::operator() (this=0x7fddb52eefa0, p=0xe138100, a1=0x7fe235a95800)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/mem_fn_template.hpp:165
      #14 0x0000000001a7b09b in boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> >::operator()<boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list0> (this=0x7fddb52eefb0, f=..., a=...)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:313
      #15 0x0000000001a7abcf in boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> > >::operator() (this=0x7fddb52eefa0)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #16 0x0000000001a7a7da in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::QueryExecMgr, impala::FragmentInstanceState*>, boost::_bi::list2<boost::_bi::value<impala::QueryExecMgr*>, boost::_bi::value<impala::FragmentInstanceState*> > >, void>::invoke (function_obj_ptr=...)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:153
      #17 0x000000000137bd66 in boost::function0<void>::operator() (this=0x7fe13b699c40)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
      #18 0x000000000162e35d in impala::Thread::SuperviseThread (name=..., category=..., functor=..., thread_started=0x7fd9fa94e9a0)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/be/src/util/thread.cc:325
      #19 0x0000000001636d38 in boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> >::operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, boost::function<void()>, impala::Promise<long> *), boost::_bi::list0 &, int) (this=0x1d6f83dc0, f=@0x1d6f83db8, a=...)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
      #20 0x0000000001636c7b in boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > >::operator()(void) (this=0x1d6f83db8)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #21 0x0000000001636c3e in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (this=0x1d6f83c00)
          at /usr/src/debug/impala-2.8.0-cdh5.12.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/thread/detail/thread.hpp:116
      #22 0x0000000001af270a in thread_proxy ()
      #23 0x000000323fa079d1 in start_thread (arg=0x7fe13b69a700) at pthread_create.c:301
      #24 0x000000323f6e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
      

Line 280 looks like this:

       269 void HdfsParquetScanner::Close(RowBatch* row_batch) {
      ...
       280     for (ParquetColumnReader* col_reader : column_readers_) col_reader->Close(nullptr);
      

Workaround

• SET MT_DOP=0 before running COMPUTE STATS, or set MT_DOP=0 globally
• Refrain from manually setting MT_DOP to a non-zero value for any query

        Activity

        dhecht Dan Hecht added a comment -

        Hmm, looking at the column_readers_ vector, the vtbls don't look right (but the objects look about right since parent_ == this, etc):

        (gdb) p *(this->column_readers_._M_impl._M_start)@10
        $19 = {0x27dbedc0, 0x7fe05a566940, 0x7fd6c41c2c00, 0x159e462c0, 0x7fe1854d4dc0, 0x7fce7563cb00, 0x7fce7563d8c0, 0x7fce7563d600, 0x7fce7563db80, 0x7fe1189442c0}
        (gdb) p $19[0]
        $20 = (impala::ParquetColumnReader *) 0x27dbedc0
        (gdb) p *$19[0]
        $21 = {_vptr.ParquetColumnReader = 0x7fdcba1a5c80, parent_ = 0x7fdc12b96680, node_ = @0x13bbd8300, slot_desc_ = 0x7fe0e25fd200, pos_slot_desc_ = 0x0, pos_current_value_ = -1, rep_level_ = 0,
          max_rep_level_ = 0, def_level_ = -1, max_def_level_ = 1, tuple_offset_ = 0, null_indicator_offset_ = {byte_offset = 168, bit_mask = 1 '\001'}}
        (gdb) p *$19[1]
        $22 = {_vptr.ParquetColumnReader = 0x27dbedc0, parent_ = 0x7fdc12b96680, node_ = @0x13bbd8330, slot_desc_ = 0x7fe0e25fd290, pos_slot_desc_ = 0x0, pos_current_value_ = -1, rep_level_ = 0,
          max_rep_level_ = 0, def_level_ = -1, max_def_level_ = 1, tuple_offset_ = 8, null_indicator_offset_ = {byte_offset = 168, bit_mask = 2 '\002'}}
        (gdb) p *$19[2]
        $23 = {_vptr.ParquetColumnReader = 0x7fe05a566940, parent_ = 0x7fdc12b96680, node_ = @0x13bbd8360, slot_desc_ = 0x7fe0e25fd320, pos_slot_desc_ = 0x0, pos_current_value_ = -1, rep_level_ = 0,
          max_rep_level_ = 0, def_level_ = -1, max_def_level_ = 1, tuple_offset_ = 16, null_indicator_offset_ = {byte_offset = 168, bit_mask = 4 '\004'}}
        (gdb) info vtbl $19[0]
        vtable for 'impala::ParquetColumnReader' @ 0x7fdcba1a5c80 (subobject @ 0x27dbedc0):
        [0]: 0x7fcf4c8039c0
        [1]: 0x286
        [2]: 0x7fdafffffffe
        [3]: 0x656d676172462020
        [4]: 0x356234666520746e
        [5]: 0x6334393463313065
        [6]: 0x653130633a343663
        [7]: 0x3030303063616632
        [8]: 0x6f54203a64303030
        [9]: 0x382e38383d6c6174
        

Looking at the obj_pool_ that they would live in, it looks like it was already cleared:

        (gdb) p obj_pool_.objects_._M_impl._M_start
        $24 = (std::_Vector_base<impala::ObjectPool::Element, std::allocator<impala::ObjectPool::Element> >::pointer) 0x7fde3ea17200
        (gdb) p obj_pool_.objects_._M_impl._M_finish
        $25 = (std::_Vector_base<impala::ObjectPool::Element, std::allocator<impala::ObjectPool::Element> >::pointer) 0x7fde3ea17200
        (gdb) p *obj_pool_.objects_._M_impl._M_start@10
        $26 = {{obj = 0x27dbedc0, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fe05a566940, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fd6c41c2c00, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x159e462c0, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fe1854d4dc0, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fce7563cb00, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fce7563d8c0, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fce7563d600, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fce7563db80, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}, {
            obj = 0x7fe1189442c0, delete_fn = 0x1768d36 <impala::ParquetColumnReader* impala::ObjectPool::Add<impala::ParquetColumnReader>(impala::ParquetColumnReader*)::{lambda(void*)#1}::_FUN(void*)>}}
        

        Alexander Behm, has the Close() path for scanners changed recently?

        alex.behm Alexander Behm added a comment -

Dan Hecht I don't think the Close() path specifically has changed dramatically, but there have been many changes to the Parquet scanner (dictionary filtering and min/max filtering).

dhecht Dan Hecht added a comment - edited

This looks to me like it might be a bug in HdfsScanNodeMt::GetNext(). If CreateAndOpenScanner() returns an error status, scanner_ is left pointing at the scanner, but that scanner was already closed by CreateAndOpenScanner(). Then, when the scan node is closed, the scanner is closed a second time.

        There is some evidence in the log file of that too:

        I0407 12:59:17.356449 84767 status.cc:47] Memory limit exceeded
            @          0x1207940  impala::Status::Status()
            @          0x12076b0  impala::Status::MemLimitExceeded()
            @          0x13975b3  impala::MemTracker::MemLimitExceeded()
            @          0x17276c4  impala::HdfsParquetScanner::InitDictFilterStructures()
            @          0x17232ec  impala::HdfsParquetScanner::Open()
            @          0x16e1e1d  impala::HdfsScanNodeBase::CreateAndOpenScanner()
            @          0x16f8855  impala::HdfsScanNodeMt::GetNext()
            @          0x17a0c41  impala::PartitionedAggregationNode::Open()
            @          0x1a73242  impala::PlanFragmentExecutor::OpenInternal()
            @          0x1a72c7e  impala::PlanFragmentExecutor::Open()
            @          0x1a6cd3f  impala::FragmentInstanceState::Exec()
            @          0x1a783e9  impala::QueryExecMgr::ExecFInstance()
            @          0x1a7b212  boost::_mfi::mf1<>::operator()()
            @          0x1a7b09b  boost::_bi::list2<>::operator()<>()
            @          0x1a7abcf  boost::_bi::bind_t<>::operator()()
            @          0x1a7a7da  boost::detail::function::void_function_obj_invoker0<>::invoke()
            @          0x137bd66  boost::function0<>::operator()()
            @          0x162e35d  impala::Thread::SuperviseThread()
            @          0x1636d38  boost::_bi::list4<>::operator()<>()
            @          0x1636c7b  boost::_bi::bind_t<>::operator()()
            @          0x1636c3e  boost::detail::thread_data<>::run()
            @          0x1af270a  (unknown)
            @       0x323fa079d1  (unknown)
            @       0x323f6e88fd  (unknown)
        

        Though it is several seconds before the crash (but maybe that's just because of the extreme load).
        Alexander Behm, since this looks MT-scan-node specific, do you want to take it?

        Michael Brown, on another note, I noticed this in the log file. Maybe the data should be regenerated (how often do we do that?):

        I0407 12:57:05.306138 85140 status.cc:114] Corrupt Parquet file 'hdfs://vc0332.halxg.cloudera.com:8020/user/hive/warehouse/tpch_100_parquet.db/partsupp/3444dbb2ccec395e-45da764500000007_1009013170_data.0.parq': column 'ps_partkey' had 1024 remaining values but expected 0
        
        dhecht Dan Hecht added a comment -

BTW, I think we have some end-to-end tests that exercise these error paths using debug actions. Are those not enabled for MT?

        mikesbrown Michael Brown added a comment -

        Dan Hecht wrote:

        Michael Brown, on another note, I noticed this in the log file. Maybe the data should be regenerated (how often do we do that?)

        Oh wow, good catch. We do it before every stress run except when doing an endurance test. In this case, the data was generated especially for this run, just before the run started.

        There's another run (with datagen) going now, and I'll report what happens there. Unfortunately, this also means the corrupt parquet file has been wiped.

        mikesbrown Michael Brown added a comment -

        There's another run (with datagen) going now, and I'll report what happens there. Unfortunately, this also means the corrupt parquet file has been wiped.

        Crash happened again, and corrupt Parquet files were seen there as well.

        alex.behm Alexander Behm added a comment -

        Dan Hecht, the scenario you found is a bug. We do not run all tests with MT_DOP (too many failures). Our existing failpoint tests do not run with MT_DOP because the tested query has a join. Will fix the bug and address the test gap.

        mikesbrown Michael Brown added a comment -

I added the label broken-build to this Jira, because the stress test finds this regularly, and early in the test phase.

        Followup on the "Corrupt Parquet file" thing:

        I tried to run some queries by hand, including these from Impala's TPC-H workload:

        Q2, Q9, Q11, Q16, Q20

        I didn't see any messages matching "Corrupt Parquet file" when I ran them. But somehow the stress test is able to trigger the messages.

        $ grep "Corrupt Parquet file" impalad.* | wc -l
        60
        $ 
        

        Let me know if I need to do more investigation into this angle. Thanks.

mikesbrown Michael Brown added a comment - edited

        For the above, Dan Hecht has asked for a separate bug: IMPALA-5197.

        alex.behm Alexander Behm added a comment -

        commit 491154c8ef7a9b6b1be2430c1bf119e21ce171c3
        Author: Alex Behm <alex.behm@cloudera.com>
        Date: Mon Apr 10 23:31:08 2017 -0700

        IMPALA-5186: Handle failed CreateAndOpenScanner() in MT scan.

        The bug was that a failed CreateAndOpenScanner() could cause
        a scanner to be closed twice leading to freed memory being
        accessed. The fix is straightforward.

        Testing:

        • I cleaned up test_failpoints.py and added an MT_DOP test
          dimension to cover this bug.
        • Core tests passed.

        Change-Id: I777c9b8ef2eb5b556c9b145d231c543b3b8ae270
        Reviewed-on: http://gerrit.cloudera.org:8080/6618
        Reviewed-by: Alex Behm <alex.behm@cloudera.com>
        Tested-by: Impala Public Jenkins


          People

          • Assignee:
            alex.behm Alexander Behm
            Reporter:
            mikesbrown Michael Brown