IMPALA-3485

stress test crash: Tuple::SetNull() from UnnestNode::Open()

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Duplicate
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: None
    • Component/s: Backend

    Description

      The nightly stress test triggered and detected an impalad crash.

      There's a full cluster collection of logs, binaries, and the core file on impala-desktop, in the dev home directory, under a directory named with this Jira ID.

      Note that this test has been running for weeks now and this is the first detected crash.

      Here's the backtrace:

      (gdb) bt
      #0  0x0000003fa8232625 in raise () from /lib64/libc.so.6
      #1  0x0000003fa8233e05 in abort () from /lib64/libc.so.6
      #2  0x00007f7f2b0edc55 in os::abort(bool) () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #3  0x00007f7f2b26fcd7 in VMError::report_and_die() () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #4  0x00007f7f2b0f2b6f in JVM_handle_linux_signal () from /opt/toolchain/sun-jdk-64bit-1.7.0.75/jre/lib/amd64/server/libjvm.so
      #5  <signal handler called>
      #6  0x0000000000c423d9 in SetNull (this=0x7f7c362e0380, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/runtime/tuple.h:124
      #7  impala::UnnestNode::Open (this=0x7f7c362e0380, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/unnest-node.cc:100
      #8  0x0000000000c52e8f in impala::BlockingJoinNode::Open (this=0x7f6d2dbd0f80, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/blocking-join-node.cc:209
      #9  0x0000000000c121da in impala::NestedLoopJoinNode::Open (this=0x7f6d2dbd0f80, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/nested-loop-join-node.cc:61
      #10 0x0000000000c3addb in impala::SubplanNode::GetNext (this=0x7f7677b0adc0, state=0x156eff000, row_batch=0x7f76145da0a0, eos=0x7f6d2dbd0e61)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/subplan-node.cc:124
      #11 0x0000000000c52f29 in impala::BlockingJoinNode::Open (this=0x7f6d2dbd0d00, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/blocking-join-node.cc:220
      #12 0x0000000000c121da in impala::NestedLoopJoinNode::Open (this=0x7f6d2dbd0d00, state=0x156eff000) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/nested-loop-join-node.cc:61
      #13 0x0000000000c3addb in impala::SubplanNode::GetNext (this=0x7f7351b5b760, state=0x156eff000, row_batch=0x7f76145db840, eos=0x7f734f769361)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/subplan-node.cc:124
      #14 0x0000000000c22ace in impala::PartitionedHashJoinNode::NextProbeRowBatch (this=0x7f734f769200, state=0x156eff000, out_batch=0x7f71fdf64360)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:754
      #15 0x0000000000c2a57d in impala::PartitionedHashJoinNode::GetNext (this=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:981
      #16 0x0000000000c22ace in impala::PartitionedHashJoinNode::NextProbeRowBatch (this=0x7f78b6957b00, state=0x156eff000, out_batch=0x7f71fdf65d40)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:754
      #17 0x0000000000c2a57d in impala::PartitionedHashJoinNode::GetNext (this=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:981
      #18 0x0000000000c22ace in impala::PartitionedHashJoinNode::NextProbeRowBatch (this=0x16815e880, state=0x156eff000, out_batch=0x7f71331f8760)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:754
      #19 0x0000000000c2a57d in impala::PartitionedHashJoinNode::GetNext (this=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-hash-join-node.cc:981
      #20 0x0000000000c162ee in impala::PartitionedAggregationNode::GetRowsStreaming (this=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-aggregation-node.cc:483
      #21 0x0000000000c1b801 in impala::PartitionedAggregationNode::GetNext (this=0x7f7c01478000, state=0x156eff000, row_batch=0x7f7a52862120, eos=0x7f7a34057099)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/exec/partitioned-aggregation-node.cc:385
      #22 0x0000000000d309cb in impala::PlanFragmentExecutor::GetNextInternal (this=0x7f7a34056f70, batch=0x7f73be176040) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:492
      #23 0x0000000000d30f5f in impala::PlanFragmentExecutor::OpenInternal (this=0x7f7a34056f70) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:365
      #24 0x0000000000d3175b in impala::PlanFragmentExecutor::Open (this=0x7f7a34056f70) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:328
      #25 0x0000000000ad7608 in impala::FragmentMgr::FragmentExecState::Exec (this=0x7f7a34056d00) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/service/fragment-exec-state.cc:54
      #26 0x0000000000acefca in impala::FragmentMgr::FragmentThread (this=0xdb0fc80, fragment_instance_id=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/service/fragment-mgr.cc:86
      #27 0x0000000000ad02da in operator() (function_obj_ptr=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:165
      #28 operator()<boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list0> (function_obj_ptr=Unhandled dwarf expression opcode 0xf3
      )   
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:313
      #29 operator() (function_obj_ptr=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #30 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf1<void, impala::FragmentMgr, impala::TUniqueId>, boost::_bi::list2<boost::_bi::value<impala::FragmentMgr*>, boost::_bi::value<impala::TUniqueId> > >, void>::invoke (function_obj_ptr=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:153
      #31 0x0000000000b71dd7 in operator() (name=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:767
      #32 impala::Thread::SuperviseThread (name=Unhandled dwarf expression opcode 0xf3
      ) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/be/src/util/thread.cc:316
      #33 0x0000000000b72714 in operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0> (this=0x188d3cc00)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457
      #34 operator() (this=0x188d3cc00) at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #35 boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (this=0x188d3cc00)
          at /usr/src/debug/impala-2.6.0-cdh5.8.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:116
      #36 0x0000000000dc02da in thread_proxy ()
      #37 0x0000003fa86079d1 in start_thread () from /lib64/libpthread.so.0
      #38 0x0000003fa82e88fd in clone () from /lib64/libc.so.6
      (gdb) 
      

      Activity

            tarmstrong Tim Armstrong added a comment -

            The problem is that it's trying to dereference a bad tuple:

            100         tuple->SetNull(coll_slot_desc_->null_indicator_offset());
            

            The pointer is bad, and it comes from the subplan's input row.

            (gdb) p *tuple
            Cannot access memory at address 0x152fc030168fe03
            (gdb) p **containing_subplan_->current_input_row_->tuples_
            Cannot access memory at address 0x152fc030168fe03
            

            The subplan just copies the input row from its input batch.

                current_input_row_ = input_batch_->GetRow(input_row_idx_);
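
            For orientation, frames #6/#7 correspond roughly to the path sketched below. This is a simplified reconstruction from the backtrace, not the actual source; the stub types and the tuple-index handling are approximations.

            // Sketch only (not Impala source): how UnnestNode::Open() reaches the bad pointer.
            struct Tuple { void SetNull(int null_indicator_offset); };
            struct TupleRow { Tuple** tuples_; };            // points into batch-owned memory
            struct SubplanNode { TupleRow* current_input_row_; };

            struct UnnestNode {
              SubplanNode* containing_subplan_;
              int coll_null_indicator_offset_;               // from coll_slot_desc_

              void Open() {
                // Grab the parent tuple out of the row the enclosing subplan is
                // currently iterating over (index 0 is a simplification) ...
                Tuple* tuple = containing_subplan_->current_input_row_->tuples_[0];
                // ... and mark the collection slot NULL (unnest-node.cc:100). If the
                // tuple pointer is dangling, this dereference is the crash site.
                tuple->SetNull(coll_null_indicator_offset_);
              }
            };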
            

            The failing query was TPC-H nested query 7:

            # Q7 - Volume Shipping Query
            select
              supp_nation,
              cust_nation,
              l_year,
              sum(volume) as revenue
            from (
              select
                n1.n_name as supp_nation,
                n2.n_name as cust_nation,
                year(l_shipdate) as l_year,
                l_extendedprice * (1 - l_discount) as volume
              from
                customer c,
                c.c_orders o,
                o.o_lineitems l,
                supplier s,
                region.r_nations n1,
                region.r_nations n2
              where
                s_suppkey = l_suppkey
                and s_nationkey = n1.n_nationkey
                and c_nationkey = n2.n_nationkey
                and (
                  (n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
                  or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
                )
                and l_shipdate between '1995-01-01' and '1996-12-31'
              ) as shipping
            group by
              supp_nation,
              cust_nation,
              l_year
            order by
              supp_nation,
              cust_nation,
              l_year
            

            This has the plan (in the planner tests anyway):

            ---- DISTRIBUTEDPLAN
            22:MERGING-EXCHANGE [UNPARTITIONED]
            |  order by: supp_nation ASC, cust_nation ASC, l_year ASC
            |
            16:SORT
            |  order by: supp_nation ASC, cust_nation ASC, l_year ASC
            |
            21:AGGREGATE [FINALIZE]
            |  output: sum:merge(volume)
            |  group by: supp_nation, cust_nation, l_year
            |
            20:EXCHANGE [HASH(supp_nation,cust_nation,l_year)]
            |
            15:AGGREGATE [STREAMING]
            |  output: sum(l_extendedprice * (1 - l_discount))
            |  group by: n1.n_name, n2.n_name, year(l_shipdate)
            |
            14:HASH JOIN [INNER JOIN, BROADCAST]
            |  hash predicates: c_nationkey = n2.n_nationkey
            |  other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE'))
            |  runtime filters: RF000 <- n2.n_nationkey
            |
            |--19:EXCHANGE [BROADCAST]
            |  |
            |  11:SCAN HDFS [tpch_nested_parquet.region.r_nations n2]
            |     partitions=1/1 files=1 size=4.18KB
            |
            13:HASH JOIN [INNER JOIN, BROADCAST]
            |  hash predicates: s_nationkey = n1.n_nationkey
            |  runtime filters: RF001 <- n1.n_nationkey
            |
            |--18:EXCHANGE [BROADCAST]
            |  |
            |  10:SCAN HDFS [tpch_nested_parquet.region.r_nations n1]
            |     partitions=1/1 files=1 size=4.18KB
            |
            12:HASH JOIN [INNER JOIN, BROADCAST]
            |  hash predicates: l_suppkey = s_suppkey
            |
            |--17:EXCHANGE [BROADCAST]
            |  |
            |  09:SCAN HDFS [tpch_nested_parquet.supplier s]
            |     partitions=1/1 files=1 size=111.08MB
            |     runtime filters: RF001 -> s_nationkey
            |
            01:SUBPLAN
            |
            |--08:NESTED LOOP JOIN [CROSS JOIN]
            |  |
            |  |--02:SINGULAR ROW SRC
            |  |
            |  04:SUBPLAN
            |  |
            |  |--07:NESTED LOOP JOIN [CROSS JOIN]
            |  |  |
            |  |  |--05:SINGULAR ROW SRC
            |  |  |
            |  |  06:UNNEST [o.o_lineitems l]
            |  |
            |  03:UNNEST [c.c_orders o]
            |
            00:SCAN HDFS [tpch_nested_parquet.customer c]
               partitions=1/1 files=4 size=554.13MB
               predicates: !empty(c.c_orders)
               predicates on o: !empty(o.o_lineitems)
               predicates on l: l_shipdate >= '1995-01-01', l_shipdate <= '1996-12-31'
               runtime filters: RF000 -> c_nationkey
            

            So it seems like the bad row is somehow being produced by the parquet scan node.

            dhecht Daniel Hecht added a comment -

            What commit did the crash first show up at?

            mikeb Michael Brown added a comment -

            The last build not to have the crash was b38a5cd0e85b7231ddf9d2495857f0e226a9f964, which was the previous night's run. The next build, which crashed, was e2998c52c37e669606583727793b40d984eb6085.

            $ git lg b38a5cd0e85b7231ddf9d2495857f0e226a9f964^..e2998c52c37e669606583727793b40d984eb6085
            * e2998c5 - IMPALA-2918: Unit test framework for simple scheduler (5 days ago) <Lars Volker>
            * 0225625 - IMPALA-3462: Fix exec option text for old HJ w/ runtime filters (5 days ago) <Henry Robinson>
            * a702271 - IMPALA-3286: Software prefetching for hash table build. (5 days ago) <Michael Ho>
            * 718b2a2 - Enable BOOST_NO_EXCEPTIONS for codegened code (6 days ago) <Tim Armstrong>
            * ba77f51 - Use unique_database fixture in test_compute_stats.py. (6 days ago) <Alex Behm>
            * 399cd13 - IMPALA-1583: Simplify PartitionedHashJoinNode::ProcessProbeBatch() (6 days ago) <Michael Ho>
            * 8931221 - IMPALA-2198: Differentiate queries in exceptional states in web UI (6 days ago) <Thomas Tauber-Marshall>
            * c60c410 - Rename FilesystemUtil::CreateDirectory() (6 days ago) <Lars Volker>
            * cac22dc - Strip global constructors and destructors from codegen module (6 days ago) <Tim Armstrong>
            * 12bf992 - IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems (7 days ago) <Sailesh Mukil>
            * 216d7f1 - IMPALA-3385: Fix crashes on accessing error_log (7 days ago) <Huaisi Xu>
            * 6d09f66 - IMPALA-3468: fix FindFirstInstance() SSE code to look for '\r' if necessary (7 days ago) <Skye Wanderman-Milne>
            * 91091bc - Ignore spurious MemPool DCHECK when using --disable_mem_pools flag (7 days ago) <Skye Wanderman-Milne>
            * 81371d5 - IMPALA-2736: Basic column-wise slot materialization in Parquet scanner. (7 days ago) <Alex Behm>
            * b38a5cd - IMPALA-3384: add missing frontend -> ext-data-source dependency. (7 days ago) <Misha Dmitriev>
            $ 
            

            Note too that a crash was not observed in runs over the weekend. I'll report if another crash is detected, of course.

            tarmstrong Tim Armstrong added a comment -

            The only likely candidate is the parquet scanner change, but I've done a pass over that code and can't see how it would return a bogus row. I'm trying to reproduce this locally but no luck so far.

            alex.behm Alexander Behm added a comment -

            I examined the core file and I think it's plausible that this issue has been caused by IMPALA-3528. The following evidence supports the claim:

            • query_status_ in the runtime state is set to MEM_LIMIT_EXCEEDED without further error details
            • the only place I know of that sets MEM_LIMIT_EXCEEDED without details is the disk-io-mgr, so it's very likely that a scanner thread got terminated unexpectedly/early
            • the crash is in UnnestNode::Open(), which does not do QueryMaintenance() or check for query cancellation (cancelled_ in the runtime state was false, though); see the sketch after this list
            • the problem in UnnestNode::Open() is that a "bad" Tuple* is dereferenced. The Tuple* looks valid because it is consistent with the addresses that other Tuple* in the same row batch point to. Therefore, it's plausible that those tuples pointed to valid memory at some point in time (as opposed to the Tuple* being completely garbage/bogus)
            • the MemPool of the containing row batch has no MemChunks at all which is also consistent with IMPALA-3528 (but could also have other reasons)
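
            For reference, a sketch of the guard referred to in the third bullet above (not a proposed patch; the macros and QueryMaintenance() are the ones other exec nodes use, the node name is a placeholder, and the surrounding code is elided):

            // Other exec nodes bail out of Open() when the query has already failed or
            // been cancelled, before touching row memory; UnnestNode::Open() does not.
            Status SomeExecNode::Open(RuntimeState* state) {
              RETURN_IF_ERROR(ExecNode::Open(state));
              RETURN_IF_CANCELLED(state);                  // cancellation check
              RETURN_IF_ERROR(QueryMaintenance(state));    // would surface query_status_
              // ... only now touch row memory such as current_input_row_ ...
              return Status::OK();
            }
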
            tarmstrong Tim Armstrong added a comment -

            It's possible that TCMalloc was aggressively unmapping memory at the time this was all going on (we force it to release memory when we're at the process mem limit).
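
            That would fit the evidence above: a pointer that was valid when the row batch was produced becomes a wild pointer once the pages behind it are returned to the OS (presumably via gperftools' MallocExtension::instance()->ReleaseFreeMemory()). A minimal, self-contained illustration of the mechanism, not Impala code, using mmap/munmap in place of TCMalloc releasing pages:

            #include <cstdint>
            #include <sys/mman.h>

            struct Tuple {
              // Mirrors Tuple::SetNull(): set a null-indicator bit at a byte offset.
              void SetNull(int byte_offset) {
                reinterpret_cast<uint8_t*>(this)[byte_offset] |= 1;
              }
            };

            int main() {
              // "Row batch" memory, mapped directly so it can be unmapped later, the
              // way free pages can be returned to the OS under memory pressure.
              void* pool = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              Tuple* tuple = static_cast<Tuple*>(pool);  // what the subplan's row points at
              tuple->SetNull(0);                         // fine: memory is still mapped

              munmap(pool, 4096);                        // backing memory goes away early

              tuple->SetNull(0);                         // unmapped memory -> SIGSEGV, like frame #6
              return 0;
            }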

            It sounds like it's quite likely fixed, so maybe we should close this out and continue monitoring the stress test for similar failures?

            dhecht Daniel Hecht added a comment -

            Yes, let's assume it's a dup of IMPALA-3528. Unfortunately we can't verify because we were never able to establish a repro rate.


            People

              Assignee: tarmstrong Tim Armstrong
              Reporter: mikeb Michael Brown
              Votes: 0
              Watchers: 4
