IMPALA-5173

Crash when NestedLoopJoin has HashJoin feeding directly into its right side

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 2.7.0, Impala 2.8.0, Impala 2.9.0
    • Fix Version/s: Impala 2.9.0
    • Component/s: Backend
    • Labels:
      None

      Description

      The repro is:

      use tpch_parquet;
      set num_nodes=1;
      set max_block_mgr_memory=300m;
      select straight_join * 
      from (select * from orders limit 1) o,
         (select l1.* from lineitem l1 inner join lineitem l2 on l1.l_orderkey = l2.l_orderkey) v;
      

      Plan:

      +-------------------------------------------------------+
      | Explain String                                        |
      +-------------------------------------------------------+
      | Estimated Per-Host Requirements: Memory=0B VCores=0   |
      |                                                       |
      | PLAN-ROOT SINK                                        |
      | |                                                     |
      | 04:NESTED LOOP JOIN [CROSS JOIN]                      |
      | |  hosts=2 per-host-mem=unavailable                   |
      | |  tuple-ids=0,2,3 row-size=462B cardinality=6001215  |
      | |                                                     |
      | |--03:HASH JOIN [INNER JOIN]                          |
      | |  |  hash predicates: l1.l_orderkey = l2.l_orderkey  |
      | |  |  runtime filters: RF000 <- l2.l_orderkey         |
      | |  |  hosts=3 per-host-mem=unavailable                |
      | |  |  tuple-ids=2,3 row-size=271B cardinality=6001215 |
      | |  |                                                  |
      | |  |--02:SCAN HDFS [tpch_parquet.lineitem l2]         |
      | |  |     partitions=1/1 files=3 size=193.61MB         |
      | |  |     table stats: 6001215 rows total              |
      | |  |     column stats: all                            |
      | |  |     hosts=3 per-host-mem=unavailable             |
      | |  |     tuple-ids=3 row-size=8B cardinality=6001215  |
      | |  |                                                  |
      | |  01:SCAN HDFS [tpch_parquet.lineitem l1]            |
      | |     partitions=1/1 files=3 size=193.61MB            |
      | |     runtime filters: RF000 -> l1.l_orderkey         |
      | |     table stats: 6001215 rows total                 |
      | |     column stats: all                               |
      | |     hosts=3 per-host-mem=unavailable                |
      | |     tuple-ids=2 row-size=263B cardinality=6001215   |
      | |                                                     |
      | 00:SCAN HDFS [tpch_parquet.orders]                    |
      |    partitions=1/1 files=2 size=54.00MB                |
      |    table stats: 1500000 rows total                    |
      |    column stats: all                                  |
      |    limit: 1                                           |
      |    hosts=2 per-host-mem=unavailable                   |
      |    tuple-ids=0 row-size=191B cardinality=1            |
      +-------------------------------------------------------+
      
      

      Summary before crash:

      +---------------------+--------+----------+----------+--------+------------+-----------+---------------+--------------------------+
      | Operator            | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                   |
      +---------------------+--------+----------+----------+--------+------------+-----------+---------------+--------------------------+
      | 04:NESTED LOOP JOIN | 1      | 0ns      | 0ns      | 0      | 6.00M      | 6.19 GB   | -1 B          | CROSS JOIN               |
      | |--03:HASH JOIN     | 1      | 4.17s    | 4.17s    | 13.28M | 6.00M      | 312.07 MB | -1 B          | INNER JOIN               |
      | |  |--02:SCAN HDFS  | 1      | 33.74ms  | 33.74ms  | 6.00M  | 6.00M      | 14.98 MB  | -1 B          | tpch_parquet.lineitem l2 |
      | |  01:SCAN HDFS     | 1      | 795.87ms | 795.87ms | 6.00M  | 6.00M      | 259.72 MB | -1 B          | tpch_parquet.lineitem l1 |
      | 00:SCAN HDFS        | 1      | 291.46us | 291.46us | 0      | 1          | 0 B       | -1 B          | tpch_parquet.orders      |
      +---------------------+--------+----------+----------+--------+------------+-----------+---------------+--------------------------+
      

      This DCHECK is hit:

      F0405 16:02:34.806744 21512 mem-tracker.cc:194] Check failed: consumption_->current_value() == 0 (17367040 vs. 0) Hash Join Builder (join_node_id=3)
          @          0x1396425  impala::MemTracker::~MemTracker()
          @          0x137ee3c  boost::checked_delete<>()
          @          0x137cd15  boost::scoped_ptr<>::~scoped_ptr()
          @          0x137d23f  boost::scoped_ptr<>::reset()
          @          0x1855be7  impala::DataSink::Close()
          @          0x184692d  impala::PhjBuilder::Close()
          @          0x17c1190  impala::PartitionedHashJoinNode::Close()
          @          0x1823edd  impala::BlockingJoinNode::ProcessBuildInputAsync()
          @          0x18276df  boost::_mfi::mf3<>::operator()()
          @          0x182758b  boost::_bi::list4<>::operator()<>()
          @          0x18272d7  boost::_bi::bind_t<>::operator()()
          @          0x1826fd0  boost::detail::function::void_function_obj_invoker0<>::invoke()
          @          0x137bd48  boost::function0<>::operator()()
          @          0x162e363  impala::Thread::SuperviseThread()
          @          0x1636d3e  boost::_bi::list4<>::operator()<>()
          @          0x1636c81  boost::_bi::bind_t<>::operator()()
          @          0x1636c44  boost::detail::thread_data<>::run()
          @          0x1af270a  thread_proxy
          @     0x7fc2538ff6ba  start_thread
          @     0x7fc25363582d  clone
      
      Hash Join Builder (join_node_id=3): Total=16.56 MB Peak=310.08 MB
      *** Check failure stack trace: ***
          @          0x28c208d  google::LogMessage::Fail()
          @          0x28c3932  google::LogMessage::SendToLog()
          @          0x28c1a67  google::LogMessage::Flush()
          @          0x28c502e  google::LogMessageFatal::~LogMessageFatal()
          @          0x13964c7  impala::MemTracker::~MemTracker()
          @          0x137ee3c  boost::checked_delete<>()
          @          0x137cd15  boost::scoped_ptr<>::~scoped_ptr()
          @          0x137d23f  boost::scoped_ptr<>::reset()
          @          0x1855be7  impala::DataSink::Close()
          @          0x184692d  impala::PhjBuilder::Close()
          @          0x17c1190  impala::PartitionedHashJoinNode::Close()
          @          0x1823edd  impala::BlockingJoinNode::ProcessBuildInputAsync()
          @          0x18276df  boost::_mfi::mf3<>::operator()()
          @          0x182758b  boost::_bi::list4<>::operator()<>()
          @          0x18272d7  boost::_bi::bind_t<>::operator()()
          @          0x1826fd0  boost::detail::function::void_function_obj_invoker0<>::invoke()
          @          0x137bd48  boost::function0<>::operator()()
          @          0x162e363  impala::Thread::SuperviseThread()
          @          0x1636d3e  boost::_bi::list4<>::operator()<>()
          @          0x1636c81  boost::_bi::bind_t<>::operator()()
          @          0x1636c44  boost::detail::thread_data<>::run()
          @          0x1af270a  thread_proxy
          @     0x7fc2538ff6ba  start_thread
          @     0x7fc25363582d  clone
      Picked up JAVA_TOOL_OPTIONS: -agentlib:jdwp=transport=dt_socket,address=30000,server=y,suspend=n 
      Wrote minidump to /home/tarmstrong/Impala/incubator-impala/logs/cluster/minidumps/impalad/25a8ecae-b4e1-9645-449882d5-58049f7f.dmp
      
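      The failed check at mem-tracker.cc:194 enforces an invariant: every byte consumed against a tracker must be released before the tracker is destroyed. A minimal illustrative sketch of that invariant (this is not Impala's actual MemTracker class, just a model of the check):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Illustrative sketch only -- not Impala's MemTracker. It models the
      // invariant behind the DCHECK above: by the time a tracker is
      // destroyed, all consumption charged to it must have been released.
      class MemTrackerSketch {
       public:
        void Consume(int64_t bytes) { consumption_ += bytes; }
        void Release(int64_t bytes) { consumption_ -= bytes; }
        int64_t consumption() const { return consumption_; }

        ~MemTrackerSketch() {
          // Equivalent of "Check failed: consumption_->current_value() == 0":
          // in the crash above, 17367040 bytes were still charged to the
          // Hash Join Builder's tracker when it was torn down.
          assert(consumption_ == 0 && "tracker destroyed with live consumption");
        }

       private:
        int64_t consumption_ = 0;
      };
      ```

      In the crash, the hash join builder's tracker is destroyed in DataSink::Close() while blocks handed off to the nested loop join are still charged to it, so the debug-build check fires.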

      Backtrace:

      (gdb) bt
      #0  0x00007fc253564428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
      #1  0x00007fc25356602a in __GI_abort () at abort.c:89
      #2  0x00000000028cb614 in google::DumpStackTraceAndExit() ()
      #3  0x00000000028c208d in google::LogMessage::Fail() ()
      #4  0x00000000028c3932 in google::LogMessage::SendToLog() ()
      #5  0x00000000028c1a67 in google::LogMessage::Flush() ()
      #6  0x00000000028c502e in google::LogMessageFatal::~LogMessageFatal() ()
      #7  0x00000000013964c7 in impala::MemTracker::~MemTracker (this=0x9b1c780, __in_chrg=<optimized out>) at be/src/runtime/mem-tracker.cc:194
      #8  0x000000000137ee3c in boost::checked_delete<impala::MemTracker> (x=0x9b1c780) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/core/checked_delete.hpp:34
      #9  0x000000000137cd15 in boost::scoped_ptr<impala::MemTracker>::~scoped_ptr (this=0x7fc1b82607c0, __in_chrg=<optimized out>) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/smart_ptr/scoped_ptr.hpp:82
      #10 0x000000000137d23f in boost::scoped_ptr<impala::MemTracker>::reset (this=0x850d6e0, p=0x0) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/smart_ptr/scoped_ptr.hpp:88
      #11 0x0000000001855be7 in impala::DataSink::Close (this=0x850d6c0, state=0x81b1800) at be/src/exec/data-sink.cc:197
      #12 0x000000000184692d in impala::PhjBuilder::Close (this=0x850d6c0, state=0x81b1800) at be/src/exec/partitioned-hash-join-builder.cc:235
      #13 0x00000000017c1190 in impala::PartitionedHashJoinNode::Close (this=0x880ae00, state=0x81b1800) at be/src/exec/partitioned-hash-join-node.cc:248
      #14 0x0000000001823edd in impala::BlockingJoinNode::ProcessBuildInputAsync (this=0xda33e00, state=0x81b1800, build_sink=0x8d118e0, status=0x7fc1baa6ba00) at be/src/exec/blocking-join-node.cc:161
      #15 0x00000000018276df in boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>::operator() (this=0x8e19b60, p=0xda33e00, a1=0x81b1800, a2=0x8d118e0,
          a3=0x7fc1baa6ba00) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/mem_fn_template.hpp:393
      #16 0x000000000182758b in boost::_bi::list4<boost::_bi::value<impala::BlockingJoinNode*>, boost::_bi::value<impala::RuntimeState*>, boost::_bi::value<impala::DataSink*>, boost::_bi::value<impala::Promise<impala::Status>*> >::operator()<boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>, boost::_bi::list0> (this=0x8e19b70, f=..., a=...)
          at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
      #17 0x00000000018272d7 in boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>, boost::_bi::list4<boost::_bi::value<impala::BlockingJoinNode*>, boost::_bi::value<impala::RuntimeState*>, boost::_bi::value<impala::DataSink*>, boost::_bi::value<impala::Promise<impala::Status>*> > >::operator() (this=0x8e19b60)
          at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #18 0x0000000001826fd0 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::BlockingJoinNode, impala::RuntimeState*, impala::DataSink*, impala::Promise<impala::Status>*>, boost::_bi::list4<boost::_bi::value<impala::BlockingJoinNode*>, boost::_bi::value<impala::RuntimeState*>, boost::_bi::value<impala::DataSink*>, boost::_bi::value<impala::Promise<impala::Status>*> > >, void>::invoke
          (function_obj_ptr=...) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:153
      #19 0x000000000137bd48 in boost::function0<void>::operator() (this=0x7fc1b8260d60) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
      #20 0x000000000162e363 in impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) (name="build thread", category="NestedLoopJoinNode", functor=...,
          thread_started=0x7fc1baa6b7f0) at be/src/util/thread.cc:325
      #21 0x0000000001636d3e in boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list0&, int) (this=0xda309c0, f=@0xda309b8: 0x162e09e <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*)>, a=...)
          at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
      #22 0x0000000001636c81 in boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()() (this=0xda309b8) at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #23 0x0000000001636c44 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>*> > > >::run() (this=0xda30800)
          at /opt/Impala-Toolchain/boost-1.57.0-p1/include/boost/thread/detail/thread.hpp:116
      #24 0x0000000001af270a in thread_proxy ()
      #25 0x00007fc2538ff6ba in start_thread (arg=0x7fc1b8261700) at pthread_create.c:333
      #26 0x00007fc25363582d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
      

      The root cause (I believe) is the nested loop join accumulating BufferedBlockMgr::Blocks attached to row batches.
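      As an illustrative sketch (the names below are hypothetical, not Impala's API), the accounting problem can be modeled as a block's consumption staying pinned to the tracker that allocated it, so handing the batch to the join moves the data but not the accounting:

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Hypothetical sketch -- not Impala's classes. A block attached to a
      // row batch remains charged to the tracker that allocated it; passing
      // the batch to another operator moves the pointer, not the accounting.
      struct TrackerSketch {
        int64_t consumption = 0;
      };

      struct AttachedBlockSketch {
        TrackerSketch* charged_to;  // accounting stays with the allocator
        int64_t bytes;
      };

      // "Transferring" the batch hands over the block but leaves its bytes
      // charged to the original tracker -- the mismatch behind the crash.
      inline AttachedBlockSketch TransferToJoin(AttachedBlockSketch block) {
        return block;  // charged_to is unchanged
      }
      ```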

        Activity

        tarmstrong Tim Armstrong added a comment -

        Downgrading. After understanding it more I think this isn't severe enough to be a blocker, because:

        • The plan shapes are pretty unusual
        • It's hitting a DCHECK but in a release build it would manifest as a query failure due to OOM
        tarmstrong Tim Armstrong added a comment -

        IMPALA-5173: crash with hash join feeding directly into nlj

        The background for this bug is that we can't transfer ownership
        of BufferedBlockMgr::Blocks that are attached to RowBatches.

        The NestedLoopJoinNode accumulates row batches on its right
        side and tries to take ownership of the memory, which doesn't
        work as expected in this case.

        The fix is to copy the data when we encounter one of these
        (likely very rare) cases.

        Testing:
        Added a regression test that produces a crash before the fix and
        succeeds after the fix.

        Change-Id: I0c04952e591d17e5ff7e994884be4c4c899ae192
        Reviewed-on: http://gerrit.cloudera.org:8080/6568
        Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
        Tested-by: Impala Public Jenkins
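
        The copy-instead-of-transfer idea described in the commit message can be sketched as follows (names are illustrative, not Impala's actual RowBatch/NLJ API):

        ```cpp
        #include <cassert>
        #include <memory>
        #include <string>
        #include <vector>

        // Hedged sketch of the fix's approach, with hypothetical names:
        // when an incoming right-side batch carries buffers whose ownership
        // cannot be transferred between trackers, copy the row data into
        // memory the join owns instead of attaching the batch itself.
        struct BatchSketch {
          std::vector<std::string> rows;
          bool has_nontransferable_buffers = false;
        };

        class NljBuildSideSketch {
         public:
          void AddBatch(std::unique_ptr<BatchSketch> batch) {
            if (batch->has_nontransferable_buffers) {
              // Rare path: deep-copy the rows so the original buffers can
              // be released by their owner, keeping accounting consistent.
              auto copy = std::make_unique<BatchSketch>();
              copy->rows = batch->rows;  // copies the underlying bytes
              batches_.push_back(std::move(copy));
            } else {
              // Common path: take ownership of the batch directly.
              batches_.push_back(std::move(batch));
            }
          }

          size_t num_rows() const {
            size_t n = 0;
            for (const auto& b : batches_) n += b->rows.size();
            return n;
          }

         private:
          std::vector<std::unique_ptr<BatchSketch>> batches_;
        };
        ```

        The copy path costs extra memory and CPU, but as the commit notes, plan shapes that hit it are rare, so the common transfer path is unaffected.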


          People

          • Assignee:
            tarmstrong Tim Armstrong
          • Reporter:
            tarmstrong Tim Armstrong