IMPALA / IMPALA-2252

Crash (likely race) tearing down BufferedBlockMgr on query failure



    Description

      When running a heavy workload on a 6-node cluster (this one happened to have RM enabled, but another cluster repro'd the crash without RM, with Kerberos enabled), a number of impalads crashed while tearing down the BufferedBlockMgr.

      The impalad crashes with the following stack:

      #9  impala::ErrorCount (errors=Cannot access memory at address 0xa00000008018
      ) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/util/error-util.cc:192
      #10 0x0000000000976f92 in impala::RuntimeState::LogError (this=0x8bc3db00, message=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/runtime-state.cc:229
      #11 0x00000000009886ae in impala::BufferedBlockMgr::WriteComplete (this=0x4931eb000, block=<value optimized out>, write_status=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/buffered-block-mgr.cc:778
      #12 0x000000000099e930 in operator() (this=0x46b80f00, status=...) at /usr/include/boost/function/function_template.hpp:1013
      #13 impala::DiskIoMgr::RequestContext::Cancel (this=0x46b80f00, status=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr-reader-context.cc:83
      #14 0x000000000099553c in impala::DiskIoMgr::CancelContext (this=<value optimized out>, context=0x46b80f00, wait_for_disks_completion=true) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr.cc:420
      #15 0x00000000009955ed in impala::DiskIoMgr::UnregisterContext (this=0x599f200, reader=0x46b80f00) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr.cc:388
      #16 0x0000000000989821 in impala::BufferedBlockMgr::~BufferedBlockMgr (this=0x4931eb000, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/buffered-block-mgr.cc:495
      #17 0x000000000098f162 in checked_delete<impala::BufferedBlockMgr> (this=<value optimized out>) at /usr/include/boost/checked_delete.hpp:34
      #18 boost::detail::sp_counted_impl_p<impala::BufferedBlockMgr>::dispose (this=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
      #19 0x0000000000979335 in release (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:145
      #20 ~shared_count (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:217
      #21 ~shared_ptr (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:169
      #22 reset (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:386
      #23 impala::RuntimeState::~RuntimeState (this=0xdaf52900, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/runtime-state.cc:95
      #24 0x0000000000be78b1 in checked_delete<impala::RuntimeState> (this=0xdaf55810, __in_chrg=<value optimized out>) at /usr/include/boost/checked_delete.hpp:34
      #25 ~scoped_ptr (this=0xdaf55810, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/scoped_ptr.hpp:80
      #26 impala::PlanFragmentExecutor::~PlanFragmentExecutor (this=0xdaf55810, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/plan-fragment-executor.cc:78
      #27 0x0000000000a2efee in ~FragmentExecState (x=0xdaf55600) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/service/fragment-exec-state.h:42
      #28 boost::checked_delete<impala::FragmentMgr::FragmentExecState> (x=0xdaf55600) at /usr/include/boost/checked_delete.hpp:34
      #29 0x00000000007766a9 in release (this=<value optimized out>, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:145
      #30 boost::detail::shared_count::~shared_count (this=<value optimized out>, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:217
      #31 0x0000000000a2d1d8 in ~shared_ptr (this=0x7273b00, exec_state=0xdaf55600) at /usr/include/boost/smart_ptr/shared_ptr.hpp:169
      #32 impala::FragmentMgr::FragmentExecThread (this=0x7273b00, exec_state=0xdaf55600) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/service/fragment-mgr.cc:100
      

      In the case I observed, this happened on the 6-node RM cluster when running 24 concurrent TPCDS streams across 2 YARN queues (12 streams in each queue) with preemption disabled. I don't have any reason to believe this is related to the llama-integration code itself; rather, I think the heavy RM workload caused some calls to MemTracker::TryConsume by the BufferedBlockMgr to fail (waiting 5 seconds and then timing out due to lack of resources), and those slower, failing calls exposed new possible race conditions by changing the interleaving between threads. This is just my theory; it still needs to be proven. The crash didn't happen when there was a single queue, and I suspect that is because splitting the resources across two queues without preemption left each queue more tightly constrained.

      Casey observed the same crash on a non-RM cluster that had Kerberos enabled.
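
      One way to read frames #10-#23 above: the BufferedBlockMgr destructor unregisters its DiskIoMgr context, cancellation synchronously invokes the WriteComplete callback for the in-flight writes, and that callback calls RuntimeState::LogError on a RuntimeState whose this pointer (frame #10) differs from the RuntimeState being destroyed in frame #23, i.e. on state that may already have been torn down. Below is a minimal standalone sketch of that kind of lifetime hazard; the names (FakeRuntimeState, FakeBlockMgr) are entirely hypothetical and this is only an illustration of the suspected pattern, not Impala code:

      // Hypothetical sketch of the suspected use-after-free: a completion
      // callback held by a shared block manager dereferences the RuntimeState
      // that created it, after that state has been destroyed.
      #include <functional>
      #include <memory>
      #include <string>
      #include <vector>

      // Stand-in for RuntimeState: owns the error log that LogError appends to.
      class FakeRuntimeState {
       public:
        void LogError(const std::string& msg) { error_log_.push_back(msg); }
       private:
        std::vector<std::string> error_log_;
      };

      // Stand-in for BufferedBlockMgr plus its DiskIoMgr context: holds the
      // completion callbacks of in-flight writes and runs them when it is
      // destroyed, the way ~BufferedBlockMgr -> UnregisterContext -> Cancel ->
      // WriteComplete does in the stack above.
      class FakeBlockMgr {
       public:
        explicit FakeBlockMgr(FakeRuntimeState* creator_state) : state_(creator_state) {}

        // Queue an "async write" whose completion callback logs through the
        // RuntimeState that created this block manager.
        void QueueWrite() {
          pending_.push_back([this]() { state_->LogError("write cancelled"); });
        }

        ~FakeBlockMgr() {
          // Cancellation synchronously invokes every pending completion
          // callback; nothing here guarantees *state_ is still alive.
          for (auto& write_complete : pending_) write_complete();
        }

       private:
        FakeRuntimeState* state_;  // raw pointer, no ownership
        std::vector<std::function<void()>> pending_;
      };

      int main() {
        auto state_a = std::make_unique<FakeRuntimeState>();             // fragment A's state
        auto block_mgr = std::make_shared<FakeBlockMgr>(state_a.get());  // shared across fragments
        block_mgr->QueueWrite();

        state_a.reset();    // fragment A is torn down first (e.g. on query failure)
        block_mgr.reset();  // last fragment drops the block mgr: the callback now
                            // calls LogError on freed memory, as in frame #10
        return 0;
      }

      In the sketch the dangling pointer makes the use-after-free deterministic; in the real crash it would presumably depend on which fragment tears down first, which fits the theory above that slower, failing TryConsume calls changed the interleaving between threads.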




      People

        Assignee: Sailesh Mukil (sailesh)
        Reporter: Matthew Jacobs (mjacobs)
