IMPALA / IMPALA-2252

Crash (likely race) tearing down BufferedBlockMgr on query failure



    Description

      When running a heavy workload on a 6-node cluster (this one happened to have RM enabled, but another cluster repro'd the crash without RM, with Kerberos enabled), a number of impalads crashed while tearing down the BufferedBlockMgr.

      The impalad crashes with the following stack:

      #9  impala::ErrorCount (errors=Cannot access memory at address 0xa00000008018
      ) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/util/error-util.cc:192
      #10 0x0000000000976f92 in impala::RuntimeState::LogError (this=0x8bc3db00, message=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/runtime-state.cc:229
      #11 0x00000000009886ae in impala::BufferedBlockMgr::WriteComplete (this=0x4931eb000, block=<value optimized out>, write_status=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/buffered-block-mgr.cc:778
      #12 0x000000000099e930 in operator() (this=0x46b80f00, status=...) at /usr/include/boost/function/function_template.hpp:1013
      #13 impala::DiskIoMgr::RequestContext::Cancel (this=0x46b80f00, status=...) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr-reader-context.cc:83
      #14 0x000000000099553c in impala::DiskIoMgr::CancelContext (this=<value optimized out>, context=0x46b80f00, wait_for_disks_completion=true) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr.cc:420
      #15 0x00000000009955ed in impala::DiskIoMgr::UnregisterContext (this=0x599f200, reader=0x46b80f00) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/disk-io-mgr.cc:388
      #16 0x0000000000989821 in impala::BufferedBlockMgr::~BufferedBlockMgr (this=0x4931eb000, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/buffered-block-mgr.cc:495
      #17 0x000000000098f162 in checked_delete<impala::BufferedBlockMgr> (this=<value optimized out>) at /usr/include/boost/checked_delete.hpp:34
      #18 boost::detail::sp_counted_impl_p<impala::BufferedBlockMgr>::dispose (this=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
      #19 0x0000000000979335 in release (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:145
      #20 ~shared_count (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:217
      #21 ~shared_ptr (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:169
      #22 reset (this=0xdaf52900, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/shared_ptr.hpp:386
      #23 impala::RuntimeState::~RuntimeState (this=0xdaf52900, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/runtime-state.cc:95
      #24 0x0000000000be78b1 in checked_delete<impala::RuntimeState> (this=0xdaf55810, __in_chrg=<value optimized out>) at /usr/include/boost/checked_delete.hpp:34
      #25 ~scoped_ptr (this=0xdaf55810, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/scoped_ptr.hpp:80
      #26 impala::PlanFragmentExecutor::~PlanFragmentExecutor (this=0xdaf55810, __in_chrg=<value optimized out>) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/runtime/plan-fragment-executor.cc:78
      #27 0x0000000000a2efee in ~FragmentExecState (x=0xdaf55600) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/service/fragment-exec-state.h:42
      #28 boost::checked_delete<impala::FragmentMgr::FragmentExecState> (x=0xdaf55600) at /usr/include/boost/checked_delete.hpp:34
      #29 0x00000000007766a9 in release (this=<value optimized out>, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:145
      #30 boost::detail::shared_count::~shared_count (this=<value optimized out>, __in_chrg=<value optimized out>) at /usr/include/boost/smart_ptr/detail/shared_count.hpp:217
      #31 0x0000000000a2d1d8 in ~shared_ptr (this=0x7273b00, exec_state=0xdaf55600) at /usr/include/boost/smart_ptr/shared_ptr.hpp:169
      #32 impala::FragmentMgr::FragmentExecThread (this=0x7273b00, exec_state=0xdaf55600) at /data/3/jenkins/workspace/impala-master-64bit-PRIVATE-fast/repos/Impala/be/src/service/fragment-mgr.cc:100
      

      In the case I observed, this happened on the 6-node RM cluster when running 24 concurrent TPCDS streams across 2 YARN queues (12 streams in each queue) with preemption disabled. I don't have any reason to believe this is related to the llama-integration code itself; rather, I think the heavy RM workload caused some calls to MemTracker::TryConsume by the BufferedBlockMgr to fail (waiting 5 seconds and then timing out due to lack of resources), and those slower, failing calls exposed new possible race conditions by changing the interleaving between threads. This is just my theory; it still needs to be proven. The crash didn't happen when there was a single queue, and I suspect that is because splitting the resources across two queues without preemption left each queue more tightly constrained.

      Casey observed the same crash on a non-RM cluster that had Kerberos enabled.
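
      One way to read frames #10-#23 above: the BufferedBlockMgr destructor unregisters its DiskIoMgr context, cancellation synchronously invokes the WriteComplete callback for the in-flight writes, and that callback calls RuntimeState::LogError on a RuntimeState whose this pointer (frame #10) differs from the RuntimeState being destroyed in frame #23, i.e. on state that may already have been torn down. Below is a minimal standalone sketch of that kind of lifetime hazard; the names (FakeRuntimeState, FakeBlockMgr) are entirely hypothetical and this is only an illustration of the suspected pattern, not Impala code:

      // Hypothetical sketch of the suspected use-after-free: a completion
      // callback held by a shared block manager dereferences the RuntimeState
      // that created it, after that state has been destroyed.
      #include <functional>
      #include <memory>
      #include <string>
      #include <vector>

      // Stand-in for RuntimeState: owns the error log that LogError appends to.
      class FakeRuntimeState {
       public:
        void LogError(const std::string& msg) { error_log_.push_back(msg); }
       private:
        std::vector<std::string> error_log_;
      };

      // Stand-in for BufferedBlockMgr plus its DiskIoMgr context: holds the
      // completion callbacks of in-flight writes and runs them when it is
      // destroyed, the way ~BufferedBlockMgr -> UnregisterContext -> Cancel ->
      // WriteComplete does in the stack above.
      class FakeBlockMgr {
       public:
        explicit FakeBlockMgr(FakeRuntimeState* creator_state) : state_(creator_state) {}

        // Queue an "async write" whose completion callback logs through the
        // RuntimeState that created this block manager.
        void QueueWrite() {
          pending_.push_back([this]() { state_->LogError("write cancelled"); });
        }

        ~FakeBlockMgr() {
          // Cancellation synchronously invokes every pending completion
          // callback; nothing here guarantees *state_ is still alive.
          for (auto& write_complete : pending_) write_complete();
        }

       private:
        FakeRuntimeState* state_;  // raw pointer, no ownership
        std::vector<std::function<void()>> pending_;
      };

      int main() {
        auto state_a = std::make_unique<FakeRuntimeState>();             // fragment A's state
        auto block_mgr = std::make_shared<FakeBlockMgr>(state_a.get());  // shared across fragments
        block_mgr->QueueWrite();

        state_a.reset();    // fragment A is torn down first (e.g. on query failure)
        block_mgr.reset();  // last fragment drops the block mgr: the callback now
                            // calls LogError on freed memory, as in frame #10
        return 0;
      }

      In the sketch the dangling pointer makes the use-after-free deterministic; in the real crash it would presumably depend on which fragment tears down first, which fits the theory above that slower, failing TryConsume calls changed the interleaving between threads.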




      People

        Assignee: Sailesh Mukil (sailesh)
        Reporter: Matthew Jacobs (mjacobs)
