Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Impala 2.5.0
Description
While running the stress tests with a custom patch for IMPALA-2592, I'm hitting a crash in DoRpc() with the following stack:
Stack: [0x00007fab64253000,0x00007fab64c54000], sp=0x00007fab64c51f90, free space=10235k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [impalad+0x10163c4] impala::Status impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::* const&)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+0x108
C [impalad+0x1015596] impala::FragmentMgr::FragmentExecState::ReportStatusCb(impala::Status const&, impala::RuntimeProfile*, bool)+0x598
C [impalad+0x100fe97] boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>::operator()(impala::FragmentMgr::FragmentExecState*, impala::Status const&, impala::RuntimeProfile*, bool) const+0x7d
C [impalad+0x100f65a] void boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> >::operator()<boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list3<impala::Status const&, impala::RuntimeProfile*&, bool&> >(boost::_bi::type<void>, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>&, boost::_bi::list3<impala::Status const&, impala::RuntimeProfile*&, bool&>&, int)+0xa8
C [impalad+0x100efed] void boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >::operator()<impala::Status const, impala::RuntimeProfile*, bool>(impala::Status const&, impala::RuntimeProfile*&, bool&)+0x53
C [impalad+0x100eacd] boost::detail::function::void_function_obj_invoker3<boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >, void, impala::Status const&, impala::RuntimeProfile*, bool>::invoke(boost::detail::function::function_buffer&, impala::Status const&, impala::RuntimeProfile*, bool)+0x39
C [impalad+0x13fa176] boost::function3<void, impala::Status const&, impala::RuntimeProfile*, bool>::operator()(impala::Status const&, impala::RuntimeProfile*, bool) const+0x68
C [impalad+0x13f7e55] impala::PlanFragmentExecutor::SendReport(bool)+0x10b
C [impalad+0x13f7aef] impala::PlanFragmentExecutor::ReportProfile()+0x6bf
C [impalad+0x13fbc4b] boost::_mfi::mf0<void, impala::PlanFragmentExecutor>::operator()(impala::PlanFragmentExecutor*) const+0x65
C [impalad+0x13fb992] void boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> >::operator()<boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>&, boost::_bi::list0&, int)+0x4a
C [impalad+0x13fb5f7] boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >::operator()()+0x3b
C [impalad+0x13fb3bc] boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >, void>::invoke(boost::detail::function::function_buffer&)+0x20
C [impalad+0xe1fc76] boost::function0<void>::operator()() const+0x52
C [impalad+0x10cd6b9] impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*)+0x2c5
C [impalad+0x10d4d34] void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list0&, int)+0xb2
C [impalad+0x10d4c77] boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()()+0x3b
C [impalad+0x10d4c3a] boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>* > > > >::run()+0x1e
Looking at the disassembly, it fails here:
Dump of assembler code for function impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*):
   0x00000000014162bc <+0>: push %rbp
   0x00000000014162bd <+1>: mov %rsp,%rbp
   0x00000000014162c0 <+4>: push %r13
   0x00000000014162c2 <+6>: push %r12
   0x00000000014162c4 <+8>: push %rbx
   0x00000000014162c5 <+9>: sub $0xf8,%rsp
   0x00000000014162cc <+16>: mov %rdi,-0xe8(%rbp)
   0x00000000014162d3 <+23>: mov %rsi,-0xf0(%rbp)
   0x00000000014162da <+30>: mov %rdx,-0xf8(%rbp)
   0x00000000014162e1 <+37>: mov %rcx,-0x100(%rbp)
   0x00000000014162e8 <+44>: mov %r8,-0x108(%rbp)
   0x00000000014162ef <+51>: cmpq $0x0,-0x108(%rbp)
   0x00000000014162f7 <+59>: sete %al
   0x00000000014162fa <+62>: movzbl %al,%eax
   0x00000000014162fd <+65>: mov $0x0,%ebx
   0x0000000001416302 <+70>: mov $0x0,%r12d
   0x0000000001416308 <+76>: test %rax,%rax
   0x000000000141630b <+79>: je 0x1416375 <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+185>
   0x000000000141630d <+81>: lea -0xe0(%rbp),%rax
   0x0000000001416314 <+88>: mov $0xe3,%edx
   0x0000000001416319 <+93>: lea 0xf36628(%rip),%rsi # 0x234c948
   0x0000000001416320 <+100>: mov %rax,%rdi
   0x0000000001416323 <+103>: callq 0x223bce0 <_ZN6google15LogMessageFatalC2EPKci>
   0x0000000001416328 <+108>: mov $0x1,%ebx
   0x000000000141632d <+113>: lea -0xe0(%rbp),%rax
   0x0000000001416334 <+120>: mov %rax,%rdi
   0x0000000001416337 <+123>: callq 0x106eab6 <google::LogMessage::stream()>
   0x000000000141633c <+128>: lea 0xf36695(%rip),%rsi # 0x234c9d8
   0x0000000001416343 <+135>: mov %rax,%rdi
   0x0000000001416346 <+138>: callq 0x1018790 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt>
   0x000000000141634b <+143>: mov %rax,%r13
   0x000000000141634e <+146>: lea -0xc1(%rbp),%rax
   0x0000000001416355 <+153>: mov %rax,%rdi
   0x0000000001416358 <+156>: callq 0x106eacc <google::LogMessageVoidify::LogMessageVoidify()>
   0x000000000141635d <+161>: mov $0x1,%r12d
   0x0000000001416363 <+167>: lea -0xc1(%rbp),%rax
   0x000000000141636a <+174>: mov %r13,%rsi
   0x000000000141636d <+177>: mov %rax,%rdi
   0x0000000001416370 <+180>: callq 0x106ead6 <google::LogMessageVoidify::operator&(std::ostream&)>
   0x0000000001416375 <+185>: test %r12b,%r12b
   0x0000000001416378 <+188>: test %bl,%bl
   0x000000000141637a <+190>: je 0x141638c <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+208>
   0x000000000141637c <+192>: nop
   0x000000000141637d <+193>: lea -0xe0(%rbp),%rax
   0x0000000001416384 <+200>: mov %rax,%rdi
   0x0000000001416387 <+203>: callq 0x223bd00 <_ZN6google15LogMessageFatalD2Ev>
   0x000000000141638c <+208>: nop
   0x000000000141638d <+209>: mov -0xf8(%rbp),%rax
   0x0000000001416394 <+216>: mov (%rax),%rax
   0x0000000001416397 <+219>: and $0x1,%eax
   0x000000000141639a <+222>: test %rax,%rax
   0x000000000141639d <+225>: jne 0x14163ab <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+239>
   0x000000000141639f <+227>: mov -0xf8(%rbp),%rax
   0x00000000014163a6 <+234>: mov (%rax),%rax
   0x00000000014163a9 <+237>: jmp 0x14163db <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+287>
   0x00000000014163ab <+239>: mov -0xf0(%rbp),%rax
   0x00000000014163b2 <+246>: mov 0x8(%rax),%rdx
   0x00000000014163b6 <+250>: mov -0xf8(%rbp),%rax
   0x00000000014163bd <+257>: mov 0x8(%rax),%rax
   0x00000000014163c1 <+261>: add %rdx,%rax
=> 0x00000000014163c4 <+264>: mov (%rax),%rdx
   0x00000000014163c7 <+267>: mov -0xf8(%rbp),%rax
   0x00000000014163ce <+274>: mov (%rax),%rax
   0x00000000014163d1 <+277>: sub $0x1,%rax
   0x00000000014163d5 <+281>: add %rdx,%rax
   0x00000000014163d8 <+284>: mov (%rax),%rax
   0x00000000014163db <+287>: mov -0xf0(%rbp),%rdx
   0x00000000014163e2 <+294>: mov 0x8(%rdx),%rcx
   0x00000000014163e6 <+298>: mov -0xf8(%rbp),%rdx
   0x00000000014163ed <+305>: mov 0x8(%rdx),%rdx
   0x00000000014163f1 <+309>: lea (%rcx,%rdx,1),%rdi
   0x00000000014163f5 <+313>: mov -0x100(%rbp),%rdx
   0x00000000014163fc <+320>: mov -0x108(%rbp),%rcx
   0x0000000001416403 <+327>: mov %rcx,%rsi
   0x0000000001416406 <+330>: callq *%rax
   0x0000000001416408 <+332>: mov -0xe8(%rbp),%rax
   0x000000000141640f <+339>: mov %rax,%rdi
   0x0000000001416412 <+342>: callq 0x108027f <impala::Status::OK()>
At that point, 'client_' (the member of class ClientConnection) should be in $rdx, but $rdx is NULL, which causes the crash. This is odd, and there is no reasonable explanation yet for why it happens.
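For readers less familiar with this code shape: the instructions around the faulting address look like a call through a pointer to member function on 'client_' (test the low bit of the pointer, then load the vtable from the adjusted object pointer), so a NULL 'client_' would fault on exactly that load. The following is a minimal sketch of that pattern, not Impala's actual code. FakeClient, FakeConnection and the bool return value are hypothetical stand-ins invented for illustration, and the null guard is only there to show where the crashing path dereferences the pointer, not a proposed fix.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a Thrift-generated client; not Impala's class.
struct FakeClient {
  virtual ~FakeClient() = default;
  // Virtual, so the call below has to read the vtable through the object pointer.
  virtual void ReportExecStatus(std::string& result, const std::string& params) {
    result = "ack:" + params;
  }
};

// Hypothetical stand-in for ClientConnection<T>; only the call shape matters.
template <typename Client>
class FakeConnection {
 public:
  explicit FakeConnection(Client* client) : client_(client) {}

  // Same shape as DoRpc: invoke a member function of the client through a
  // pointer to member. Returns false instead of dereferencing a null client_.
  template <typename F, typename Request, typename Response>
  bool DoRpc(const F& f, const Request& request, Response* response) {
    assert(response != nullptr);  // corresponds to the check at the top of the dump
    if (client_ == nullptr) return false;  // without this, the next line would crash
    (client_->*f)(*response, request);     // virtual dispatch reads *client_
    return true;
  }

 private:
  Client* client_;  // the pointer that was observed to be NULL in the report
};
```

With a valid client the call goes through and fills the response; with a null client the guarded version returns false where the unguarded one would segfault on the vtable load.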
The crash does not occur immediately; it happens only after remote nodes become unreachable, which under the conditions of the following run happens after ~2 hours:
http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-EC2-CDH5-trunk/514/parameters/
So far, the crash does not seem to be caused by the patch for IMPALA-2592 itself; rather, the patch seems to expose an existing bug. I'm still digging into the cause and will update once I have more information.
This crash does not show up without the patch. (The patch is at http://gerrit.cloudera.org:8080/#/c/2205/7)