Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4890

stress crash: Coordinator race between TearDown() and GetNext() (crash dereferencing executor_)

    Details

      Description

      The stress test had several successful runs (including with COMPUTE STATS statements) over the weekend at this commit:

      IMPALA-4876: Remove _test suffix from test names
      

      ... but one of the runs crashed here:

      0  0x00000039c6432625 in raise () from /lib64/libc.so.6
      #1  0x00000039c6433e05 in abort () from /lib64/libc.so.6
      #2  0x00007fb4c8c80a55 in os::abort(bool) () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #3  0x00007fb4c8e00f87 in VMError::report_and_die() () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #4  0x00007fb4c8c8596f in JVM_handle_linux_signal () from /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so
      #5  <signal handler called>
      #6  0x00000000019aaec2 in boost::scoped_ptr<impala::RuntimeState>::get (this=0x3d8)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/smart_ptr/scoped_ptr.hpp:105
      #7  0x00000000019a9598 in impala::PlanFragmentExecutor::runtime_state (this=0x2b0)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.h:140
      #8  0x00000000019a4877 in impala::Coordinator::runtime_state (this=0x7fa7e9eb9b00)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc:1557
      #9  0x000000000199f596 in impala::Coordinator::GetNext (this=0x7fa7e9eb9b00, results=0x7fa193cef840, max_rows=1024, eos=0x7fa13e9b0f49)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc:1110
      #10 0x00000000014e7bea in impala::ImpalaServer::QueryExecState::FetchRowsInternal (this=0x7fa13e9b0400, max_rows=1024, fetched_rows=0x7fa193cef840)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/query-exec-state.cc:766
      #11 0x00000000014e7067 in impala::ImpalaServer::QueryExecState::FetchRows (this=0x7fa13e9b0400, max_rows=1024, fetched_rows=0x7fa193cef840)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/query-exec-state.cc:672
      #12 0x00000000014c2cf0 in impala::ImpalaServer::FetchInternal (this=0xe0fd000, query_id=..., fetch_size=1024, fetch_first=false, fetch_results=0x2adb7dd0)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/impala-hs2-server.cc:219
      #13 0x00000000014cc691 in impala::ImpalaServer::FetchResults (this=0xe0fd000, return_val=..., request=...)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/impala-hs2-server.cc:775
      #14 0x000000000151a971 in impala::ChildQuery::ExecAndFetch (this=0x2adb7d40)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/child-query.cc:84
      #15 0x000000000151fe15 in impala::ChildQueryExecutor::ExecChildQueries (this=0x7faabce103c0)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/service/child-query.cc:169
      #16 0x00000000015220c7 in boost::_mfi::mf0<void, impala::ChildQueryExecutor>::operator() (this=0x7fabf5964c48, p=0x7faabce103c0)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/mem_fn_template.hpp:49
      #17 0x0000000001521e10 in boost::_bi::list1<boost::_bi::value<impala::ChildQueryExecutor*> >::operator()<boost::_mfi::mf0<void, impala::ChildQueryExecutor>, boost::_bi::list0> (this=0x7fabf5964c58, f=..., a=...)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:253
      #18 0x0000000001521c4b in boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::ChildQueryExecutor>, boost::_bi::list1<boost::_bi::value<impala::ChildQueryExecutor*> > >::operator() (this=0x7fabf5964c48)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #19 0x0000000001521a9a in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::ChildQueryExecutor>, boost::_bi::list1<boost::_bi::value<impala::ChildQueryExecutor*> > >, void>::invoke (function_obj_ptr=...)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:153
      #20 0x000000000133bb90 in boost::function0<void>::operator() (this=0x7fabf5964c40)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/function/function_template.hpp:767
      #21 0x00000000015df1fb in impala::Thread::SuperviseThread (name=..., category=..., functor=..., thread_started=0x7fb421855a50)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/util/thread.cc:317
      #22 0x00000000015e61d4 in boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> >::operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, boost::function<void()>, impala::Promise<long> *), boost::_bi::list0 &, int) (this=0x7fadc069d3c0,
          f=@0x7fadc069d3b8, a=...) at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind.hpp:457
      #23 0x00000000015e6117 in boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > >::operator()(void) (this=0x7fadc069d3b8)
          at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/bind/bind_template.hpp:20
      #24 0x00000000015e6072 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (
          this=0x7fadc069d200) at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/toolchain/boost-1.57.0-p1/include/boost/thread/detail/thread.hpp:116
      #25 0x0000000001a4783a in thread_proxy ()
      #26 0x00000039c68079d1 in start_thread () from /lib64/libpthread.so.0
      #27 0x00000039c64e88fd in clone () from /lib64/libc.so.6
      

        Issue Links

          Activity

          Hide
          jbapple Jim Apple added a comment -

          Why P2 and not P1?

          Show
          jbapple Jim Apple added a comment - Why P2 and not P1?
          Hide
          mikesbrown Michael Brown added a comment -

          Because there were several non-failing runs of the same test at the same commit.

          Show
          mikesbrown Michael Brown added a comment - Because there were several non-failing runs of the same test at the same commit.
          Hide
          jbapple Jim Apple added a comment -

          My feeling is that all crashes are P1s, but I'm actually not sure if that reflects the way previous flaky crash issues are prioritized. I imagine that when you have filed flaky crashes from stress tests in the past, you have filed them at P2 and they have remained at that priority for their entire "Open" lifetime?

          Show
          jbapple Jim Apple added a comment - My feeling is that all crashes are P1s, but I'm actually not sure if that reflects the way previous flaky crash issues are prioritized. I imagine that when you have filed flaky crashes from stress tests in the past, you have filed them at P2 and they have remained at that priority for their entire "Open" lifetime?
          Hide
          mikesbrown Michael Brown added a comment -

          In the past, some stress crashes initially filed as P1s were then found to be flaky and downgraded to P2. I do not have any examples.

          Show
          mikesbrown Michael Brown added a comment - In the past, some stress crashes initially filed as P1s were then found to be flaky and downgraded to P2. I do not have any examples.
          Hide
          mikesbrown Michael Brown added a comment -

          Note: I hit IMPALA-4226 when dealing with the minidump associated with the crash.

          Show
          mikesbrown Michael Brown added a comment - Note: I hit IMPALA-4226 when dealing with the minidump associated with the crash.
          Hide
          mikesbrown Michael Brown added a comment -

          This crash happened again. Dan Hecht, is P2 still a fair assessment? Should this be raised to P1? It's happened 2 of 7 runs so far.

          Show
          mikesbrown Michael Brown added a comment - This crash happened again. Dan Hecht , is P2 still a fair assessment? Should this be raised to P1? It's happened 2 of 7 runs so far.
          Hide
          mikesbrown Michael Brown added a comment -

          Dan Hecht please take a look and re-assign/re-prioritize as necessary. This happened again over night.

          Show
          mikesbrown Michael Brown added a comment - Dan Hecht please take a look and re-assign/re-prioritize as necessary. This happened again over night.
          Hide
          dhecht Dan Hecht added a comment -

          Generally, we want to treat all crashes as P1s, so let's start them there and downgrade only if there is a good reason to do so (e.g. unsupported configuration).

          Show
          dhecht Dan Hecht added a comment - Generally, we want to treat all crashes as P1s, so let's start them there and downgrade only if there is a good reason to do so (e.g. unsupported configuration).
          Hide
          dhecht Dan Hecht added a comment -

          Apparently Coordinator::TearDown() has already been called.

          (gdb) f 9
          #9  0x000000000199f596 in impala::Coordinator::GetNext (this=0x7fa7e9eb9b00, results=0x7fa193cef840, max_rows=1024, eos=0x7fa13e9b0f49) at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc:1110
          1110    in /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc
          (gdb) p query_state_
          $18 = (impala::QueryState *) 0x0
          (gdb) p coord_sink_
          $19 = (impala::PlanRootSink *) 0x0
          (gdb) p torn_down_
          $20 = true
          
          Show
          dhecht Dan Hecht added a comment - Apparently Coordinator::TearDown() has already been called. (gdb) f 9 #9 0x000000000199f596 in impala::Coordinator::GetNext ( this =0x7fa7e9eb9b00, results=0x7fa193cef840, max_rows=1024, eos=0x7fa13e9b0f49) at /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc:1110 1110 in /usr/src/debug/impala-2.7.0-cdh5.11.0-SNAPSHOT/be/src/runtime/coordinator.cc (gdb) p query_state_ $18 = (impala::QueryState *) 0x0 (gdb) p coord_sink_ $19 = (impala::PlanRootSink *) 0x0 (gdb) p torn_down_ $20 = true
          Hide
          dhecht Dan Hecht added a comment -

          I think it's likely a regression from this commit:

          commit 85edc15fefe9ab9b2883b9c9f7451efc7e26ff94
          Author: Tim Armstrong <tarmstrong@cloudera.com>
          Date:   Tue Dec 20 09:02:31 2016 -0800
          

          TearDown() is called from QueryExecState::Done() called from ImpalaServer::UnregisterQuery() called from CloseOperation(). I don't think we have any synchronization to make sure a Fetch RPC isn't still in progress at that point.

          We've probably only seen this with COMPUTE STATS because sane clients wouldn't execute CloseOperation() and FetchResults() RPCs concurrently, but the ChildQuery callers do that.

          Show
          dhecht Dan Hecht added a comment - I think it's likely a regression from this commit: commit 85edc15fefe9ab9b2883b9c9f7451efc7e26ff94 Author: Tim Armstrong <tarmstrong@cloudera.com> Date: Tue Dec 20 09:02:31 2016 -0800 TearDown() is called from QueryExecState::Done() called from ImpalaServer::UnregisterQuery() called from CloseOperation(). I don't think we have any synchronization to make sure a Fetch RPC isn't still in progress at that point. We've probably only seen this with COMPUTE STATS because sane clients wouldn't execute CloseOperation() and FetchResults() RPCs concurrently, but the ChildQuery callers do that.
          Hide
          henryr Henry Robinson added a comment -

          FWIW, I've just seen a similar regression from that commit where filter_mem_tracker_ is referred to in an RPC handler, but gets set to nullptr without synchronisation in Coordinator::TearDown(). That's on my KRPC branch, so not a bug in trunk, but it has exactly the same 'shape' as the issue here.

          Show
          henryr Henry Robinson added a comment - FWIW, I've just seen a similar regression from that commit where filter_mem_tracker_ is referred to in an RPC handler, but gets set to nullptr without synchronisation in Coordinator::TearDown() . That's on my KRPC branch, so not a bug in trunk, but it has exactly the same 'shape' as the issue here.
          Hide
          dhecht Dan Hecht added a comment -

          It also doesn't help that we have this nullptr checks which probably help hide this race.

          RuntimeState* Coordinator::runtime_state() {
            return executor() == NULL ? NULL : executor()->runtime_state();
          }
          
          Show
          dhecht Dan Hecht added a comment - It also doesn't help that we have this nullptr checks which probably help hide this race. RuntimeState* Coordinator::runtime_state() { return executor() == NULL ? NULL : executor()->runtime_state(); }
          Hide
          dhecht Dan Hecht added a comment -

          Henry Robinson, which RPC handler was it in your case?

          Show
          dhecht Dan Hecht added a comment - Henry Robinson , which RPC handler was it in your case?
          Hide
          henryr Henry Robinson added a comment -

          UpdateFilter. But like I say, that on a branch with some heavy changes in that path, I'm not sure the race can happen on trunk (by luck more than judgment).

          Show
          henryr Henry Robinson added a comment - UpdateFilter . But like I say, that on a branch with some heavy changes in that path, I'm not sure the race can happen on trunk (by luck more than judgment).
          Hide
          dhecht Dan Hecht added a comment -

          Yeah, just wanted to check. On master, it does look like that FilterState::Disable() and filter_lock_ would act as enough of a barrier to prevent the bug with filter_mem_tracker_.

          Show
          dhecht Dan Hecht added a comment - Yeah, just wanted to check. On master, it does look like that FilterState::Disable() and filter_lock_ would act as enough of a barrier to prevent the bug with filter_mem_tracker_ .
          Hide
          mikesbrown Michael Brown added a comment -

          Note: the stress test has been running nightly, but this crash has not been reproduced since Feb 11. This bug has been found a total of 4 times since Feb 6.

          Show
          mikesbrown Michael Brown added a comment - Note: the stress test has been running nightly, but this crash has not been reproduced since Feb 11. This bug has been found a total of 4 times since Feb 6.
          Hide
          srus Silvius Rus added a comment -

          Michael Brown, Marcel Kornacker, Dan Hecht, what's the latest here? Do we think this is fixed? Can it affect 2.8 or earlier versions or have we ever seen it only on master after 2.8 was cut?

          Show
          srus Silvius Rus added a comment - Michael Brown , Marcel Kornacker , Dan Hecht , what's the latest here? Do we think this is fixed? Can it affect 2.8 or earlier versions or have we ever seen it only on master after 2.8 was cut?
          Hide
          mikesbrown Michael Brown added a comment -

          Silvius Rus The earliest I've seen this is master (future 2.9), on commit IMPALA-4876: Remove _test suffix from test names.

          Show
          mikesbrown Michael Brown added a comment - Silvius Rus The earliest I've seen this is master (future 2.9), on commit IMPALA-4876 : Remove _test suffix from test names .
          Hide
          srus@cloudera.com Silvius Rus added a comment -

          Michael Brown, have you seen this since? If not, does it make sense to close as Cannot Repro?

          Show
          srus@cloudera.com Silvius Rus added a comment - Michael Brown , have you seen this since? If not, does it make sense to close as Cannot Repro?
          Hide
          dhecht Dan Hecht added a comment -

          Silvius Rus, it shouldn't be closed. It's a real bug, has already been diagnosed, and needs to be fixed for the next release. It's not in 2.8.

          Show
          dhecht Dan Hecht added a comment - Silvius Rus , it shouldn't be closed. It's a real bug, has already been diagnosed, and needs to be fixed for the next release. It's not in 2.8.
          Hide
          mikesbrown Michael Brown added a comment -

          What Dan Hecht said.

          Show
          mikesbrown Michael Brown added a comment - What Dan Hecht said.
          Hide
          srus@cloudera.com Silvius Rus added a comment -

          Marcel Kornacker, are you planning to work on this for 2.9?

          Show
          srus@cloudera.com Silvius Rus added a comment - Marcel Kornacker , are you planning to work on this for 2.9?
          Hide
          marcelk Marcel Kornacker added a comment -

          I am.

          Show
          marcelk Marcel Kornacker added a comment - I am.
          Hide
          joemcdonnell Joe McDonnell added a comment -

          A variant of this happened in TestCancellationSerial.test_cancel_insert with query "compute stats lineitem" on asf master core. Verified that the core shows the state that Dan described above (query_state_ = null, coord_sink_ = null, torn_down_ = true). I uploaded the core to impala-desktop under the IMPALA-4890/impala_asf_master_core_1884 directory.

          Show
          joemcdonnell Joe McDonnell added a comment - A variant of this happened in TestCancellationSerial.test_cancel_insert with query "compute stats lineitem" on asf master core. Verified that the core shows the state that Dan described above (query_state_ = null, coord_sink_ = null, torn_down_ = true). I uploaded the core to impala-desktop under the IMPALA-4890 /impala_asf_master_core_1884 directory.
          Hide
          mikesbrown Michael Brown added a comment -

          I can say the stress test hasn't hit this since approximately the end of March.

          Show
          mikesbrown Michael Brown added a comment - I can say the stress test hasn't hit this since approximately the end of March.
          Hide
          mikesbrown Michael Brown added a comment -

          Got a new instance of this via the stress test. I'll report whether I can get it to reproduce more readily.

          Show
          mikesbrown Michael Brown added a comment - Got a new instance of this via the stress test. I'll report whether I can get it to reproduce more readily.
          Hide
          marcelk Marcel Kornacker added a comment -

          commit fcc3b9eded82953c04fd7510d724a2c9b7ff59a3
          Author: Marcel Kornacker <marcel@cloudera.com>
          Date: Sun May 14 21:32:41 2017 -0700

          IMPALA-4890/5143: Coordinator race involving TearDown()

          TearDown() releases resources and destroys control
          structures (the QueryState reference), and it can be called
          while a concurrent thread executes Exec() or might call
          GetNext() in the future. The solution is not to destroy
          the control structures.

          This also releases resources automatically at the end
          of query execution.

          Change-Id: I457a6424a0255c137336c4bc01a6e7ed830d18c7

          Show
          marcelk Marcel Kornacker added a comment - commit fcc3b9eded82953c04fd7510d724a2c9b7ff59a3 Author: Marcel Kornacker <marcel@cloudera.com> Date: Sun May 14 21:32:41 2017 -0700 IMPALA-4890 /5143: Coordinator race involving TearDown() TearDown() releases resources and destroys control structures (the QueryState reference), and it can be called while a concurrent thread executes Exec() or might call GetNext() in the future. The solution is not to destroy the control structures. This also releases resources automatically at the end of query execution. Change-Id: I457a6424a0255c137336c4bc01a6e7ed830d18c7

            People

            • Assignee:
              marcelk Marcel Kornacker
              Reporter:
              mikesbrown Michael Brown
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development