Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4504

fragment-exec-state.cc:70] Check failed: status.ok() || done

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:

      Description

      The stress test caused this DCHECK to fail, roughly 75% of the way into the test (~7500 queries run out of 10,000).

      #0  0x000000323f632625 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
      #1  0x000000323f633e05 in abort () at abort.c:92
      #2  0x000000000280f344 in google::DumpStackTraceAndExit() ()
      #3  0x00000000028087ad in google::LogMessage::Fail() ()
      #4  0x000000000280b0d6 in google::LogMessage::SendToLog() ()
      #5  0x00000000028082cd in google::LogMessage::Flush() ()
      #6  0x000000000280bb7e in google::LogMessageFatal::~LogMessageFatal() ()
      #7  0x000000000152ba96 in impala::FragmentMgr::FragmentExecState::ReportStatusCb (this=0x7fc0bd01cd00, status=..., profile=0x7fb67e42a5f8, done=false)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/be/src/service/fragment-exec-state.cc:70
      #8  0x000000000152689b in boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>::operator() (this=0x7fc0bd01d0c0, p=0x7fc0bd01cd00, a1=..., a2=0x7fb67e42a5f8, a3=false)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:393
      #9  0x000000000152605e in boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> >::operator()<boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list3<impala::Status const&, impala::RuntimeProfile*&, bool&> > (this=0x7fc0bd01d0d0, f=..., a=...)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457
      #10 0x00000000015259f1 in boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >::operator()<impala::Status const, impala::RuntimeProfile*, bool> (this=0x7fc0bd01d0c0, a1=..., a2=@0x7fbe6d7a74d8, a3=@0x7fbe6d7a74d4)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:116
      #11 0x00000000015254ff in boost::detail::function::void_function_obj_invoker3<boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >, void, impala::Status const&, impala::RuntimeProfile*, bool>::invoke (function_obj_ptr=..., a0=...,
          a1=0x7fb67e42a5f8, a2=false)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:153
      #12 0x00000000019d60d4 in boost::function3<void, impala::Status const&, impala::RuntimeProfile*, bool>::operator() (this=0x7fc0bd01d0b8, a0=...,
          a1=0x7fb67e42a5f8, a2=false)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:767
      #13 0x00000000019d471c in impala::PlanFragmentExecutor::SendReport (this=0x7fc0bd01d090, done=false)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:449
      #14 0x00000000019d4356 in impala::PlanFragmentExecutor::ReportProfile (this=0x7fc0bd01d090)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/be/src/runtime/plan-fragment-executor.cc:419
      #15 0x00000000019d73ef in boost::_mfi::mf0<void, impala::PlanFragmentExecutor>::operator() (this=0x7fbe6d7a7c48, p=0x7fc0bd01d090)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/mem_fn_template.hpp:49
      #16 0x00000000019d711a in boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> >::operator()<boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list0> (this=0x7fbe6d7a7c58, f=..., a=...)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:253
      #17 0x00000000019d6d61 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >::operator() (this=0x7fbe6d7a7c48)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #18 0x00000000019d6b44 in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >, void>::invoke (function_obj_ptr=...)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:153
      #19 0x000000000133b10c in boost::function0<void>::operator() (this=0x7fbe6d7a7c40)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/function/function_template.hpp:767
      #20 0x00000000015e6561 in impala::Thread::SuperviseThread (name=..., category=..., functor=..., thread_started=0x7fbf299d5b30)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/be/src/util/thread.cc:317
      #21 0x00000000015ed53a in boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> >::operator()<void (*)(const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, const std::basic_string<char, std::char_traits<char>, std::allocator<char> > &, boost::function<void()>, impala::Promise<long> *), boost::_bi::list0 &, int) (
          this=0x7fbbb6d3d1c0, f=@0x7fbbb6d3d1b8, a=...)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind.hpp:457
      #22 0x00000000015ed47d in boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > >::operator()(void) (this=0x7fbbb6d3d1b8)
          at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/bind/bind_template.hpp:20
      #23 0x00000000015ed3d8 in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, const std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, boost::function<void()>, impala::Promise<long int>*), boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>*> > > >::run(void) (
          this=0x7fbbb6d3d000) at /usr/src/debug/impala-2.7.0-cdh5.10.0-SNAPSHOT/toolchain/boost-1.57.0/include/boost/thread/detail/thread.hpp:116
      #24 0x0000000001a399da in thread_proxy ()
      #25 0x000000323fa079d1 in start_thread (arg=0x7fbe6d7a8700) at pthread_create.c:301
      #26 0x000000323f6e88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
      

      The Impala under test was at this commit:

      Run MT_DOP tests on all file formats.

        Activity

        Hide
        dhecht Dan Hecht added a comment -

        Looks like the code around setting completed_report_sent_ is racy:

        (thread 1) Executing in ReportStatusCb() sees completed_report_sent_ == 0, calls SendReport().
        (thread 2) Executing in UpdateStatus() succedes doing completed_report_sent_.CompareAndSwap(0,1)
        (thread 2) Sets status_ to a non-OK status.
        (thread 1) Loads status_.
        (thread 1) Tries completed_report_sent_.CompareAndSwap(0,1) which fails since already 1 so done==false.
        (thread 1) Calls ReportStatusCb() with !done && !status.ok(), hitting this DCHECK.
        (thread 2) Sends the "done" report.

        Show
        dhecht Dan Hecht added a comment - Looks like the code around setting completed_report_sent_ is racy: (thread 1) Executing in ReportStatusCb() sees completed_report_sent_ == 0, calls SendReport(). (thread 2) Executing in UpdateStatus() succedes doing completed_report_sent_.CompareAndSwap(0,1) (thread 2) Sets status_ to a non-OK status. (thread 1) Loads status_. (thread 1) Tries completed_report_sent_.CompareAndSwap(0,1) which fails since already 1 so done==false. (thread 1) Calls ReportStatusCb() with !done && !status.ok(), hitting this DCHECK. (thread 2) Sends the "done" report.
        Hide
        henryr Henry Robinson added a comment -

        Without going too deep into the code yet, if the PFE always relies on the report thread to send the completed report there'd be no need for this complex coordination between execution and reporting threads.

        Show
        henryr Henry Robinson added a comment - Without going too deep into the code yet, if the PFE always relies on the report thread to send the completed report there'd be no need for this complex coordination between execution and reporting threads.
        Hide
        dhecht Dan Hecht added a comment -

        Hi Michael Brown, has this showed up again? I'm just looking to understand how reproducible it.

        Show
        dhecht Dan Hecht added a comment - Hi Michael Brown , has this showed up again? I'm just looking to understand how reproducible it.
        Hide
        mikesbrown Michael Brown added a comment -

        Dan Hecht It has not shown up again: There have been 2 successful and 0 unsuccessful runs since the run that found this bug.

        Show
        mikesbrown Michael Brown added a comment - Dan Hecht It has not shown up again: There have been 2 successful and 0 unsuccessful runs since the run that found this bug.
        Hide
        henryr Henry Robinson added a comment -

        Agree with Dan Hecht who said offline this was likely caused by this line from IMPALA-2905:

        +  // If status is not OK, we need to make sure that only one sender sends a 'done'
         +  // response.
         +  // TODO: Clean all this up - move 'done' reporting to Close()?
         +  if (!done && !status.ok()) {
         +    done = completed_report_sent_.CompareAndSwap(0, 1);
         +  }
        

        https://github.com/apache/incubator-impala/commit/9f61397fc4d638aa78b37db2cd5b9c35b6deed94?diff=unified#diff-3290504116de07f4689505d5846b144eR420

        I think this line is trying to prevent two done reports being sent to the coordinator. Unfortunately I can't remember the exact bug I was trying to avoid (memory tells me that it was test_failpoints or test_cancellation, but I haven't managed to recreate the error with the old patchset). In any case, if this path is taken and done is not set to true, we'll hit the DCHECK in ReportStatusCb; there's clearly a bug.

        I do think the offending lines are trying to avoid a legitimate problem. It's something like:

        (status == OK)
        (thread 1) ReportProfile() calls SendReport(false)
        (thread 2) Calls UpdateStatus(status != ok) (perhaps from Exec())
        (thread 1) Reads status != ok, so calls report_status_cb_(..., ..., true)
        (thread 2) Calls SendReport(true) from UpdateStatus() and again hits report_status_cb_(..., ..., true)

        I think this leads to 2 * reports with done == true. The problem is that not all paths try to acquire the completed_report_sent_ flag. The spot fix here is probably not to send the report if you can't get completed_report_sent_ (because someone else will send the final report). But I think we agree the logic here needs a simplification.

        Show
        henryr Henry Robinson added a comment - Agree with Dan Hecht who said offline this was likely caused by this line from IMPALA-2905 : + // If status is not OK, we need to make sure that only one sender sends a 'done' + // response. + // TODO: Clean all this up - move 'done' reporting to Close()? + if (!done && !status.ok()) { + done = completed_report_sent_.CompareAndSwap(0, 1); + } https://github.com/apache/incubator-impala/commit/9f61397fc4d638aa78b37db2cd5b9c35b6deed94?diff=unified#diff-3290504116de07f4689505d5846b144eR420 I think this line is trying to prevent two done reports being sent to the coordinator. Unfortunately I can't remember the exact bug I was trying to avoid (memory tells me that it was test_failpoints or test_cancellation , but I haven't managed to recreate the error with the old patchset). In any case, if this path is taken and done is not set to true, we'll hit the DCHECK in ReportStatusCb ; there's clearly a bug. I do think the offending lines are trying to avoid a legitimate problem. It's something like: (status == OK) (thread 1) ReportProfile() calls SendReport(false) (thread 2) Calls UpdateStatus(status != ok) (perhaps from Exec()) (thread 1) Reads status != ok, so calls report_status_cb_(..., ..., true) (thread 2) Calls SendReport(true) from UpdateStatus() and again hits report_status_cb_(..., ..., true) I think this leads to 2 * reports with done == true . The problem is that not all paths try to acquire the completed_report_sent_ flag. The spot fix here is probably not to send the report if you can't get completed_report_sent_ (because someone else will send the final report). But I think we agree the logic here needs a simplification.
        Hide
        dhecht Dan Hecht added a comment -

        commit 56f4d0f592f129054985786287ad8ed8c241fc74
        Author: Dan Hecht <dhecht@cloudera.com>
        Date: Mon Nov 28 15:38:12 2016 -0800

        IMPALA-4504: fix races in PlanFragmentExecutor regarding status reporting

        The PlanFragmentExecutor has some state shared between the main
        execution thread and the periodic reporting thread that isn't
        synchronized properly. IMPALA-4504 describes one such problem, and that
        bug was introduced in an attempt to fix another similar race.

        Let's just simplify all of this and remove this shared state. Instead,
        the profile thread will always be responsible for sending periodic
        '!done' messages, and the main execution thread will always be
        responsible for sending the final 'done' message (after joining the
        periodic thread).

        This will allow for even more simplification, in particular the
        interaction between FragementExecState and PlanFragementExecutor, but
        I'm not doing that now as to avoid more conflicts with the MT work.

        Change-Id: I052b7b4fabb341ad27ad294cd5b0a53728d87d0e
        Reviewed-on: http://gerrit.cloudera.org:8080/5250
        Reviewed-by: Dan Hecht <dhecht@cloudera.com>
        Tested-by: Internal Jenkins

        Show
        dhecht Dan Hecht added a comment - commit 56f4d0f592f129054985786287ad8ed8c241fc74 Author: Dan Hecht <dhecht@cloudera.com> Date: Mon Nov 28 15:38:12 2016 -0800 IMPALA-4504 : fix races in PlanFragmentExecutor regarding status reporting The PlanFragmentExecutor has some state shared between the main execution thread and the periodic reporting thread that isn't synchronized properly. IMPALA-4504 describes one such problem, and that bug was introduced in an attempt to fix another similar race. Let's just simplify all of this and remove this shared state. Instead, the profile thread will always be responsible for sending periodic '!done' messages, and the main execution thread will always be responsible for sending the final 'done' message (after joining the periodic thread). This will allow for even more simplification, in particular the interaction between FragementExecState and PlanFragementExecutor, but I'm not doing that now as to avoid more conflicts with the MT work. Change-Id: I052b7b4fabb341ad27ad294cd5b0a53728d87d0e Reviewed-on: http://gerrit.cloudera.org:8080/5250 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins

          People

          • Assignee:
            dhecht Dan Hecht
            Reporter:
            mikesbrown Michael Brown
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development