Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4929

Unsafe concurrent access to LlvmCodeGen::fn_refs_map_ can lead to unresolved symbols in LLVM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: Impala 2.8.0
    • Fix Version/s: Impala 2.8.0
    • Component/s: Backend
    • Labels:
    • Environment:

      Description

      (gdb) bt
      #0 0x0000003aa7c32635 in raise () from /lib64/libc.so.6
      #1 0x0000003aa7c33e15 in abort () from /lib64/libc.so.6
      #2 0x00007fb2c22fa8f4 in google::DumpStackTraceAndExit () at src/utilities.cc:147
      #3 0x00007fb2c22f33fd in google::LogMessage::Fail () at src/logging.cc:1315
      #4 0x00007fb2c22f514d in google::LogMessage::SendToLog (this=<optimized out>) at src/logging.cc:1269
      #5 0x00007fb2c22f2f0a in google::LogMessage::Flush (this=this@entry=0x7fad53e7c6a0) at src/logging.cc:1138
      #6 0x00007fb2c22f5c0e in google::LogMessageFatal::~LogMessageFatal (this=0x7fad53e7c6a0, __in_chrg=<optimized out>) at src/logging.cc:1836
      #7 0x0000000000e7bbf8 in impala::LlvmCodegenHandleError (user_data=<optimized out>, reason=..., gen_crash_diag=<optimized out>) at /export/ldb/online/impala_gitlab/be/src/codegen/llvm-codegen.cc:109
      #8 0x0000000001d6b983 in llvm::report_fatal_error(llvm::Twine const&, bool) ()
      #9 0x0000000001afc0ef in llvm::RuntimeDyldImpl::resolveExternalSymbols() ()
      #10 0x0000000001afc114 in llvm::RuntimeDyldImpl::resolveRelocations() ()
      #11 0x00000000019a1b2f in llvm::MCJIT::finalizeLoadedModules() ()
      #12 0x00000000019a1eb1 in llvm::MCJIT::finalizeObject() ()
      #13 0x0000000000e83e65 in impala::LlvmCodeGen::FinalizeModule (this=0x7faa32e8cd00) at /export/ldb/online/impala_gitlab/be/src/codegen/llvm-codegen.cc:931
      #14 0x0000000001076daf in impala::PlanFragmentExecutor::OptimizeLlvmModule (this=this@entry=0x2b3465b8) at /export/ldb/online/impala_gitlab/be/src/runtime/plan-fragment-executor.cc:268
      #15 0x000000000107823a in impala::PlanFragmentExecutor::OpenInternal (this=this@entry=0x2b3465b8) at /export/ldb/online/impala_gitlab/be/src/runtime/plan-fragment-executor.cc:320
      #16 0x0000000001079a88 in impala::PlanFragmentExecutor::Open (this=this@entry=0x2b3465b8) at /export/ldb/online/impala_gitlab/be/src/runtime/plan-fragment-executor.cc:296
      #17 0x0000000001073330 in impala::FragmentInstanceState::Exec (this=this@entry=0x2b346300) at /export/ldb/online/impala_gitlab/be/src/runtime/fragment-instance-state.cc:65
      #18 0x000000000107be1f in impala::QueryExecMgr::ExecFInstance (this=0x930f660, fis=0x2b346300) at /export/ldb/online/impala_gitlab/be/src/runtime/query-exec-mgr.cc:109
      #19 0x0000000000e25e20 in boost::function0<void>::operator() (this=0x7fad53e7da60) at /usr/local/include/boost/function/function_template.hpp:767
      #20 impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*) (name=..., category=..., functor=..., thread_started=0x7fb26cd33f20)
      at /export/ldb/online/impala_gitlab/be/src/util/thread.cc:317
      #21 0x0000000000e26784 in boost::_bi::list4<boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, boost::_bi::value<boost::function<void()> >, boost::_bi::value<impala::Promise<long int>> >::operator()<void (const std::basic_string<char>&, const std::basic_string<char>&, boost::function<void()>, impala::Promise<long int>), boost::_bi::list0> (
      f=@0x7fa1162dd1b8: 0xe25c00 <impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>*)>, a=<synthetic pointer>, this=0x7fa1162dd1c0)
      at /usr/local/include/boost/bind/bind.hpp:457
      #22 boost::_bi::bind_t<void, void (std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>> > >::operator()() (this=0x7fa1162dd1b8) at /usr/local/include/boost/bind/bind_template.hpp:20
      #23 boost::detail::thread_data<boost::_bi::bind_t<void, void (std::string const&, std::string const&, boost::function<void ()>, impala::Promise<long>), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()> >, boost::_bi::value<impala::Promise<long>> > > >::run() (this=0x7fa1162dd000) at /usr/local/include/boost/thread/detail/thread.hpp:116
      #24 0x00007fb2c516b0f5 in thread_proxy () from /export/ldb/servers/impala-snapshot/lib_C/libboost_thread.so.1.57.0
      #25 0x0000003aa84079d1 in start_thread () from /lib64/libpthread.so.0
      #26 0x0000003aa7ce886d in clone () from /lib64/libc.so.6
      (gdb)

        Activity

        Hide
        fish0515_impala_49b1 fishing added a comment -

        F0214 16:50:55.112761 18052 llvm-codegen.cc:109] LLVM hit fatal error: Program used external function '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' which could not be resolved!

        Show
        fish0515_impala_49b1 fishing added a comment - F0214 16:50:55.112761 18052 llvm-codegen.cc:109] LLVM hit fatal error: Program used external function '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' which could not be resolved!
        Hide
        kwho Michael Ho added a comment -

        I was about to say this could be IMPALA-4705 but it seems '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' shouldn't be affected. Would you mind sharing the query which triggered the problem ? What version of Impala did you hit this one on ?

        Show
        kwho Michael Ho added a comment - I was about to say this could be IMPALA-4705 but it seems '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' shouldn't be affected. Would you mind sharing the query which triggered the problem ? What version of Impala did you hit this one on ?
        Hide
        kwho Michael Ho added a comment -

        Nevermind about the Impala version. It's at https://github.com/apache/incubator-impala/commit/6a9df540967e07b09524268d0cc52b7d10835676 according to your previous updates.

        Show
        kwho Michael Ho added a comment - Nevermind about the Impala version. It's at https://github.com/apache/incubator-impala/commit/6a9df540967e07b09524268d0cc52b7d10835676 according to your previous updates.
        Hide
        kwho Michael Ho added a comment -

        I tried a couple of queries which call the function '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' but couldn't quite get the problem to reproduce:

        [localhost:21000] > select count(*) from tpch_parquet.lineitem where instr(l_shipmode, 'XXX') = 1;
        Query: select count(*) from tpch_parquet.lineitem where instr(l_shipmode, 'XXX') = 1
        +----------+
        | count(*) |
        +----------+
        | 0        |
        +----------+
        Fetched 1 row(s) in 0.22s
        
        [localhost:21000] > select count(distinct l_partkey) from tpch_parquet.lineitem group by l_shipmode;
        Query: select count(distinct l_partkey) from tpch_parquet.lineitem group by l_shipmode
        +---------------------------+
        | count(distinct l_partkey) |
        +---------------------------+
        | 197338                    |
        | 197327                    |
        | 197337                    |
        | 197267                    |
        | 197241                    |
        | 197228                    |
        | 197228                    |
        +---------------------------+
        
        [localhost:21000] > select max(l_shipmode) from tpch_parquet.lineitem;
        Query: select max(l_shipmode) from tpch_parquet.lineitem
        +-----------------+
        | max(l_shipmode) |
        +-----------------+
        | TRUCK           |
        +-----------------+
        
        Show
        kwho Michael Ho added a comment - I tried a couple of queries which call the function '_ZN6impala11StringValue13FromStringValERKN10impala_udf9StringValE' but couldn't quite get the problem to reproduce: [localhost:21000] > select count(*) from tpch_parquet.lineitem where instr(l_shipmode, 'XXX') = 1; Query: select count(*) from tpch_parquet.lineitem where instr(l_shipmode, 'XXX') = 1 +----------+ | count(*) | +----------+ | 0 | +----------+ Fetched 1 row(s) in 0.22s [localhost:21000] > select count(distinct l_partkey) from tpch_parquet.lineitem group by l_shipmode; Query: select count(distinct l_partkey) from tpch_parquet.lineitem group by l_shipmode +---------------------------+ | count(distinct l_partkey) | +---------------------------+ | 197338 | | 197327 | | 197337 | | 197267 | | 197241 | | 197228 | | 197228 | +---------------------------+ [localhost:21000] > select max(l_shipmode) from tpch_parquet.lineitem; Query: select max(l_shipmode) from tpch_parquet.lineitem +-----------------+ | max(l_shipmode) | +-----------------+ | TRUCK | +-----------------+
        Hide
        fish0515_impala_49b1 fishing added a comment -

        use error_query.sql could not reproduce

        Show
        fish0515_impala_49b1 fishing added a comment - use error_query.sql could not reproduce
        Hide
        kwho Michael Ho added a comment -

        I took a look at that query and the UDF invoked in it and tried emulating the codegen path exercised with the following query but so far no repro.

        [localhost:21000] > select count(distinct concat(substr(l_comment,8))) from tpch_parquet.lineitem, functional_parquet.alltypes where instr(l_shipmode, 'a') = datediff(timestamp_col, timestamp_col) group by l_shipmode;
        Query: select count(distinct concat(substr(l_comment,8))) from tpch_parquet.lineitem, functional_parquet.alltypes where instr(l_shipmode, 'a') = datediff(timestamp_col, timestamp_col) group by l_shipmode
        

        Is there any chance you can trigger the problem with a simpler query ? It's hard to make forward progress with this complicated query especially when I don't have the schema available.
        One thing I can consider trying if you provide the schema is that I can substitute the tables in the query with similar tables.

        Show
        kwho Michael Ho added a comment - I took a look at that query and the UDF invoked in it and tried emulating the codegen path exercised with the following query but so far no repro. [localhost:21000] > select count(distinct concat(substr(l_comment,8))) from tpch_parquet.lineitem, functional_parquet.alltypes where instr(l_shipmode, 'a') = datediff(timestamp_col, timestamp_col) group by l_shipmode; Query: select count(distinct concat(substr(l_comment,8))) from tpch_parquet.lineitem, functional_parquet.alltypes where instr(l_shipmode, 'a') = datediff(timestamp_col, timestamp_col) group by l_shipmode Is there any chance you can trigger the problem with a simpler query ? It's hard to make forward progress with this complicated query especially when I don't have the schema available. One thing I can consider trying if you provide the schema is that I can substitute the tables in the query with similar tables.
        Hide
        kwho Michael Ho added a comment -

        Btw, this kind of bug should reproduce quite consistently (as in IMPALA-4705) with the same query. Did something change since it last happened or it was triggered by another query ?

        Show
        kwho Michael Ho added a comment - Btw, this kind of bug should reproduce quite consistently (as in IMPALA-4705 ) with the same query. Did something change since it last happened or it was triggered by another query ?
        Hide
        fish0515_impala_49b1 fishing added a comment -

        search failed instance thread id 18052 and execute time . then found instance_id , then can confirm that the failure query is error_query.sql . but now use this query cound not repro. so I am confused

        Show
        fish0515_impala_49b1 fishing added a comment - search failed instance thread id 18052 and execute time . then found instance_id , then can confirm that the failure query is error_query.sql . but now use this query cound not repro. so I am confused
        Hide
        kwho Michael Ho added a comment -

        Please reopen if you can reproduce this problem again. We'd like to find out what's going on.

        Show
        kwho Michael Ho added a comment - Please reopen if you can reproduce this problem again. We'd like to find out what's going on.
        Hide
        fish0515_impala_49b1 fishing added a comment -

        OK

        Show
        fish0515_impala_49b1 fishing added a comment - OK
        Hide
        kwho Michael Ho added a comment -

        The root cause of the problem is that LlvmCodeGen::fn_refs_map_ which is supposed to be read only may be modified by LlvmCodeGen::MaterializeFunctionHelper() as new entries are inadvertently created due to the use of [ ] operator. Multiple plan fragments accessing the fn_refs_map_ may get wrong or missing callee functions as a result of this unsafe modification to fn_refs_map_.

        Show
        kwho Michael Ho added a comment - The root cause of the problem is that LlvmCodeGen::fn_refs_map_ which is supposed to be read only may be modified by LlvmCodeGen::MaterializeFunctionHelper() as new entries are inadvertently created due to the use of [ ] operator. Multiple plan fragments accessing the fn_refs_map_ may get wrong or missing callee functions as a result of this unsafe modification to fn_refs_map_.
        Hide
        fish0515_impala_49b1 fishing added a comment -

        Michael Ho almost rarely appear, the symptom like you said unsafe map cause

        Show
        fish0515_impala_49b1 fishing added a comment - Michael Ho almost rarely appear, the symptom like you said unsafe map cause
        Hide
        kwho Michael Ho added a comment -

        https://github.com/apache/incubator-impala/commit/5f2fcd9f6ae97c52f395f509b7b221a3b27d2c45

        MPALA-4929: Safe concurrent access to IR function call graph
        Previously, LlvmCodeGen creates a map (fn_refs_map_) which
        maps an IR function to its set of callees. That map was initialized
        once at Impalad's start time and it is supposed to be read-only
        afterwards.

        However, the map was unintentionally modified by its user
        LlvmCodeGen::MaterializeFunctionHelper() due to the use of
        operator []. In particular, that operator may create a new
        entry in the map if it doesn't exist already. This is possible
        because the map initially only contains entries for functions
        with at least one callee function. Using the operator [] with
        functions with no callee function may modify the map. Since the
        map is shared by all fragment instances, such unsafe concurrent
        modification can cause missing or wrong callee functions to be
        materialized, leading to failure to resolve symbols in LLVM.

        This change fixes the problem above by:
        1. Create an entry for all functions even if it has no callee.
        In fact, LlvmCodeGen::LinkModule() assumes all functions
        defined in main module will have entries in the map.

        2. Introduce a new class CodegenCallGraph which encapsulates
        the map described above. The map was established once at
        initialization time and there is no interface to modify the
        map afterwards, thus preventing any accidental modification
        to the map at run time.

        Change-Id: I1acd6bad80341121c8189d817e0fe62f2862f28a
        Reviewed-on: http://gerrit.cloudera.org:8080/6326
        Reviewed-by: Michael Ho <kwho@cloudera.com>
        Tested-by: Impala Public Jenkins

        Show
        kwho Michael Ho added a comment - https://github.com/apache/incubator-impala/commit/5f2fcd9f6ae97c52f395f509b7b221a3b27d2c45 MPALA-4929: Safe concurrent access to IR function call graph Previously, LlvmCodeGen creates a map (fn_refs_map_) which maps an IR function to its set of callees. That map was initialized once at Impalad's start time and it is supposed to be read-only afterwards. However, the map was unintentionally modified by its user LlvmCodeGen::MaterializeFunctionHelper() due to the use of operator []. In particular, that operator may create a new entry in the map if it doesn't exist already. This is possible because the map initially only contains entries for functions with at least one callee function. Using the operator [] with functions with no callee function may modify the map. Since the map is shared by all fragment instances, such unsafe concurrent modification can cause missing or wrong callee functions to be materialized, leading to failure to resolve symbols in LLVM. This change fixes the problem above by: 1. Create an entry for all functions even if it has no callee. In fact, LlvmCodeGen::LinkModule() assumes all functions defined in main module will have entries in the map. 2. Introduce a new class CodegenCallGraph which encapsulates the map described above. The map was established once at initialization time and there is no interface to modify the map afterwards, thus preventing any accidental modification to the map at run time. Change-Id: I1acd6bad80341121c8189d817e0fe62f2862f28a Reviewed-on: http://gerrit.cloudera.org:8080/6326 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins

          People

          • Assignee:
            kwho Michael Ho
            Reporter:
            fish0515_impala_49b1 fishing
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development