IMPALA-477

Unexpected LLVM Crash When Querying Doubles on CentOS 5.x

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 1.0.1
    • Fix Version/s: Impala 1.1.1
    • Component/s: None
    • Labels:
    • Environment:
      * CentOS 5.x
      * Newer hardware (dmidecode attached)

      Description

      We have a very strange LLVM issue on our cluster that has been preventing us from using codegen for quite some time.

      Symptoms

      The symptoms are simple: execute a query that uses runtime code generation on a DOUBLE column, and the query not only fails but also crashes every daemon that took part in it. The following behavior is observed when issuing the query from the shell:

      Query: <some offending codegen query>
      Query finished, fetching results ...
      Error communicating with impalad: TSocket read 0 bytes
      

      From an operations point of view, these nodes die with no warning or error. With GLOG set to 2, the last message logged was that LLVM was going to be used.

      The Queries

      As mentioned in the title, the crashes seem to only be reproducible with doubles.

      Here is an example query with LLVM on and off, calculating the AVG of a column.

      The Table

      +-----------------+--------+---------+
      | name            | type   | comment |
      +-----------------+--------+---------+
      | o_orderkey      | int    |         |
      | o_custkey       | int    |         |
      | o_orderstatus   | string |         |
      | o_totalprice    | double |         |
      | o_orderdate     | string |         |
      | o_orderpriority | string |         |
      | o_clerk         | string |         |
      | o_shippriority  | int    |         |
      | o_comment       | string |         |
      +-----------------+--------+---------+
      

      LLVM ON

      [impala-node:21000] > SELECT AVG(o_totalprice) FROM orders;
      Query: select AVG(o_totalprice) FROM orders
      Unknown Exception : (104, 'Connection reset by peer')
      Query aborted, unable to fetch data
      

      LLVM OFF

      [impala-node:21000] > SET DISABLE_CODEGEN=true;
      DISABLE_CODEGEN set to true
      [impala-node:21000] > SELECT AVG(o_totalprice) FROM orders;
      Query: select AVG(o_totalprice) FROM orders
      Query finished, fetching results ...
      +-------------------+
      | avg(o_totalprice) |
      +-------------------+
      | 141826.4553346672 |
      +-------------------+
      Returned 1 row(s) in 1.15s
      

      With GLOG = 2, here is the last line observed before the daemon crashes unexpectedly.

      I0619 15:35:03.984278   308 hdfs-scanner.cc:105] HdfsTextScanner(node_id=0) using llvm codegend functions.
      EOF
      

      Doubles Only Reasoning

      The issue seems to affect only doubles. AVG, MAX, etc. all work with INTs, but I think the following pair of queries shows it best.

      LLVM ON With INTS

      [impala-node:21000] > SELECT DISTINCT(o_custkey = 1) FROM orders;
      Query: select DISTINCT(o_custkey = 1) FROM orders
      Query finished, fetching results ...
      +---------------+
      | o_custkey = 1 |
      +---------------+
      | true          |
      | false         |
      +---------------+
      Returned 2 row(s) in 0.74s
      

      LLVM ON With INTS & DOUBLES
      Same query, with the literal 1 changed to 1.0

      [impala-node:21000] > SELECT DISTINCT(o_custkey = 1.0) FROM orders;
      Query: select DISTINCT(o_custkey = 1.0) FROM orders
      Query finished, fetching results ...
      Error communicating with impalad: TSocket read 0 bytes
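
When repeating this comparison across many queries or nodes, it may help to classify captured shell transcripts mechanically. The following is a minimal sketch, not part of Impala: `classify_transcript` and `summarize` are hypothetical helper names, and the only assumption is that the shell prints the same error strings and "Returned N row(s)" footer shown in the transcripts in this report.

```python
# Error strings copied verbatim from the impala-shell transcripts in this report.
CRASH_MARKERS = (
    "TSocket read 0 bytes",
    "Connection reset by peer",
)


def classify_transcript(shell_output):
    """Classify one captured impala-shell transcript (hypothetical helper).

    Returns "crash" when the connection to impalad dropped, "ok" when a
    result set was returned, and "unknown" otherwise.
    """
    if any(marker in shell_output for marker in CRASH_MARKERS):
        return "crash"
    if "Returned" in shell_output and "row(s)" in shell_output:
        return "ok"
    return "unknown"


def summarize(transcripts):
    """Map each case label to the classification of its transcript."""
    return {label: classify_transcript(text) for label, text in transcripts.items()}
```

Feeding it one transcript per case (for example "int, codegen on" and "double, codegen on") gives a small table of which combinations bring the daemon down.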
      

      Configuration Testing

      This bug seems to occur only on this specific hardware running CentOS 5.x. I reformatted this node to CentOS 6.x and the issue effectively disappeared, so it seems to affect only this specific hardware on this family of Red Hat.

      In addition, I attempted to reproduce this on CentOS 5.x using systest, virtual machines, etc., and was unsuccessful.
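
Because the crash appears to be tied to the hardware, one way to narrow it down is to diff the dmidecode dumps from a working and a non-working node. The sketch below uses Python's standard `difflib`; `dmidecode_diff` is a hypothetical helper name, and it assumes the dumps have already been read into strings (for example from the attached dmidecode_cott-working.txt and dmidecode_cott-not-working.txt).

```python
import difflib


def dmidecode_diff(working_text, failing_text):
    """Return only the lines that differ between two dmidecode dumps.

    Lines removed relative to the working node are prefixed with "- ",
    lines added on the failing node with "+ ". Indentation and blank
    lines are ignored so that only content differences are reported.
    """
    working = [line.strip() for line in working_text.splitlines() if line.strip()]
    failing = [line.strip() for line in failing_text.splitlines() if line.strip()]
    return [
        line.rstrip("\n")
        for line in difflib.ndiff(working, failing)
        # ndiff also emits "  " (unchanged) and "? " (hint) lines; skip them.
        if line.startswith(("- ", "+ "))
    ]
```

Serial numbers and UUIDs will differ between any two machines, so the interesting lines are more likely to be processor version, CPU flags, and BIOS revision.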

      Logs

      Attached is a GLOG 2 log of an LLVM query failing. I tried to analyze a core dump, but since there is JIT-compiled code in there, GDB is unable to look into it and reports an Unsupported JIT Version error.

      With --module_output=/tmp/module.out added, module output is written for ALL LLVM queries except the ones that crash. I'm assuming those queries crash before reaching the point where module output is written.

        Attachments

        1. dmidecode_cott-not-working.txt
          38 kB
          Ricky Saltzer
        2. dmidecode_cott-working.txt
          22 kB
          Ricky Saltzer
        3. dmidecode_impala-user-thread.txt
          47 kB
          Ricky Saltzer

            People

            • Assignee:
              nong_impala_60e1 Nong Li
              Reporter:
              rickysaltzer Ricky Saltzer