Details
Bug
Status: Resolved
Major
Resolution: Fixed
Impala 1.0.1
None
* CentOS 5.x
* Newer hardware (dmidecode attached)
Description
We have a very strange LLVM issue on our cluster which has been preventing us from using codegen for quite some time.
Symptoms
The symptoms are simple: executing a query that uses runtime code generation on a DOUBLE column not only fails, but also crashes every daemon that took part in the query. The following behavior is observed when issuing the query from the shell:
Query: <some offending codegen query>
Query finished, fetching results ...
Error communicating with impalad: TSocket read 0 bytes
From an operations point of view, these nodes die with no warning or error. With GLOG set to 2, the last message logged was that LLVM was going to be used.
The Queries
As mentioned in the title, the crashes seem to only be reproducible with doubles.
Here is an example query with LLVM on and off, calculating the AVG of a column.
The Table
+-----------------+--------+---------+
| name            | type   | comment |
+-----------------+--------+---------+
| o_orderkey      | int    |         |
| o_custkey       | int    |         |
| o_orderstatus   | string |         |
| o_totalprice    | double |         |
| o_orderdate     | string |         |
| o_orderpriority | string |         |
| o_clerk         | string |         |
| o_shippriority  | int    |         |
| o_comment       | string |         |
+-----------------+--------+---------+
LLVM ON
[impala-node:21000] > SELECT AVG(o_totalprice) FROM orders;
Query: select AVG(o_totalprice) FROM orders
Unknown Exception : (104, 'Connection reset by peer')
Query aborted, unable to fetch data
LLVM OFF
[impala-node:21000] > SET DISABLE_CODEGEN=true;
DISABLE_CODEGEN set to true
[impala-node:21000] > SELECT AVG(o_totalprice) FROM orders;
Query: select AVG(o_totalprice) FROM orders
Query finished, fetching results ...
+-------------------+
| avg(o_totalprice) |
+-------------------+
| 141826.4553346672 |
+-------------------+
Returned 1 row(s) in 1.15s
With GLOG = 2, here is the last line observed before the daemon crashes unexpectedly.
I0619 15:35:03.984278 308 hdfs-scanner.cc:105] HdfsTextScanner(node_id=0) using llvm codegend functions. EOF
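When triaging several crashed nodes, it can help to confirm mechanically that each daemon's log ends on the same codegen notice. The following is a minimal Python sketch, assuming the standard glog line prefix (severity letter, mmdd, timestamp, thread id, file:line). The helper names `last_glog_entry` and `died_after_codegen_message` are hypothetical and are not part of Impala or glog.

```python
import re

# glog prefix: severity letter, mmdd, time, thread id, file:line, then "] message".
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def last_glog_entry(log_text):
    """Return the last parseable glog entry in log_text as a dict, or None."""
    last = None
    for raw in log_text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        if match:
            last = match.groupdict()
    return last


def died_after_codegen_message(log_text):
    """True if the final parseable entry is the scanner's codegen notice."""
    entry = last_glog_entry(log_text)
    return entry is not None and "using llvm codegend functions" in entry["message"]


# The log line quoted in this report, used as sample input.
sample = (
    "I0619 15:35:03.984278 308 hdfs-scanner.cc:105] "
    "HdfsTextScanner(node_id=0) using llvm codegend functions."
)
print(last_glog_entry(sample)["source"], died_after_codegen_message(sample))
```

Running this over each node's log would show whether every crashed daemon stopped at the same point, which is what the single log excerpt above suggests but does not prove.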
Doubles Only Reasoning
The issue seems to affect only doubles. AVG, MAX, etc. all work with INTs, but I think the following pair of queries shows it best.
LLVM ON With INTS
[impala-node:21000] > SELECT DISTINCT(o_custkey = 1) FROM orders;
Query: select DISTINCT(o_custkey = 1) FROM orders
Query finished, fetching results ...
+---------------+
| o_custkey = 1 |
+---------------+
| true          |
| false         |
+---------------+
Returned 2 row(s) in 0.74s
LLVM ON With INTS & DOUBLES
The same query with the literal changed from 1 to 1.0:
[impala-node:21000] > SELECT DISTINCT(o_custkey = 1.0) FROM orders;
Query: select DISTINCT(o_custkey = 1.0) FROM orders
Query finished, fetching results ...
Error communicating with impalad: TSocket read 0 bytes
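To repeat this comparison across many queries or nodes, the shell transcripts can be classified by the client-side error strings seen above. This is only a rough Python sketch: the two crash signatures are copied from the transcripts in this report, while the function names and the "Returned ... row(s)" success check are assumptions made for illustration.

```python
# Client-side error strings copied from the transcripts in this report.
CRASH_SIGNATURES = (
    "TSocket read 0 bytes",
    "Connection reset by peer",
)


def classify_shell_output(output):
    """Label one impala-shell transcript as 'crash', 'ok', or 'unknown'."""
    # Check for crash strings first: a crashing query can still print
    # "Query finished, fetching results ..." before the connection drops.
    if any(signature in output for signature in CRASH_SIGNATURES):
        return "crash"
    if "Returned" in output and "row(s)" in output:
        return "ok"
    return "unknown"


def summarize(runs):
    """Map each (label, transcript) pair to its classification."""
    return {label: classify_shell_output(text) for label, text in runs}


# Abbreviated transcripts taken from the two DISTINCT queries above.
runs = [
    ("int literal, codegen on",
     "Query finished, fetching results ...\nReturned 2 row(s) in 0.74s"),
    ("double literal, codegen on",
     "Query finished, fetching results ...\n"
     "Error communicating with impalad: TSocket read 0 bytes"),
]
print(summarize(runs))
```

A table of such labels per query and per node makes it easier to see whether the crash really tracks DOUBLE expressions and nothing else.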
Configuration Testing
This bug seems to occur only on this specific hardware running CentOS 5.x. I reformatted this node to CentOS 6.x and the issue disappeared, so it appears to affect only this hardware on this family of Red Hat releases.
In addition, I attempted to reproduce this using systest, virtual machines, etc on CentOS 5.x and was unsuccessful.
Logs
Attached is a GLOG 2 log of an LLVM query failing. I tried to analyze a core dump, but because it contains JIT-compiled code, GDB is unable to look into it and reports an Unsupported JIT Version error.
With --module_output=/tmp/module.out added, module output is written for ALL LLVM queries except the ones that crash. I'm assuming those queries crash before reaching the point where module output is written.