Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Version: Impala 2.6.0
Description
The stress test started reporting that some queries return incorrect results. The failure rate is very small: tens of queries out of roughly 10,000. The symptoms look a lot like IMPALA-2987, so I've tried to provide the same kind of information that's included in that bug.
See, for example, this build:
http://sandbox.jenkins.cloudera.com/job/Impala-Stress-Test-Physical/486/
This is the console log containing one of the failed queries:
00:38:57 Process Process-88:
00:38:57 Traceback (most recent call last):
00:38:57   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
00:38:57     self.run()
00:38:57   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
00:38:57     self._target(*self._args, **self._kwargs)
00:38:57   File "tests/stress/concurrent_select.py", line 625, in _start_single_runner
00:38:57     % (query.result_hash, report.result_hash, query.sql))
00:38:57 Exception: Result hash mismatch; expected 3440873535184108466, got 3440873535183883670
00:38:57 Query: select
00:38:57   p_brand,
00:38:57   p_type,
00:38:57   p_size,
00:38:57   count(distinct ps_suppkey) as supplier_cnt
00:38:57 from
00:38:57   partsupp,
00:38:57   part
00:38:57 where
00:38:57   p_partkey = ps_partkey
00:38:57   and p_brand <> 'Brand#45'
00:38:57   and p_type not like 'MEDIUM POLISHED%'
00:38:57   and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
00:38:57   and ps_suppkey not in (
00:38:57     select
00:38:57       s_suppkey
00:38:57     from
00:38:57       supplier
00:38:57     where
00:38:57       s_comment like '%Customer%Complaints%'
00:38:57   )
00:38:57 group by
00:38:57   p_brand,
00:38:57   p_type,
00:38:57   p_size
00:38:57 order by
00:38:57   supplier_cnt desc,
00:38:57   p_brand,
00:38:57   p_type,
00:38:57   p_size
00:38:57
00:38:57 3153 139801718388480 INFO:concurrent_select[374]:Checking for crashes
00:38:57
00:38:57 3153 139801718388480 INFO:concurrent_select[377]:No crashes detected
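For reference, the "result hash mismatch" means the hash computed over the rows returned by this run did not match the hash recorded from a reference run of the same query. The snippet below is only a minimal sketch of an order-insensitive result hash of that general kind; it is not the actual code in tests/stress/concurrent_select.py, and the hashing scheme shown is an assumption made for illustration.

import hashlib

def result_hash(rows):
    # Sum a per-row hash so the final value does not depend on row order.
    # This is only an illustration of the idea behind the stress test's
    # "result hash", not the real implementation.
    total = 0
    for row in rows:
        row_repr = ','.join('NULL' if v is None else str(v) for v in row)
        digest = hashlib.md5(row_repr.encode('utf-8')).hexdigest()
        total += int(digest, 16)
    return total % (1 << 63)

# Two result sets that differ in a single value hash differently,
# while reordering the same rows does not change the hash.
good = [(1, 'Brand#12'), (2, 'Brand#34')]
bad = [(1, 'Brand#12'), (2, 'Brand#35')]
assert result_hash(good) == result_hash(list(reversed(good)))
assert result_hash(good) != result_hash(bad)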
Using the saved artifacts from the Jenkins job (results.zip), we can see that this failure corresponds to query ID 447bd5594405d60:f8b5fcc30b7b2eb6. We find it by grepping for the result hash in the directory containing all the queries and results; the filename is the query ID.
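As a rough sketch of that lookup (the results directory layout and file naming here are assumptions based on the description above; this is just a Python equivalent of a grep):

import os
import sys

def find_query_ids(results_dir, result_hash):
    # Equivalent to `grep -l <hash> <results_dir>/*`: print the names of the
    # result files (i.e. the query IDs) that mention the given result hash.
    for name in sorted(os.listdir(results_dir)):
        path = os.path.join(results_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            if str(result_hash) in f.read():
                print(name)

if __name__ == '__main__':
    # e.g. python find_query_ids.py results 3440873535183883670
    find_query_ids(sys.argv[1], sys.argv[2])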
I grabbed the logs and query profiles off the cluster and put them on impala-desktop in a directory named after this bug ID. From there, I can run a small script in the dev user's home directory (~dev/print_profile) to get the query profile.
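I don't have the contents of ~dev/print_profile handy, so the following is only a hypothetical stand-in. It assumes the profiles were copied off the cluster as impalad profile log files with one entry per query containing the query ID; decoding the encoded profile payload is left to the real script.

import glob
import os
import sys

def grep_profile_logs(log_dir, query_id):
    # Scan the copied profile logs for lines mentioning the query ID and
    # report where they were found. The one-entry-per-line layout and the
    # 'impala_profile_log*' file name pattern are assumptions.
    for path in sorted(glob.glob(os.path.join(log_dir, 'impala_profile_log*'))):
        with open(path) as f:
            for line_no, line in enumerate(f, 1):
                if query_id in line:
                    print('%s:%d: %s...' % (path, line_no, line[:80].rstrip()))

if __name__ == '__main__':
    # e.g. python grep_profile_logs.py logs/ 447bd5594405d60:f8b5fcc30b7b2eb6
    grep_profile_logs(sys.argv[1], sys.argv[2])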
The incorrect query's profile is:
Operator             #Hosts   Avg Time   Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
---------------------------------------------------------------------------------------------------------------------------------
12:MERGING-EXCHANGE       1   51.893ms   51.893ms  27.84K     188.75K          0        -1.00 B  UNPARTITIONED
07:SORT                   8   49.124ms  216.409ms  27.84K     188.75K   24.02 MB       16.00 MB
06:AGGREGATE              8    1s876ms    2s426ms  27.84K     188.75K    2.52 MB       10.00 MB  FINALIZE
11:AGGREGATE              8    9s396ms   12s008ms  11.82M       3.20M  266.11 MB      257.14 MB
10:EXCHANGE               8  373.182ms    1s089ms  11.83M       3.20M          0              0  HASH(p_brand,p_type,p_size)
05:AGGREGATE              8    3s818ms    4s954ms  11.83M       3.20M   12.58 MB      257.14 MB  STREAMING
04:HASH JOIN              8    4s531ms    5s881ms  11.83M       3.20M   10.20 MB        9.07 MB  NULL AWARE LEFT ANTI JOIN, ...
|--09:EXCHANGE            8  406.771us    2.317ms     479     100.00K          0              0  BROADCAST
|  02:SCAN HDFS           7    4s091ms    5s725ms     479     100.00K   16.15 MB       48.00 MB  tpch_100_parquet.supplier
03:HASH JOIN              8   23s811ms   27s344ms  11.83M       3.20M  266.17 MB       64.29 MB  INNER JOIN, BROADCAST
|--08:EXCHANGE            8    3s183ms    3s883ms   2.97M     800.00K          0              0  BROADCAST
|  01:SCAN HDFS           7    3s915ms    4s865ms   2.97M     800.00K   96.45 MB      352.00 MB  tpch_100_parquet.part
00:SCAN HDFS              8    4s051ms    6s761ms  80.00M      80.00M   39.94 MB      176.00 MB  tpch_100_parquet.partsupp
Compare this to a profile of the same query that returned correct results. There are many such runs since we re-run the queries. Here's 134425b2e619c627:48b806a7130653b4:
Operator             #Hosts   Avg Time   Max Time   #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
---------------------------------------------------------------------------------------------------------------------------------
12:MERGING-EXCHANGE       1   27.808ms   27.808ms  27.84K     188.75K          0        -1.00 B  UNPARTITIONED
07:SORT                   8   13.794ms   21.751ms  27.84K     188.75K   24.02 MB       16.00 MB
06:AGGREGATE              8    1s321ms    1s714ms  27.84K     188.75K    2.52 MB       10.00 MB  FINALIZE
11:AGGREGATE              8    1s860ms    2s663ms  11.88M       3.20M  171.10 MB      257.14 MB
10:EXCHANGE               8  717.939ms    1s620ms  11.88M       3.20M          0              0  HASH(p_brand,p_type,p_size)
05:AGGREGATE              8    3s784ms    5s524ms  11.88M       3.20M   12.71 MB      257.14 MB  STREAMING
04:HASH JOIN              8  234.658ms  325.862ms  11.88M       3.20M    2.69 MB        9.07 MB  NULL AWARE LEFT ANTI JOIN, ...
|--09:EXCHANGE            8  158.408us  333.299us     479     100.00K          0              0  BROADCAST
|  02:SCAN HDFS           7    1s807ms    2s859ms     479     100.00K   16.15 MB       48.00 MB  tpch_100_parquet.supplier
03:HASH JOIN              8    3s331ms    3s921ms  11.89M       3.20M  394.08 MB       64.29 MB  INNER JOIN, BROADCAST
|--08:EXCHANGE            8    2s387ms    6s663ms   2.97M     800.00K          0              0  BROADCAST
|  01:SCAN HDFS           7    1s452ms    2s243ms   2.97M     800.00K   96.28 MB      352.00 MB  tpch_100_parquet.part
00:SCAN HDFS              8    1s084ms    2s400ms  80.00M      80.00M   39.99 MB      176.00 MB  tpch_100_parquet.partsupp
What follows is the full list of query IDs that were reported as having hash mismatches:
447bd5594405d60:f8b5fcc30b7b2eb6
5642c03e84fec850:e4c4e7d4b57e5291
bf4ada8ece00f3b4:31e11d41904adfa4
2d4e8ec89dc9f072:67cf3c6843ed7bd
a46cf7d01964dff:9e9eabb44ca2ca9a
af4490619ef34246:f08c4695612c4c85
7443d268a0ba7201:3893ffcc845a78d
e34dfe2fb3ff9f33:c0d18ac4f45b978b
f4ac95f838ad22e:1f9a0a66dafa46a1
7d4713f6b51241b1:a8c7c8d00ea6a49c
It's interesting to note that these 10 failures come from only 2 distinct queries. Some fail on the query above; the other query is this:
select
  c_custkey,
  c_name,
  sum(l_extendedprice * (1 - l_discount)) as revenue,
  c_acctbal,
  n_name,
  c_address,
  c_phone,
  c_comment
from
  customer,
  orders,
  lineitem,
  nation
where
  c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate >= '1993-10-01'
  and o_orderdate < '1994-01-01'
  and l_returnflag = 'R'
  and c_nationkey = n_nationkey
group by
  c_custkey,
  c_name,
  c_acctbal,
  c_phone,
  n_name,
  c_address,
  c_comment
order by
  revenue desc
limit 20
I'll note that this started on April 21, but the test framework didn't fail the job; IMPALA-3487 has been filed to change that. What brought this to my attention was a crash that happened yesterday in the same job; that's IMPALA-3485.
I think everything someone needs to look into this further is in the appropriate directory on impala-desktop.