Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Impala 2.5.0
Description
In a stress test run on the physical cluster, 5 out of ~300 TPC-H queries on Parquet returned incorrect results (the 5 are not all the same query).
The archived profile for one of these queries shows numbers that differ from a correct run, especially for 2 of the exchanges (compare #Rows for 12:EXCHANGE and 09:EXCHANGE in the two summaries below).
Operator             #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
----------------------------------------------------------------------------------------------------------------------------------
15:MERGING-EXCHANGE  1       134.504us  134.504us  140      20          0          -1.00 B        UNPARTITIONED
08:TOP-N             7       243.026ms  269.440ms  140      20          68.00 KB   4.88 KB
14:AGGREGATE         7       1m10s      8m4s       3.33M    20.00M      154.78 MB  5.11 GB        FINALIZE
13:EXCHANGE          7       171.997ms  212.488ms  3.33M    20.00M      0          0              HASH(c_custkey,c_name,c_acc...
07:AGGREGATE         7       1m10s      8m3s       3.33M    20.00M      186.28 MB  5.11 GB
06:HASH JOIN         7       142.926ms  182.497ms  9.83M    20.00M      2.04 MB    855.00 B       INNER JOIN, BROADCAST
|--12:EXCHANGE       7       6.539us    7.819us    25       25          0          0              BROADCAST
|  03:SCAN HDFS      1       6.666ms    6.666ms    25       25          58.00 KB   32.00 MB       tpch_100_parquet.nation
05:HASH JOIN         7       4s179ms    4s900ms    11.46M   20.00M      586.03 MB  491.14 MB      INNER JOIN, PARTITIONED
|--11:EXCHANGE       7       1m39s      11m27s     15.00M   15.00M      0          0              HASH(c_custkey)
|  00:SCAN HDFS      7       857.603ms  1s963ms    15.00M   15.00M      189.95 MB  616.00 MB      tpch_100_parquet.customer
10:EXCHANGE          7       124.665ms  214.146ms  11.46M   20.00M      0          0              HASH(o_custkey)
04:HASH JOIN         7       1m21s      8m14s      11.46M   20.00M      650.02 MB  660.90 MB      INNER JOIN, BROADCAST
|--09:EXCHANGE       7       1s092ms    1s890ms    5.73M    15.00M      0          0              BROADCAST
|  01:SCAN HDFS      7       2m19s      16m6s      5.73M    15.00M      141.39 MB  264.00 MB      tpch_100_parquet.orders
02:SCAN HDFS         7       1s190ms    1s563ms    148.07M  200.01M     72.87 MB   352.00 MB      tpch_100_parquet.lineitem
A correct summary is
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
| Operator            | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                                                              |
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
| 15:MERGING-EXCHANGE | 1      | 124.17us | 124.17us | 140     | 20         | 0 B       | -1 B          | UNPARTITIONED                                                       |
| 08:TOP-N            | 7      | 164.38ms | 183.72ms | 140     | 20         | 68.00 KB  | 4.88 KB       |                                                                     |
| 14:AGGREGATE        | 7      | 727.67ms | 750.61ms | 3.88M   | 20.00M     | 194.13 MB | 5.11 GB       | FINALIZE                                                            |
| 13:EXCHANGE         | 7      | 81.16ms  | 84.50ms  | 3.88M   | 20.00M     | 0 B       | 0 B           | HASH(c_custkey,c_name,c_acctbal,c_phone,n_name,c_address,c_comment) |
| 07:AGGREGATE        | 7      | 958.25ms | 1.01s    | 3.88M   | 20.00M     | 186.28 MB | 5.11 GB       |                                                                     |
| 06:HASH JOIN        | 7      | 67.95ms  | 70.91ms  | 11.46M  | 20.00M     | 2.04 MB   | 855 B         | INNER JOIN, BROADCAST                                               |
| |--12:EXCHANGE      | 7      | 8.83us   | 10.81us  | 175     | 25         | 0 B       | 0 B           | BROADCAST                                                           |
| |  03:SCAN HDFS     | 1      | 167.55ms | 167.55ms | 25      | 25         | 58.00 KB  | 32.00 MB      | tpch_100_parquet.nation                                             |
| 05:HASH JOIN        | 7      | 714.31ms | 750.06ms | 11.46M  | 20.00M     | 586.03 MB | 491.14 MB     | INNER JOIN, PARTITIONED                                             |
| |--11:EXCHANGE      | 7      | 277.96ms | 286.59ms | 15.00M  | 15.00M     | 0 B       | 0 B           | HASH(c_custkey)                                                     |
| |  00:SCAN HDFS     | 7      | 449.21ms | 1.05s    | 15.00M  | 15.00M     | 317.50 MB | 616.00 MB     | tpch_100_parquet.customer                                           |
| 10:EXCHANGE         | 7      | 45.12ms  | 49.42ms  | 11.46M  | 20.00M     | 0 B       | 0 B           | HASH(o_custkey)                                                     |
| 04:HASH JOIN        | 7      | 2.90s    | 3.01s    | 11.46M  | 20.00M     | 650.02 MB | 660.90 MB     | INNER JOIN, BROADCAST                                               |
| |--09:EXCHANGE      | 7      | 503.27ms | 560.62ms | 40.11M  | 15.00M     | 0 B       | 0 B           | BROADCAST                                                           |
| |  01:SCAN HDFS     | 7      | 730.01ms | 858.07ms | 5.73M   | 15.00M     | 228.06 MB | 264.00 MB     | tpch_100_parquet.orders                                             |
| 02:SCAN HDFS        | 7      | 136.49ms | 153.02ms | 148.07M | 200.01M    | 702.55 MB | 352.00 MB     | tpch_100_parquet.lineitem                                           |
+---------------------+--------+----------+----------+---------+------------+-----------+---------------+---------------------------------------------------------------------+
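Comparing the two summaries by eye is error-prone, so a small script could list the operators whose #Rows differ. The following is only a sketch: the function names are hypothetical, and it assumes every operator row has the shape "id:NAME, #Hosts, Avg Time, Max Time, #Rows" as in the summaries above. The #Rows cells are compared as text, so no assumption is made about what suffixes such as "M" mean.

```python
import re

# Operator id and name, then #Hosts, Avg Time, Max Time, #Rows.
ROW = re.compile(r"(\d+:[A-Z][A-Z -]*?)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)")


def rows_by_operator(summary):
    """Map each operator (e.g. '09:EXCHANGE') to its #Rows cell, kept as text."""
    result = {}
    for line in summary.splitlines():
        # Table borders and tree markers ('|', '|--') carry no data here.
        match = ROW.search(line.replace("|", " "))
        if match:
            result[match.group(1)] = match.group(5)
    return result


def diff_rows(summary_a, summary_b):
    """Return (operator, rows_a, rows_b) for operators whose #Rows differ."""
    a = rows_by_operator(summary_a)
    b = rows_by_operator(summary_b)
    return [(op, a[op], b[op]) for op in a if op in b and a[op] != b[op]]
```

Calling diff_rows(incorrect_summary, correct_summary) with the two summaries pasted in as strings returns one tuple per operator whose row count differs, in the order the operators appear in the first summary.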
The logs have messages like
data-stream-mgr.cc:111] Datastream sender timed-out waiting for recvr for fragment instance: 72486d469653b918:e02b80cc64dfe98d (time-out was: 1m). If query was cancelled, this is not an error.
But many other queries have the same messages. (I haven't checked whether they returned results; they could have errored due to memory limits.)
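To check which fragment instances hit the sender timeout, the IDs could be pulled out of the log and compared against the instances listed in the query profiles. The following is a minimal sketch, assuming log lines in exactly the format quoted above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the timeout message quoted above and captures the instance ID.
TIMEOUT = re.compile(
    r"Datastream sender timed-out waiting for recvr for fragment instance: "
    r"([0-9a-f]+:[0-9a-f]+)"
)


def timed_out_instances(log_lines):
    """Count sender timeouts per fragment instance ID."""
    counts = Counter()
    for line in log_lines:
        match = TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to timed_out_instances works as is, since iterating over a file yields its lines.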