Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
Impala 3.1.0
-
None
-
ghx-label-7
Description
Certain queries have the pattern of ordery-by with a limit clause. Due to the limit clause, we may not include all the rows for a given order-by keys in the output. It's non-deterministic that which rows will be included in the output. Given the stress test will store the hashes of the query result obtained during binary search, we may sometimes get false alarm on result mis-match. One way to fix it is that we need to order all the columns so that the results will always be deterministic.
For instance, the following TPC-H Q3 query may have multiple correct results:
Default Db: tpch_10000_parquet Sql Statement: /* Mem: 8152 MB. Coordinator: vc1503.halxg.cloudera.com. */ select l_orderkey, sum(l_extendedprice * (1 - l_discount)) as revenue, o_orderdate, o_shippriority from customer, orders, lineitem where c_mktsegment = 'BUILDING' and c_custkey = o_custkey and l_orderkey = o_orderkey and o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10
Correct result 1:
+-------------+-------------+-------------+----------------+ | l_orderkey | revenue | o_orderdate | o_shippriority | +-------------+-------------+-------------+----------------+ | 18134335556 | 510414.0360 | 1995-02-13 | 0 | | 52494073892 | 510414.0360 | 1995-02-13 | 0 | | 26724270146 | 510414.0360 | 1995-02-13 | 0 | | 954466400 | 510414.0360 | 1995-02-13 | 0 | | 9544400966 | 510414.0360 | 1995-02-13 | 0 | | 35314204736 | 510414.0360 | 1995-02-13 | 0 | | 43904139302 | 510414.0360 | 1995-02-13 | 0 | | 7133834945 | 503732.3318 | 1995-02-25 | 0 | | 24313704101 | 503732.3318 | 1995-02-25 | 0 | | 58673442437 | 503732.3318 | 1995-02-25 | 0 | +-------------+-------------+-------------+----------------+
Correct result 2:
+-------------+-------------+-------------+----------------+ | l_orderkey | revenue | o_orderdate | o_shippriority | +-------------+-------------+-------------+----------------+ | 26724270146 | 510414.0360 | 1995-02-13 | 0 | | 43904139302 | 510414.0360 | 1995-02-13 | 0 | | 35314204736 | 510414.0360 | 1995-02-13 | 0 | | 52494073892 | 510414.0360 | 1995-02-13 | 0 | | 18134335556 | 510414.0360 | 1995-02-13 | 0 | | 9544400966 | 510414.0360 | 1995-02-13 | 0 | | 954466400 | 510414.0360 | 1995-02-13 | 0 | | 32903638691 | 503732.3318 | 1995-02-25 | 0 | | 15723769511 | 503732.3318 | 1995-02-25 | 0 | | 50083507847 | 503732.3318 | 1995-02-25 | 0 | +-------------+-------------+-------------+----------------+
If one changes the limit clause to 15, we will see that the last 3 rows should be a subset of rows with (503732.3318, 1995-02-25) as the ordering keys revenue and o_orderdate are identical so it's non-deterministic which rows are shown. The result will be deterministic if the order-by clause also include l_orderkey and o_shippriority
+-------------+-------------+-------------+----------------+ | l_orderkey | revenue | o_orderdate | o_shippriority | +-------------+-------------+-------------+----------------+ | 26724270146 | 510414.0360 | 1995-02-13 | 0 | | 43904139302 | 510414.0360 | 1995-02-13 | 0 | | 35314204736 | 510414.0360 | 1995-02-13 | 0 | | 52494073892 | 510414.0360 | 1995-02-13 | 0 | | 18134335556 | 510414.0360 | 1995-02-13 | 0 | | 9544400966 | 510414.0360 | 1995-02-13 | 0 | | 954466400 | 510414.0360 | 1995-02-13 | 0 | | 32903638691 | 503732.3318 | 1995-02-25 | 0 | | 15723769511 | 503732.3318 | 1995-02-25 | 0 | | 50083507847 | 503732.3318 | 1995-02-25 | 0 | | 7133834945 | 503732.3318 | 1995-02-25 | 0 | | 41493573281 | 503732.3318 | 1995-02-25 | 0 | | 24313704101 | 503732.3318 | 1995-02-25 | 0 | | 58673442437 | 503732.3318 | 1995-02-25 | 0 | | 54409232481 | 499104.0645 | 1995-03-10 | 0 | +-------------+-------------+-------------+----------------+