Details
-
Improvement
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
Impala 3.1.0
-
None
-
None
Description
IMPALA-8035 describes how Impala currently estimates the selectivity of inequalities: it lumps all non-equality predicates together and assumes a single 0.1 selectivity for the whole group. As we try to fix that, we hit another issue. The bug described here assumes that inequalities are already treated correctly on a per-predicate basis.
If a query has two inequalities on the same column, and they are in the same “direction”, then only the more restrictive one (the smaller bound for <, the larger bound for >) has any effect. Selectivity estimates should reflect this fact.
select *
from tpch.customer c
where c.c_custkey < 1234 and c.c_custkey < 2345
---- PLAN
PLAN-ROOT SINK
|
00:SCAN HDFS [tpch.customer c]
   partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=28.44K
   predicates: c.c_custkey < 1234, c.c_custkey < 2345
Expected:
00:SCAN HDFS [tpch.customer c]
   partitions=1/1 files=1 size=23.08MB row-size=218B cardinality=49.50K
The selectivity calculation doesn't even need to compare the constants. Just noticing two inequalities on the same column in the same direction is sufficient: count only one of them toward the overall selectivity; it doesn't matter which one.
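A minimal sketch of that rule, in Python rather than the planner's own code. Everything here is hypothetical: the `Predicate` type, the helper names, the 0.33 per-inequality selectivity, and the exponential-backoff combination are assumptions chosen for illustration, not taken from Impala's source. Under those assumptions, and with the 150,000-row TPC-H customer table, the sketch happens to reproduce both cardinalities shown above.

```python
from collections import namedtuple

# Hypothetical, simplified predicate: column name, comparison operator, constant.
Predicate = namedtuple("Predicate", ["column", "op", "value"])

# Assumed per-inequality selectivity (illustrative, not read from Impala).
ASSUMED_INEQUALITY_SEL = 0.33


def direction(op):
    """Return the direction of an inequality operator, or None for other operators."""
    if op in ("<", "<="):
        return "upper"
    if op in (">", ">="):
        return "lower"
    return None


def dedupe_same_direction(preds):
    """Keep one inequality per (column, direction); pass other predicates through."""
    seen = set()
    kept = []
    for p in preds:
        d = direction(p.op)
        if d is None:
            kept.append(p)
            continue
        key = (p.column, d)
        if key not in seen:  # which duplicate is kept does not matter
            seen.add(key)
            kept.append(p)
    return kept


def combined_selectivity(sels):
    """Combine selectivities with exponential backoff (assumed combination rule)."""
    result = 1.0
    for i, s in enumerate(sorted(sels)):
        result *= s ** (1.0 / (i + 1))
    return result


rows = 150_000  # TPC-H customer at scale factor 1
preds = [
    Predicate("c_custkey", "<", 1234),
    Predicate("c_custkey", "<", 2345),
]

# Both predicates counted: roughly 28.44K rows.
before = rows * combined_selectivity([ASSUMED_INEQUALITY_SEL] * len(preds))

# Same-direction duplicates collapsed to one: roughly 49.50K rows.
kept = dedupe_same_direction(preds)
after = rows * combined_selectivity([ASSUMED_INEQUALITY_SEL] * len(kept))
```

The sketch deliberately never looks at the constants 1234 and 2345. A lower and an upper bound on the same column (a range) are different directions, so both are kept.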