Details
-
Improvement
-
Status: Open
-
Major
-
Resolution: Unresolved
-
Impala 3.1.0
-
None
-
None
Description
A query may contain an inequality predicate. TPC-H has many, such as l_shipdate <= '1998-09-02'.
The planner must know the selectivity of each predicate applied to filter a table. The selectivity of an inequality cannot be estimated from just the NDV value available in the catalog. As a result, most systems assume a fixed value of around 0.3 or 0.4. (Textbooks recommend 0.3.)
The query optimization literature notes that the best way to estimate an inequality is with histograms. It also describes a cheaper alternative:
- Assume uniform value distribution, and
- Gather the minimum and maximum column values.
Given these, it is easy to estimate the selectivity of an inequality as:
sel(c < x) = (x - min(c)) / (max(c) - min(c))
sel(c > x) = (max(c) - x) / (max(c) - min(c))
The cost is just two extra values per column rather than the full cost of a histogram.
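The min/max estimate can be sketched in a few lines of Python. This is only an illustration, not Impala's planner code: the function names, the 0.3 fallback constant, the clamping to [0, 1], and the conversion of dates to day ordinals are all assumptions made for the example. Under the uniform (continuous) assumption, <= and < get the same estimate.

```python
from datetime import date

# Hypothetical fallback when min/max stats are missing or unusable
# (mirrors the fixed guess mentioned in the description).
DEFAULT_SELECTIVITY = 0.3


def _to_number(value):
    """Map a column value onto a numeric axis (dates become day ordinals)."""
    if isinstance(value, date):
        return value.toordinal()
    return float(value)


def _range(col_min, col_max):
    """Return (lo, hi) as numbers, or None if the stats cannot be used."""
    if col_min is None or col_max is None:
        return None
    lo, hi = _to_number(col_min), _to_number(col_max)
    if hi <= lo:  # single-valued or inconsistent stats: formula is undefined
        return None
    return lo, hi


def _clamp(fraction):
    """Keep the estimate inside [0, 1] when x falls outside [min, max]."""
    return min(1.0, max(0.0, fraction))


def sel_less_than(x, col_min, col_max):
    """sel(c < x) = (x - min(c)) / (max(c) - min(c)), assuming uniform values."""
    bounds = _range(col_min, col_max)
    if bounds is None:
        return DEFAULT_SELECTIVITY
    lo, hi = bounds
    return _clamp((_to_number(x) - lo) / (hi - lo))


def sel_greater_than(x, col_min, col_max):
    """sel(c > x) = (max(c) - x) / (max(c) - min(c)), assuming uniform values."""
    bounds = _range(col_min, col_max)
    if bounds is None:
        return DEFAULT_SELECTIVITY
    lo, hi = bounds
    return _clamp((hi - _to_number(x)) / (hi - lo))
```

For example, with a column whose values range from 0 to 10, sel_less_than(5, 0, 10) and sel_greater_than(5, 0, 10) both give 0.5. A predicate value outside the column's range is clamped to 0 or 1, and missing stats fall back to the fixed guess.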
Issue Links
- relates to: IMPALA-8042 Better selectivity estimate for BETWEEN (Resolved)