[IMPALA-8037] Planner makes poor estimations for non-equality predicates - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 3.1.0
Fix Version/s: None
Component/s: Frontend
Labels:
None

Epic Color:
ghx-label-3

Description

Consider the TPC-H queries. Here is a part of one of them:

select
  count(*)
from tpch.customer c,
  tpch.orders o
where
  c.c_mktsegment = 'BUILDING'
  and c.c_custkey = o.o_custkey
  and o_orderdate < '1995-03-15'

This produces the following cardinality estimates (using the revised, standard join cardinality estimator):

| 02:HASH JOIN [INNER JOIN, BROADCAST]                                      |
| |  hash predicates: o.o_custkey = c.c_custkey                             |
| |  runtime filters: RF000 <- c.c_custkey                                  |
| |  row-size=59B cardinality=30.00K                                        |
| |                                                                         |
| |--04:EXCHANGE [BROADCAST]                                                |
| |  |                                                                      |
| |  00:SCAN HDFS [tpch.customer c]                                         |
| |     partitions=1/1 files=1 size=23.08MB row-size=29B cardinality=30.00K |
| |     predicates: c.c_mktsegment = 'BUILDING'                             |
| |                                                                         |
| 01:SCAN HDFS [tpch.orders o]                                              |
|    partitions=1/1 files=1 size=162.56MB row-size=30B cardinality=150.00K  |
|    predicates: o_orderdate < '1995-03-15'                                 |
|    runtime filters: RF000 -> o.o_custkey                                  |

Run the query, the actual count is 147K vs. the estimate of 30K. The culprit here is the HDFS Scan node with an estimated cardinality of 150K. Run just this query:

select count(*)
from tpch.orders o
where
  o_orderdate < '1995-03-15'

With this plan:

| 00:SCAN HDFS [tpch.orders o]                                             |
|    partitions=1/1 files=1 size=162.56MB row-size=22B cardinality=150.00K |
|    predicates: o_orderdate < '1995-03-15'                                |

With with an actual result of 727K. Had this been used in the previous plan instead of 30K, the join cardinality would have been about four times larger, or 120K, which is far closer to the actual total.

So, the problem is the mis-estimation of the inequality, which are very hard to estimate. The actual selectivity, found by running a count(*) on the table, is 727K / 1.5M =~ 0.5. But, Impala estimated 30K / 1.5M = 0.02. Most systems estimate around 0.45 or so.

For example, using a default estimate of 1/3 for inequality provides the following plan, which is much better:

03:HASH JOIN [INNER JOIN]
|  hash predicates: c.c_custkey = o.o_custkey
|  runtime filters: RF002 <- o.o_custkey
|  row-size=71B cardinality=120.00K
|
|--01:SCAN HDFS [tpch.orders o]
|     partitions=1/1 files=1 size=162.56MB row-size=42B cardinality=600.00K
|     predicates: o_orderdate < '1995-03-15'
|     runtime filters: RF000 -> o.o_orderkey

Makes one wonder if the data set was designed with the industry-standard defaults in mind...

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Paul Rogers

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 31/Dec/18 20:32

Updated:: 22/Feb/19 00:00