[IMPALA-6491] More robust HBase scan cardinality estimation - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
Fix Version/s: None
Component/s: Frontend
Labels:
- planner

Description

There are a few issues with our HBase scan cardinality estimation:
1. The cardinality estimates can be very inaccurate, leading to bad plan choices. In particular, users have reported cases of severe underestimation, which can have a ripple effect through the query plan (e.g., the planner thinks a join with that table is selective).
2. Unlike HDFS scans, we do not use row count statistics from the Hive Metastore for estimating the cardinality of HBase scans. Instead, we do a small scan over the HBase table and estimate a row count based on the average bytes per row and the storefile size.
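
To illustrate item 2, here is a minimal Python sketch of the sampling-based arithmetic (row count ≈ storefile size / average bytes per row). It is not the Impala frontend code; the function name `estimate_row_count_from_sample` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def estimate_row_count_from_sample(
    sampled_row_bytes: Sequence[int], storefile_bytes: int
) -> Optional[int]:
    """Estimate a row count from a small sample scan (illustrative only).

    sampled_row_bytes: sizes in bytes of the rows seen by the sample scan
                       (hypothetical input, not an Impala data structure).
    storefile_bytes:   total storefile size of the scanned regions.
    Returns None when no estimate can be made.
    """
    if not sampled_row_bytes or storefile_bytes <= 0:
        return None
    avg_bytes_per_row = sum(sampled_row_bytes) / len(sampled_row_bytes)
    if avg_bytes_per_row <= 0:
        return None
    return int(round(storefile_bytes / avg_bytes_per_row))


# If the sampled rows are representative (100 bytes each), the estimate is 10000 rows.
representative = estimate_row_count_from_sample([100, 100, 100], 1_000_000)

# If the sample happens to hit unusually large rows (1000 bytes each),
# the same storefile size yields only 1000 rows, a 10x underestimate.
skewed = estimate_row_count_from_sample([1000, 1000, 1000], 1_000_000)
```

Because the result depends entirely on the sampled average, a small sample that is not representative of the table shifts the estimate proportionally, which is one way the underestimation in item 1 can arise.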

There are other, more detailed caveats with the HBase estimation method.

The original motivation of this method was to adjust the row count for queries that only scan a subset of the region servers (the HMS statistics only cover the entire table).

Proposal
To address these shortcomings, we could start with the table-level row count stored in the Metastore and then adjust that number based on the total number of bytes in the table and the number of bytes in the relevant region servers.
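
One possible reading of this proposal, sketched in Python: scale the Metastore row count by the fraction of the table's bytes that live in the relevant region servers. The proportional scaling, the function name `scale_table_row_count`, and the convention that a negative row count means "no statistics" are assumptions made for this sketch, not details confirmed by the issue.

```python
from typing import Optional


def scale_table_row_count(
    table_row_count: int, total_table_bytes: int, scanned_region_bytes: int
) -> Optional[int]:
    """Scale a table-level row count to the regions a query scans (illustrative only).

    table_row_count:      row count from the Metastore; negative is treated
                          here as "no statistics" (assumed convention).
    total_table_bytes:    total bytes across all regions of the table.
    scanned_region_bytes: bytes in the regions relevant to the scan.
    Returns None when the inputs do not allow an estimate.
    """
    if table_row_count < 0 or total_table_bytes <= 0:
        return None
    # Clamp the fraction to [0, 1] so inconsistent size metrics cannot
    # produce an estimate larger than the table itself.
    fraction = min(max(scanned_region_bytes / total_table_bytes, 0.0), 1.0)
    return int(round(table_row_count * fraction))


# A scan touching a quarter of the table's bytes is estimated at a quarter of the rows.
partial_scan = scale_table_row_count(1_000_000, 400, 100)

# A full-table scan keeps the Metastore row count unchanged.
full_scan = scale_table_row_count(1_000, 400, 400)
```

Under this reading, the estimate for a full-table scan equals the Metastore statistic exactly, and only scans restricted to a subset of regions are adjusted.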

People

Assignee: Unassigned

Reporter: Alexander Behm

Votes: 0

Watchers: 1

Dates

Created: 07/Feb/18 23:54

Updated: 07/Feb/18 23:54