[IMPALA-8058] HBase scan cardinality division-by-zero leads to bogus cardinality - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 3.1.0
Fix Version/s: Impala 3.2.0
Component/s: Frontend
Labels: None

Description

A particular HBase query has highly selective key filters and hits a bug in the planner that produces a bogus, huge cardinality value.

HBaseScanNode.computeStats() attempts to compute the table cardinality by calling HBaseTable.getEstimatedRowStats(). In the latest versions, this in turn calls FeHBaseTable.getEstimatedRowStats().

This code tries to estimate cardinality by:

Scanning a set of regions.
Getting the size of each region.
Sampling a number of rows and averaging their sizes to estimate the row width.

Once we know the size of the regions we need to scan, and the average row width, we can compute the scan cardinality.

The problem in this particular query is that the predicates are so selective that no regions match. As a result, no rows are sampled and the average row width is zero. We divide (as a double) the region size by 0 and get positive infinity. We cast that to a long and get Long.MAX_VALUE. We then use that as our (highly bogus) cardinality estimate.
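The arithmetic can be reproduced in isolation. The following is a minimal sketch, not the Impala code: the class name, method name, and input values are hypothetical, and it assumes a positive region size. It only shows how Java turns a floating-point divide by zero into Long.MAX_VALUE after the cast.

```java
// Hypothetical stand-alone demo; none of these names come from Impala.
final class DivideByZeroCardinalityDemo {

    // Assumed shape of the estimate: region bytes divided by average row width.
    static long naiveCardinality(long regionBytes, double avgRowWidth) {
        // regionBytes is promoted to double, so dividing by 0.0 does not throw;
        // for a positive numerator it yields Double.POSITIVE_INFINITY.
        double rows = regionBytes / avgRowWidth;
        // A narrowing cast of positive infinity saturates at Long.MAX_VALUE.
        return (long) rows;
    }

    public static void main(String[] args) {
        // Normal case: 1,000,000 bytes at 100 bytes per row.
        System.out.println(naiveCardinality(1_000_000L, 100.0));
        // No sampled rows: width is 0, so the result is Long.MAX_VALUE.
        System.out.println(naiveCardinality(1_000_000L, 0.0) == Long.MAX_VALUE);
    }
}
```

Because no exception is thrown anywhere along this path, the bad value flows silently into the plan.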

The code must:

Detect the division-by-zero (no sample rows) case.
Use an alternative estimate (such as multiplying the total table row count from HMS by the filter selectivity).
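One possible shape for that guard is sketched below. This is an illustration under stated assumptions, not the committed fix: the class, method, and parameter names are hypothetical, and the use of -1 for "unknown cardinality" is an assumption made for the sketch.

```java
// Hypothetical sketch of the two requirements above; not the actual Impala patch.
final class GuardedCardinalityEstimate {

    // Assumption: -1 signals "cardinality unknown" to the caller.
    static final long UNKNOWN = -1L;

    static long estimate(long regionBytes, double avgRowWidth,
                         long hmsRowCount, double selectivity) {
        // Requirement 1: only divide when rows were actually sampled.
        if (avgRowWidth > 0.0) {
            return (long) Math.ceil(regionBytes / avgRowWidth);
        }
        // Requirement 2: fall back to the HMS row count scaled by selectivity.
        if (hmsRowCount < 0) {
            return UNKNOWN; // no table stats available either
        }
        return Math.round(hmsRowCount * selectivity);
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_000_000L, 100.0, 5_000L, 0.1)); // sampled path
        System.out.println(estimate(1_000_000L, 0.0, 5_000L, 0.1));   // fallback path
        System.out.println(estimate(1_000_000L, 0.0, -1L, 0.1));      // no stats
    }
}
```

Testing the row width directly, rather than checking the result for infinity afterwards, keeps the bogus value from ever being computed.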

People

Assignee: Paul Rogers

Reporter: Paul Rogers

Votes: 0

Watchers: 6

Dates

Created: 08/Jan/19 23:59

Updated: 04/Sep/19 17:57

Resolved: 15/Mar/19 00:00