[IMPALA-2373] Extrapolate the number of rows in a scan based on the rows/byte ratio - ASF JIRA

Type: New Feature
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0
Fix Version/s: Impala 2.10.0
Component/s: Frontend
Labels:
- ramp-up

This JIRA is intended to address the following problems:

- Some partitions may be missing the #rows stat.
- Some partitions may have the #rows stat, but it is stale because files were added or dropped since the stat was computed.

The main idea is to use the available #rows stats to extrapolate the missing stats:

- Store an additional statistic, rows/byte, in the TBLPROPERTIES of the table (it could also be rows/kbyte or whichever unit proves most suitable).
- Compute that statistic on the impalad side as part of COMPUTE [INCREMENTAL] STATS, then ship it to the catalogd to be stored in the Metastore.
- During query planning, use the rows/byte statistic to estimate the number of rows scanned for all partitions, regardless of whether a partition has #rows. The rationale is that a partition's #rows may be outdated, and the rows/byte ratio is more robust to data changes.
- Augment SHOW TABLE STATS to display the stored #rows as well as the extrapolated #rows.
- Provide some way of reporting the stored rows/byte ratio for debugging purposes (maybe SHOW TABLE STATS or EXPLAIN?).
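
The arithmetic behind the proposal can be sketched as follows. This is only an illustration, not Impala's frontend code: the `Partition` class and both function names are hypothetical, and the sketch assumes the ratio is simply total known rows divided by total bytes over the partitions that had a #rows stat when stats were computed. Storing the ratio in TBLPROPERTIES is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Partition:
    """Hypothetical per-partition metadata; not one of Impala's classes."""
    size_bytes: int
    num_rows: Optional[int] = None  # None means the #rows stat is missing


def compute_rows_per_byte(partitions: Sequence[Partition]) -> Optional[float]:
    """Derive rows/byte from the partitions that have a #rows stat.

    Assumed to run at COMPUTE STATS time, when #rows and file sizes agree.
    """
    with_stats = [p for p in partitions
                  if p.num_rows is not None and p.size_bytes > 0]
    total_bytes = sum(p.size_bytes for p in with_stats)
    if total_bytes == 0:
        # No usable stats at all: estimating in that case is a stated non-goal.
        return None
    return sum(p.num_rows for p in with_stats) / total_bytes


def extrapolate_rows(scanned: Sequence[Partition],
                     rows_per_byte: Optional[float]) -> Optional[int]:
    """Estimate rows scanned from current file sizes, ignoring stored #rows."""
    if rows_per_byte is None:
        return None
    scanned_bytes = sum(p.size_bytes for p in scanned)
    return round(scanned_bytes * rows_per_byte)


# At stats time: two partitions, both with accurate #rows.
at_stats_time = [Partition(1000, 100), Partition(3000, 300)]
ratio = compute_rows_per_byte(at_stats_time)

# Later: the first partition doubled in size (its stored #rows is now stale)
# and a new partition without any #rows stat was added.
at_query_time = [Partition(2000, 100), Partition(3000, 300), Partition(500)]
estimate = extrapolate_rows(at_query_time, ratio)
```

In this made-up scenario, summing the stored #rows would give 400 and miss both the added data and the new partition, while the byte-based estimate reflects the current file sizes.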

Additional considerations

Non-Goals

- Estimating statistics when there are no stats at all, e.g. purely based on file size without knowing any #rows.
- Extrapolating column stats like NDV in a similar fashion. That is a much more invasive change with a smaller impact.

relates to

IMPALA-6228 More flexible configuration of stats extrapolation

IMPALA-6459 Doc: TABLESAMPLE for COMPUTE STATS

Closed

Alexandra Rodoni