This JIRA is intended to address the following problems:
- Some partitions may be missing the #rows stat
- Some partitions may have the #rows stat, but it is stale because files were added/dropped since the stat was computed
The main idea is to use the available #rows stats to extrapolate the missing stats:
- Store an additional statistic rows/byte in the TBLPROPERTIES of the table (could also be rows/kbyte or whatever seems most suitable)
- That statistic is computed as part of COMPUTE [INCREMENTAL] STATS on the impalad side, and then shipped to the catalogd to be stored in the Metastore
- During query planning we use the rows/byte statistic to estimate the number of rows scanned for all partitions regardless of whether a partition has #rows or not. The rationale is that the #rows of a partition may be outdated and using the rows/byte ratio is more robust to data changes.
- We should augment SHOW TABLE STATS to display the stored #rows as well as the extrapolated #rows.
- We should have some way of reporting the stored rows/byte ratio for debugging purposes (maybe SHOW TABLE STATS or EXPLAIN?)
- A table could have mixed formats
- Even if a table has the same format, files could be compressed differently
- It seems reasonable to ignore these issues in the first cut
- Estimate statistics if there are no stats at all, e.g. purely based on file size without knowing any #rows
- Extrapolate column stats like NDV in a similar fashion. That is a much more invasive change with a smaller impact.
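A minimal sketch of the rows/byte extrapolation described above, written in Python for illustration only. The names `PartitionStats`, `compute_rows_per_byte` and `extrapolate_rows` are hypothetical and not part of Impala. It also assumes the ratio is simply total rows over total bytes across the partitions that had a #rows stat when stats were computed; the actual aggregation is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PartitionStats:
    """Hypothetical snapshot of one partition when stats were computed."""
    num_bytes: int
    num_rows: Optional[int] = None  # None means the #rows stat is missing


def compute_rows_per_byte(partitions: Sequence[PartitionStats]) -> Optional[float]:
    """Derive a table-level rows/byte ratio from partitions that have #rows.

    Assumption: the ratio is total rows / total bytes over those partitions.
    Returns None if no partition contributes usable stats.
    """
    total_rows = 0
    total_bytes = 0
    for p in partitions:
        if p.num_rows is not None and p.num_bytes > 0:
            total_rows += p.num_rows
            total_bytes += p.num_bytes
    if total_bytes == 0:
        return None
    return total_rows / total_bytes


def extrapolate_rows(current_bytes: int, rows_per_byte: Optional[float]) -> Optional[int]:
    """Estimate #rows from the current file size of a partition.

    Applied to every partition, whether or not it has a stored #rows,
    since the stored value may be stale.
    """
    if rows_per_byte is None:
        return None
    return round(current_bytes * rows_per_byte)


# Two partitions had stats at compute time; a third had none.
snapshot = [
    PartitionStats(num_bytes=4000, num_rows=1000),
    PartitionStats(num_bytes=12000, num_rows=3000),
    PartitionStats(num_bytes=8000),  # missing #rows
]
ratio = compute_rows_per_byte(snapshot)  # 4000 rows / 16000 bytes = 0.25
estimate = extrapolate_rows(20000, ratio)  # 20000 bytes * 0.25 = 5000 rows
```

Because the estimate depends only on the current byte count, it tracks files being added or dropped without rerunning COMPUTE STATS, at the cost of ignoring the mixed-format and compression differences noted above.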
Document changes to row estimates