Details

Type: New Feature

Status: Open

Priority: Major

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: None

Component/s: Query Processor, Statistics

Labels:
Description
Add commands to collect partition and column level statistics in hive.
Issue Links
 incorporates

HIVE1361 table/partition level statistics
 Closed

HIVE1362 Optimizer statistics on columns in tables and partitions
 Closed
 relates to

HIVE2472 Metastore statistics are not being updated for CTAS queries.
 Closed

HIVE1940 Query Optimization Using Column Statistics and Histograms
 Resolved

HIVE1938 Cost Based Query optimization for Joins in Hive
 Open

HIVE3027 The optimizer architecture of Hive is terrible, need code refactoring
 Open

HIVE3917 Support noscan operation for analyze command
 Closed
Type of statistics:
The following types of statistics can be collected on hive partitions >
For each partition of the table:
1. Number of Rows
2. Size of the partition
3. Average size of a row
4. Number of blocks
For a column in the partition:
1. Number of distinct values
2. Number of null values
3. minimum 3 values
4. maximum 3 values
5. Histogram: Frequency histogram or a height balanced histogram (the former has equi range bins while the later has the same height for all the bins)
The column level statistics could also be calculated for distributions in an average block
Language Elements:
ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS  this computes the partition level statistics
ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR ALL COLUMNS SIZE n  this computes the column level statistics for all columns with n being the number of bins in the historgram
ANALYZE TABLE <t> PARTITION(<partitionspec>) COMPUTE STATISTICS FOR COLUMNS SIZE m c1 SIZE n1, c2 SIZE n2, c3  this computes the column level statistics for columns c1 (using n1 bins for the histogram), c2(using n2 bins) and c3 (using the defaut m bins)
We can later extend these so that these commands can work on samples and be able to extrapolate the results to the entire data set. For that we could use the ESTIMATE STATISTICS SAMPLE n ROWS or ESTIMATE STATISTICS SAMPLE n%
e.g.
ANALYZE TABLE <t> PARTITION(<partitionspec>) ESTIMATE STATISTICS 10%
More details on the actual implementation to follow...