Details
- Improvement
- Status: Closed
- Major
- Resolution: Fixed
- 4.0.0-alpha-2
Description
Hive does not support histogram statistics, which are particularly useful for skewed data (which is very common in practice) and range predicates.
Hive's current selectivity estimation for range predicates is based on a hard-coded value of 1/3 (see FilterSelectivityEstimator.java#L138-L144).
The current proposal aims at integrating histograms as an additional column statistic, stored in the Hive metastore at the table (or partition) level.
The main requirements for histogram integration are the following:
- efficiency: the approach must scale and support billions of rows
- merge-ability: partition-level histograms have to be merged to form table-level histograms
- explicit and configurable trade-off between memory footprint and accuracy
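The merge-ability and memory/accuracy requirements can be illustrated with a minimal sketch. It deliberately uses plain equal-width bucket counts rather than KLL, purely as a stand-in: the function names `build_histogram` and `merge_histograms`, the value range and the bucket count are all hypothetical and not part of Hive.

```python
def build_histogram(values, lo, hi, num_buckets):
    """Count values into num_buckets equal-width buckets over [lo, hi].

    Assumes every value lies within [lo, hi]. num_buckets is the knob that
    trades memory footprint (more buckets) against accuracy (finer buckets).
    """
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that v == hi falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts


def merge_histograms(histograms):
    """Merge partition-level histograms that share the same bucket boundaries."""
    return [sum(bucket) for bucket in zip(*histograms)]


# Two hypothetical partitions of the same column.
partition_a = [1, 2, 2, 3, 50]
partition_b = [2, 4, 97, 99, 100]

hist_a = build_histogram(partition_a, 0, 100, 4)
hist_b = build_histogram(partition_b, 0, 100, 4)

# The table-level histogram is obtained without rescanning the rows.
table_hist = merge_histograms([hist_a, hist_b])
```

Merging the two partition-level summaries gives the same counts as building one histogram over all rows, which is the property the table-level statistics rely on. A real sketch such as KLL offers the same kind of merge, but with adaptive rather than fixed boundaries.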
Hive already integrates the KLL data sketch UDAFs. Data sketches are small, stateful programs that process massive data streams and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.
We propose to use KLL, and more specifically the cumulative distribution function (CDF), as the underlying data structure for our histogram statistics.
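To show why a CDF helps with range predicates, here is a minimal sketch that estimates selectivity as the difference of two CDF values and compares it with the fixed 1/3. It uses an exact empirical CDF over a small sorted list as a stand-in for the approximate CDF a KLL sketch would return; the helper names `cdf` and `range_selectivity` and the sample data are hypothetical.

```python
from bisect import bisect_right


def cdf(sorted_values, x):
    """Fraction of values <= x (exact empirical CDF, standing in for a sketch)."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def range_selectivity(sorted_values, low, high):
    """Estimated fraction of rows satisfying low < col <= high."""
    return cdf(sorted_values, high) - cdf(sorted_values, low)


# Skewed column: 90 rows hold the value 1, ten rows hold 10, 20, ..., 100.
column = sorted([1] * 90 + list(range(10, 101, 10)))

HARD_CODED = 1 / 3  # the fixed estimate mentioned in the description

# Predicate: 50 < col <= 100. Only 5 of the 100 rows qualify.
sel = range_selectivity(column, 50, 100)
print(f"CDF-based estimate: {sel:.2f}, hard-coded estimate: {HARD_CODED:.2f}")
```

On this skewed sample the CDF-based estimate is about 0.05, whereas the hard-coded value would overestimate the matching rows by more than a factor of six.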
The current proposal targets numeric data types (float, integer and numeric families) and temporal data types (date and timestamp).
Issue Links
- depends upon: HIVE-26243 Add vectorized implementation of the 'ds_kll_sketch' UDAF (Closed)
- is depended upon by: HIVE-26830 Update TPCDS30TB metastore dump with histograms (Open)
- is related to: HIVE-26313 Aggregate all column statistics into a single field in metastore (In Progress)
- relates to: HIVE-26772 Add support for specific column statistics to ANALYZE TABLE command (Open)
- supersedes: HIVE-26297 Refactoring ColumnStatsAggregator classes to reduce warnings (Resolved)
- links to