Description
Equi-height histograms are effective at handling skewed data distributions.
In an equi-height histogram, all bins (intervals) have the same height, i.e. each bin holds roughly the same number of rows. The default number of bins we use is 254.
Currently we use a two-step method to generate an equi-height histogram:
1. use percentile_approx to get percentiles (end points of the equi-height bin intervals);
2. use a new aggregate function to get distinct counts in each of these bins.
Note that this method requires two table scans. In the future we may provide other algorithms that need only one table scan.
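The two-step method described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark implementation: exact percentiles from a sorted list stand in for percentile_approx, a per-bin set stands in for the new aggregate function, and the function names and the bin boundary convention are hypothetical.

```python
from bisect import bisect_left


def equi_height_endpoints(values, num_bins):
    """Step 1: pick bin end points from percentiles of the data.

    Assumption: exact percentiles from a full sort are used here in place of
    percentile_approx, which returns approximate percentiles.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    # num_bins + 1 end points: the minimum, the interior percentiles, the maximum.
    return [ordered[(i * last) // num_bins] for i in range(num_bins + 1)]


def distinct_counts_per_bin(values, endpoints):
    """Step 2: count the distinct values that fall into each bin.

    Assumption: bin i covers (endpoints[i], endpoints[i + 1]], and the first
    bin also includes its lower end point.
    """
    num_bins = len(endpoints) - 1
    seen = [set() for _ in range(num_bins)]
    for v in values:
        # The first end point >= v is the upper end of the bin that holds v.
        idx = bisect_left(endpoints, v) - 1
        # Clamp so the minimum value lands in the first bin.
        idx = min(max(idx, 0), num_bins - 1)
        seen[idx].add(v)
    return [len(s) for s in seen]


if __name__ == "__main__":
    # Skewed sample: most values are small, a few are very large.
    data = [1, 2, 3, 4, 5, 6, 7, 8, 100, 200, 300, 1000]
    endpoints = equi_height_endpoints(data, num_bins=3)   # first pass over the data
    ndv = distinct_counts_per_bin(data, endpoints)        # second pass over the data
    print(endpoints, ndv)
```

The sketch reads the data twice, once to find the end points and once to count distinct values, which mirrors the two table scans mentioned in the note. With skewed input the bins come out with very different widths but the same number of rows each.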
Issue Links
- blocks
  - SPARK-21322 support histogram in filter cardinality estimation (Resolved)
  - SPARK-21984 Use histogram stats in join estimation (Resolved)
- incorporates
  - SPARK-18000 Aggregation function for computing bins (distinct value, count) pairs for equi-width histograms (Closed)
  - SPARK-17881 Aggregation function for generating string histograms (Closed)
- is blocked by
  - SPARK-17997 Aggregation function for counting distinct values for multiple intervals (Resolved)
  - SPARK-22100 Make percentile_approx support date/timestamp type and change the output type to be the same as input type (Resolved)