Details
Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
0.6.0
None
Reviewed
Description
A histogram() UDAF that generates an approximate histogram of a numerical (byte, short, double, long, etc.) column. The result is returned as a map of (x,y) histogram pairs, which can be plotted, for example, in Gnuplot using impulses. The algorithm is currently adapted from "A streaming parallel decision tree algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space proportional to the number of histogram bins specified. It has no approximation guarantees, but seems to work well when there is a lot of data and a large number of histogram bins (e.g. 50-100) is specified.
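For readers unfamiliar with the cited paper, the following is a minimal Python sketch of the streaming update step it describes: keep at most a fixed number of (centroid, count) bins, and whenever a new value pushes the bin count over the limit, merge the two adjacent bins whose centroids are closest. This is an illustration only, not the UDAF's actual code; the class name StreamingHistogram and its methods are hypothetical, and the real implementation may differ in details such as tie-breaking and how partial histograms are combined.

```python
import bisect


class StreamingHistogram:
    """Illustrative fixed-size histogram in the style of Ben-Haim and Tom-Tov.

    Hypothetical helper for explanation only; not the Hive implementation.
    """

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, value):
        value = float(value)
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        # An existing bin with exactly this centroid just gets a higher count.
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1
            return
        # Otherwise the value starts its own bin, kept in sorted order.
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap between centroids.
        j = min(
            range(len(self.bins) - 1),
            key=lambda k: self.bins[k + 1][0] - self.bins[k][0],
        )
        (x1, c1), (x2, c2) = self.bins[j], self.bins[j + 1]
        total = c1 + c2
        # Replace the pair with one bin at the count-weighted mean centroid.
        self.bins[j:j + 2] = [[(x1 * c1 + x2 * c2) / total, total]]

    def as_map(self):
        # Mirrors the (x, y) pairs described above: centroid -> count.
        return {x: c for x, c in self.bins}
```

Memory stays proportional to max_bins because the list never holds more than max_bins + 1 entries at any moment, which matches the space claim above.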
A typical call might be:
SELECT histogram(val, 10) FROM some_table;
where the result would be a histogram with 10 bins, returned as a Hive map object.
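As a rough illustration of the plotting step, the sketch below turns such a map into whitespace-separated "x y" rows that Gnuplot can read. It assumes the Hive map has already been loaded into a Python dict by some client; the function name to_gnuplot_rows and the file name hist.dat are hypothetical.

```python
def to_gnuplot_rows(hist):
    """Format a {x: y} histogram map as 'x y' lines sorted by x."""
    return "\n".join(f"{x} {y}" for x, y in sorted(hist.items()))


if __name__ == "__main__":
    # Made-up sample values, used only to show the output format.
    sample = {10.0: 1, 1.5: 2}
    print(to_gnuplot_rows(sample))
```

Saving the output to a file such as hist.dat and running `plot 'hist.dat' with impulses` in Gnuplot should draw one vertical line per histogram bin.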