HIVE-1397: histogram() UDAF for a numerical column


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.6.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      A histogram() UDAF that generates an approximate histogram of a numerical (byte, short, double, long, etc.) column. The result is returned as a map of (x,y) histogram pairs, which can be plotted in Gnuplot, for example with the impulses style. The algorithm is adapted from "A streaming parallel decision tree algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space proportional to the number of histogram bins specified. It has no approximation guarantees, but seems to work well when there is a lot of data and a large number of bins (e.g. 50-100) is specified.

      A typical call might be:

      SELECT histogram(val, 10) FROM some_table;

      where the result would be a histogram with 10 bins, returned as a Hive map object.
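
For readers unfamiliar with the cited paper, the following is a minimal Python sketch of the general Ben-Haim and Tom-Tov streaming histogram idea: keep at most a fixed number of (centroid, count) bins, and whenever the budget is exceeded, merge the two adjacent bins whose centroids are closest into their count-weighted average. It is not the code from the attached patches. The function names `update`, `merge`, and `_shrink` are hypothetical, and the list-of-tuples representation is an assumption made for illustration (the UDAF itself returns a Hive map).

```python
from bisect import insort


def _shrink(bins, max_bins):
    """Merge the closest adjacent pair of bins until at most max_bins remain."""
    while len(bins) > max_bins:
        # Index of the adjacent pair with the smallest gap between centroids.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (x1, y1), (x2, y2) = bins[i], bins[i + 1]
        total = y1 + y2
        # Replace the pair with one bin at the count-weighted mean centroid.
        bins[i:i + 2] = [((x1 * y1 + x2 * y2) / total, total)]
    return bins


def update(bins, value, max_bins):
    """Add one observation to a sorted list of (centroid, count) bins."""
    insort(bins, (float(value), 1))
    return _shrink(bins, max_bins)


def merge(a, b, max_bins):
    """Combine two partial histograms, e.g. from different workers."""
    return _shrink(sorted(a + b), max_bins)


if __name__ == "__main__":
    hist = []
    for v in [1, 2, 3, 10, 11]:
        update(hist, v, max_bins=3)
    for x, y in hist:
        print(x, y)
```

Space use is bounded by `max_bins` regardless of how many values are streamed in, which matches the space claim in the description. The `merge` step is what lets partial histograms built in parallel be combined into one.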

      Attachments

        1. HIVE-1397.2.patch
          28 kB
          Mayank Lahiri
        2. HIVE-1397.1.patch
          25 kB
          Mayank Lahiri
        3. Histogram_quality.png.jpg
          31 kB
          Mayank Lahiri

          People

            Assignee: Mayank Lahiri (mayanklahiri)
            Reporter: Mayank Lahiri (mayanklahiri)
            Votes: 0
            Watchers: 3
