HBASE-5123

Provide more aggregate functions for Aggregations Protocol

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Royston requested the following aggregates on top of what we already have:
      Median, Weighted Median, Mult

      See discussion entitled 'AggregateProtocol Help' on user list

        Activity

        Tom Wilcox added a comment -

        SumProduct is probably another useful one.
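One way to read SumProduct in a coprocessor setting: each region computes the sum of a[i] * b[i] over its own rows, and the client adds up the partial sums. The sketch below assumes that reading. The names `SumProductSketch`, `partial`, and `merge` are hypothetical and not part of the existing AggregationClient API; plain arrays stand in for the rows a region would scan.

```java
import java.util.List;

/** Hypothetical sketch: SumProduct as a per-region partial result plus a client-side merge. */
final class SumProductSketch {

    /** Partial result for one region: sum of a[i] * b[i] over that region's rows. */
    static long partial(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("columns must have the same number of rows");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            // multiplyExact/addExact throw ArithmeticException on overflow instead of wrapping
            sum = Math.addExact(sum, Math.multiplyExact(a[i], b[i]));
        }
        return sum;
    }

    /** Client-side merge: SumProduct is additive, so partial results are simply summed. */
    static long merge(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total = Math.addExact(total, p);
        }
        return total;
    }
}
```

Because the result is additive across regions, it fits the same partial-then-merge shape as sum and count.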

        Ted Yu added a comment -

        Haven't figured out how Mult is computed. Let me start with median.

        Royston Sellman added a comment -

        Re: 5123 I have also had some time to think about other aggregation functions (Please be aware that I am new to HBase, Coprocessors, and the Aggregation Protocol and I have little knowledge of distributed numerical algorithms!). It seems to me the pattern in AP is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.

        So, common SVSC aggregation functions:
        currently supported:
        min
        max
        sum
        count
        avg (arithmetic mean)
        std

        not currently supported:
        median
        mode
        quantile/ntile
        mult/product

        for column values of all numeric types, returning values of that type. Current support is only for Long type.
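Unlike sum or count, the median, mode and quantiles cannot be obtained by combining a single number per region. One exact (if memory-hungry) approach is for each region to return a value-to-count map, which the client merges and then walks in sorted order. The sketch below assumes that approach and a nearest-rank quantile definition; all class and method names are hypothetical, and it does not reflect how the Aggregation Protocol is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: median, quantile and mode from merged per-region frequency maps. */
final class FrequencyMergeSketch {

    /** Merge per-region value -> count maps into one map sorted by value. */
    static TreeMap<Long, Long> merge(List<Map<Long, Long>> perRegion) {
        TreeMap<Long, Long> merged = new TreeMap<>();
        for (Map<Long, Long> region : perRegion) {
            for (Map.Entry<Long, Long> e : region.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }

    /**
     * Nearest-rank quantile: the smallest value whose cumulative count
     * reaches ceil(q * n). q = 0.5 gives the lower median.
     */
    static long quantile(TreeMap<Long, Long> counts, double q) {
        if (counts.isEmpty() || !(q > 0.0 && q <= 1.0)) {
            throw new IllegalArgumentException("need non-empty counts and 0 < q <= 1");
        }
        long n = 0;
        for (long c : counts.values()) {
            n += c;
        }
        long rank = (long) Math.ceil(q * n);
        long seen = 0;
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            seen += e.getValue();
            if (seen >= rank) {
                return e.getKey();
            }
        }
        return counts.lastKey(); // only reached if rounding pushed rank past n
    }

    /** Mode: the value with the highest count; ties resolve to the smallest value. */
    static long mode(TreeMap<Long, Long> counts) {
        if (counts.isEmpty()) {
            throw new IllegalArgumentException("need non-empty counts");
        }
        long best = counts.firstKey();
        long bestCount = 0;
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

The maps grow with the number of distinct values, so for high-cardinality columns an approximate summary would likely be needed instead.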

        Some thoughts on the future possibilities:
        An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions, i.e. a column of weights is applied to the column of values and the aggregate is then derived from the weighted values.
        (note: there is a very good description of Weighted Median in the R language documentation:
        http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
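As a rough illustration of the weighted case, the sketch below computes a lower weighted median: the smallest value at which the cumulative weight reaches half of the total weight. That definition is an assumption made here; the R function linked above offers interpolation and tie-handling options that this sketch does not reproduce. `WeightedMedianSketch` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical sketch: lower weighted median over a value column and a weight column. */
final class WeightedMedianSketch {

    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need equally sized, non-empty columns");
        }
        double total = 0;
        for (double w : weights) {
            if (w < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += w;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("total weight must be positive");
        }

        // Sort row indices by value so each weight stays attached to its value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        // Walk in value order until half of the total weight is covered.
        double cumulative = 0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= total / 2) {
                return values[idx];
            }
        }
        return values[order[order.length - 1]]; // guards against rounding in the sum
    }
}
```

With equal weights this reduces to the ordinary lower median.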

        An example of future MULTIPLE value SINGLE column could be range: return all rows with a column value between two values. Maybe this is a bad example because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example? i.e. return an array containing values derived from applying one of the SVSC functions to a binned column e.g:
        int bins = 100;
        aClient.sum(table, ci, scan, bins); => {12.3, 14.5...}
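The binning idea can be sketched in plain Java. This assumes the bins are equal-width intervals over a caller-supplied value range [min, max], which the comment above does not specify; `BinnedSumSketch` and `binnedSum` are hypothetical names and say nothing about what a real `aClient.sum(..., bins)` overload would look like.

```java
/** Hypothetical sketch: per-bin sums over equal-width value bins. */
final class BinnedSumSketch {

    /** Sum the values that fall into each of `bins` equal-width intervals spanning [min, max]. */
    static double[] binnedSum(double[] column, double min, double max, int bins) {
        if (bins <= 0 || !(max > min)) {
            throw new IllegalArgumentException("need bins > 0 and max > min");
        }
        double[] sums = new double[bins];
        double width = (max - min) / bins;
        for (double v : column) {
            if (!(v >= min && v <= max)) {
                continue; // values outside the range (and NaN) are ignored
            }
            // v == max would land one past the end, so clamp it into the last bin
            int idx = Math.min(bins - 1, (int) ((v - min) / width));
            sums[idx] += v;
        }
        return sums;
    }
}
```

If every region uses the same bin boundaries, the per-region arrays can be merged on the client by adding them element-wise.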

        Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, again this may be a bad example in the case of long HBase columns but it seems like an appropriate thing to do with coprocessors.

        MULTIPLE value MULTIPLE column examples are common in spatial data processing but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet. So I'll keep quiet for now.

        I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor enabled/regionserver distributed) HBase.

        Ted Yu added a comment - edited

        AggregationClient.std() already provides support for standard deviation.

        haosdent added a comment -

        How about providing a way for users to define custom functions?
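One possible shape for user-defined functions is an interface with four steps: create an empty state, fold a value into it, merge two states, and produce the final result. The sketch below is only an illustration of that idea; `UserAggregate`, `CustomAggregateSketch`, and `PRODUCT` are hypothetical names, nothing here is an existing HBase interface, and arrays stand in for regions. The example aggregate is a product, on the assumption that this is what "Mult" means.

```java
import java.util.List;

/** Hypothetical contract for a user-defined aggregate with state S and result R. */
interface UserAggregate<S, R> {
    S init();                          // empty state
    S accumulate(S state, long value); // fold one column value into the state
    S merge(S left, S right);          // combine two partial states
    R finish(S state);                 // turn the final state into the result
}

final class CustomAggregateSketch {

    /** Simulates the flow: one partial state per region, merged on the client. */
    static <S, R> R run(UserAggregate<S, R> agg, List<long[]> regions) {
        S total = agg.init();
        for (long[] region : regions) {
            S partial = agg.init();
            for (long v : region) {
                partial = agg.accumulate(partial, v);
            }
            total = agg.merge(total, partial);
        }
        return agg.finish(total);
    }

    /** Example aggregate: product of all values, failing loudly on overflow. */
    static final UserAggregate<Long, Long> PRODUCT = new UserAggregate<Long, Long>() {
        public Long init() { return 1L; }
        public Long accumulate(Long state, long value) {
            return Math.multiplyExact(state.longValue(), value);
        }
        public Long merge(Long left, Long right) {
            return Math.multiplyExact(left.longValue(), right.longValue());
        }
        public Long finish(Long state) { return state; }
    };
}
```

The merge step is what makes a function usable across regions; aggregates without a compact mergeable state (an exact median, for example) would need a larger state such as a frequency map.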


          People

          • Assignee: Unassigned
          • Reporter: Ted Yu
          • Votes: 1
          • Watchers: 3
