Re: 5123. I have also had some time to think about other aggregation functions. (Please be aware that I am new to HBase, coprocessors, and the Aggregation Protocol, and I have little knowledge of distributed numerical algorithms!) It seems to me that the pattern in the Aggregation Protocol (AP) is to return a SINGLE value from a SINGLE column (CF:CQ) of a table. In future one might wish to extend AP to return MULTIPLE values from MULTIPLE columns, so it is good to keep this in mind for the SINGLE value/SINGLE column (SVSC) case.

So, common SVSC aggregation functions:

currently supported:

min

max

sum

count

avg (arithmetic mean)

std

not currently supported:

median

mode

quantile/ntile

mult/product

These should work for column values of all numeric types and return values of that type. Current support is only for the Long type.

Some thoughts on the future possibilities:

An example of a future SINGLE value MULTIPLE column use case could be weighted versions of the above functions, i.e. a column of weights is applied to the column of values and the aggregate is then derived from the weighted values.

(note: there is a very good description of Weighted Median in the R language documentation:

http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html)
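
To make the weighted case concrete, here is a minimal sketch of a weighted median over two in-memory arrays standing in for the value column and the weight column. The class and method names are hypothetical and not part of AP, and I am assuming the simple non-interpolating ("lower") definition: the smallest value whose cumulative weight reaches half of the total weight. It runs in a single JVM and does not attempt to address how this would be distributed across regions.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of HBase or the Aggregation Protocol.
final class WeightedMedianSketch {

    // Returns the smallest value whose cumulative weight (in value order)
    // reaches at least half of the total weight. No interpolation is done.
    static double weightedMedian(double[] values, double[] weights) {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("need non-empty arrays of equal length");
        }
        double total = 0.0;
        for (double w : weights) {
            if (w < 0.0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            total += w;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }

        // Sort row indices by value so each weight stays attached to its value.
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

        double cumulative = 0.0;
        for (int idx : order) {
            cumulative += weights[idx];
            if (cumulative >= total / 2.0) {
                return values[idx];
            }
        }
        // Only reachable through floating-point rounding; fall back to the largest value.
        return values[order[order.length - 1]];
    }
}
```

Note that the sort needs all (value, weight) pairs in one place, which is part of why median-like functions look harder to distribute than sum or count.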

An example of a future MULTIPLE value SINGLE column use case could be range: return all rows with a column value between two values. Maybe this is a bad example, because there could be better HBase ways to do it with filters/scans at a higher level. Perhaps binning is a better example, i.e. return an array containing values derived from applying one of the SVSC functions to a binned column, e.g.:

int bins = 100;

aClient.sum(table, ci, scan, bins); =>

{12.3, 14.5...}
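
A rough sketch of what the binned sum might compute, using a plain array in place of the column. Everything here is an assumption on my part: the names are hypothetical, the bins are equal-width over an explicit [min, max] range (the call above only passes the number of bins, so the range would have to come from somewhere), and values outside the range are skipped. The merge step is there because each region could return its own array of partial sums, which the client would then add element-wise.

```java
// Hypothetical helper, not part of HBase or the Aggregation Protocol.
final class BinnedSumSketch {

    // Sums the values falling into each of `bins` equal-width bins over [min, max].
    // Values outside the range (and NaN) are ignored.
    static double[] binnedSum(double[] values, double min, double max, int bins) {
        if (bins <= 0 || !(max > min)) {
            throw new IllegalArgumentException("need bins > 0 and max > min");
        }
        double[] sums = new double[bins];
        double width = (max - min) / bins;
        for (double v : values) {
            if (!(v >= min && v <= max)) {
                continue;
            }
            // Clamp so that v == max lands in the last bin rather than one past it.
            int bin = Math.min((int) ((v - min) / width), bins - 1);
            sums[bin] += v;
        }
        return sums;
    }

    // Combines two partial results (e.g. from two regions) element-wise.
    static double[] merge(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("partial results must have the same number of bins");
        }
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }
}
```

This only works if every region uses the same bin boundaries, so the range would have to be fixed up front or found in a first pass with min and max.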

Another example (common in several programming languages) is to map an arbitrary function over a column and return the new vector. Of course, this may again be a bad example in the case of long HBase columns, but it seems like an appropriate thing to do with coprocessors.
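
For illustration only, the map idea on an in-memory array of longs (Long being the only type AP supports today). The class and method names are hypothetical, and this says nothing about how the function would be shipped to the region servers.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

// Hypothetical helper, not part of HBase or the Aggregation Protocol.
final class ColumnMapSketch {

    // Applies f to every value and returns a new array of the same length.
    static long[] map(long[] column, LongUnaryOperator f) {
        return Arrays.stream(column).map(f).toArray();
    }
}
```

For example, ColumnMapSketch.map(new long[]{1, 2, 3}, x -> x * x) would give the squared column. The result has one entry per input row, which is where the concern about long columns comes from.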

MULTIPLE value MULTIPLE column examples are common in spatial data processing, but I see there has been a lot of spatial/GIS discussion around HBase which I have not read yet, so I'll keep quiet for now.

I hope these thoughts strike a balance between my (special interest) use case of statistical/spatial functions on tables and general purpose (but coprocessor enabled/regionserver distributed) HBase.

Some of this is already done; the rest is left open.