Thanks for your comments, Julien and Daniel!
All, please find attached the revised patch, per your notes.
- I added comments
- I added a basic heuristic to apply the intermediate EvalFunc in cases where applying it gives a useful reduction in size.
- I added PigCounterHelper to Pig from ElephantBird. Pig is a more reasonable place for it to live, and it is useful: it makes it easier to report counters to Pig from UDFs. I use it to collect stats on the combining activity when an Algebraic UDF is used as an Accumulator.
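To make the heuristic and the stats collection a bit more concrete, here is a minimal, self-contained sketch of the idea. It is not the code in the patch: the class name, the BUFFER_LIMIT and MIN_REDUCTION values, and the counter names are all assumptions made up for illustration, and a plain map stands in for the counters that a helper such as PigCounterHelper would report. The intermediate step here is a simple de-duplication, which may or may not shrink the buffer.

```java
// Hypothetical sketch, not the patch: every name and threshold here is an assumption.
final class CombineHeuristicSketch {
    static final int BUFFER_LIMIT = 4;         // try to combine once this many values are buffered
    static final double MIN_REDUCTION = 0.25;  // keep combining only if at least 25% is removed

    private java.util.List<String> buffer = new java.util.ArrayList<>();
    private boolean eagerCombine = true;

    // Stand-in for job counters; a real UDF would report these through a counter helper.
    final java.util.Map<String, Long> counters = new java.util.TreeMap<>();

    void accumulate(String value) {
        buffer.add(value);
        if (eagerCombine && buffer.size() >= BUFFER_LIMIT) {
            int before = buffer.size();
            buffer = intermediate(buffer);
            int after = buffer.size();
            count("intermediate_calls", 1);
            count("values_removed", before - after);
            // Heuristic: stop combining early once it no longer shrinks the buffer enough.
            if (before - after < MIN_REDUCTION * before) {
                eagerCombine = false;
                count("eager_combine_disabled", 1);
            }
        }
    }

    // Final step: combine whatever is still buffered and return it.
    java.util.List<String> result() {
        return intermediate(buffer);
    }

    // Intermediate step of this toy function: drop duplicates, keep first-seen order.
    private static java.util.List<String> intermediate(java.util.List<String> values) {
        return new java.util.ArrayList<>(new java.util.LinkedHashSet<>(values));
    }

    private void count(String name, long delta) {
        counters.merge(name, delta, Long::sum);
    }
}
```

With heavily duplicated input the intermediate step keeps paying off and stays enabled; with all-distinct input the first attempt removes nothing, so the sketch stops combining early and only combines once at the end.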
Also, Daniel, I did some benchmarking per Dmitriy's comment, and I don't know that it's appreciably slower. On 1M bags, here is a benchmark on the accumulator piece:
AlgSum 14.9 ============================
AlgCount 15.9 ==============================
Sum 13.7 =========================
Count 13.4 =========================
AlgSum and AlgCount are just versions of AlgebraicEvalFunc that return the static classes from LongSum and COUNT, but in this benchmark I called accumulate. I did it this way because accumulate is where the function-call overhead is going to be largest.
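For anyone who wants to see where the extra function calls come from, here is a rough sketch of an algebraic function being driven through accumulate. The interface and class names are hypothetical stand-ins rather than Pig's actual classes, and plain long values replace tuples and bags to keep it short. Each value goes through the initial stage and is then folded into the running partial result with the intermediate stage, which is more calls per value than a hand-written accumulator that simply adds.

```java
// Hypothetical stand-in for the three stages of an algebraic function.
interface AlgebraicStages {
    long initial(long value);              // per-input work
    long intermed(long left, long right);  // combines two partial results
    long finish(long partial);             // turns the last partial into the answer
}

// Toy equivalents of a sum and a count, written only for this sketch.
final class SketchStages {
    static final AlgebraicStages SUM = new AlgebraicStages() {
        public long initial(long value) { return value; }
        public long intermed(long left, long right) { return left + right; }
        public long finish(long partial) { return partial; }
    };
    static final AlgebraicStages COUNT = new AlgebraicStages() {
        public long initial(long value) { return 1L; }
        public long intermed(long left, long right) { return left + right; }
        public long finish(long partial) { return partial; }
    };
}

// Drives the algebraic stages from an accumulate/getValue/cleanup style interface.
final class AlgebraicAsAccumulator {
    private final AlgebraicStages stages;
    private long partial;
    private boolean seen;
    long stageCalls;  // number of initial/intermed calls made inside accumulate

    AlgebraicAsAccumulator(AlgebraicStages stages) {
        this.stages = stages;
    }

    void accumulate(long[] batch) {
        for (long value : batch) {
            long init = stages.initial(value);
            stageCalls++;
            if (seen) {
                partial = stages.intermed(partial, init);
                stageCalls++;
            } else {
                partial = init;
                seen = true;
            }
        }
    }

    // Returns null when nothing has been accumulated.
    Long getValue() {
        return seen ? stages.finish(partial) : null;
    }

    void cleanup() {
        partial = 0L;
        seen = false;
        stageCalls = 0L;
    }
}
```

In this sketch four input values cost seven stage calls inside accumulate, versus four additions for a direct sum, which is the kind of overhead the benchmark is meant to expose.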
As you can see, the falloff is small (AlgSum and AlgCount come in roughly 9% and 19% above Sum and Count, respectively), so I don't think a big disclaimer is necessary (any more than it's necessary to say that Jython UDFs are slower than Java UDFs or whatnot).
For the accumulator eval func, there is no overhead, and a lot of people I know basically do the same thing by hand when they implement accumulator UDFs anyway.
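As an illustration of the manual pattern mentioned above, here is a minimal sketch in which the regular exec path simply replays the accumulator calls once over the whole input. The class names are made up for this example and long arrays stand in for bags; it is meant to show the shape of the pattern, not the exact code in the patch.

```java
// Hypothetical base class: exec is defined once in terms of the accumulator methods.
abstract class AccumulatorStyleFunc<T> {
    abstract void accumulate(long[] batch);
    abstract T getValue();
    abstract void cleanup();

    // The non-accumulating entry point feeds the whole input through accumulate,
    // reads the value, and resets state, so no logic is written twice.
    final T exec(long[] wholeBag) {
        accumulate(wholeBag);
        T result = getValue();
        cleanup();
        return result;
    }
}

// Example subclass: only the accumulator methods need to be written.
final class MaxFunc extends AccumulatorStyleFunc<Long> {
    private Long max;  // null until a value has been seen

    void accumulate(long[] batch) {
        for (long value : batch) {
            if (max == null || value > max) {
                max = value;
            }
        }
    }

    Long getValue() {
        return max;
    }

    void cleanup() {
        max = null;
    }
}
```

Since exec adds only the getValue and cleanup calls on top of a single accumulate call, there is no meaningful extra cost over writing exec by hand.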