Pig
  1. Pig
  2. PIG-2551

Create an AlgebraicEvalFunc and AccumulatorEvalFunc abstract class which gives you the lower levels for free

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Release Note:
      This was included in PIG-2317

      Description

      This is more of a win for the Algebraic interface than the Accumulator interface, but the idea is that if you implement the Algebraic interface, you should get Accumulator/EvalFunc for free, and if you implement Accumulator, you should get EvalFunc for free. The win of this is that in cases such as JRuby, you don't have to muck around doing this yourself...you have them implement the algebraic portion, and the rest comes free (that is where this came out of, but I feel like it is generally useful enough).

      The next piece of work I'd like to do is making an easier to implement way to make Algebraic UDFs, but then again, my to do is huge Would love thoughts on this. If it doesn't make it into Pig, it's still going to come in the JRuby stuff, so I thought it'd at least be worth having it separate, tested, and available to everyone.

      1. PIG-2551-0.patch
        12 kB
        Jonathan Coveney
      2. PIG-2551-1.patch
        19 kB
        Jonathan Coveney
      3. PIG-2551-2.patch
        21 kB
        Jonathan Coveney
      4. PIG-2551-3.patch
        22 kB
        Jonathan Coveney

        Issue Links

          Activity

          Hide
          Julien Le Dem added a comment -

          Thanks Jonathan
          Here are a few comments:

          • in AlgebraicEvalFunc.init()
            There is redundancy in checking for null and having the init boolean. Maybe the whole thing should just be wrapped in "if (!init) { ... }

            "

          • in AlgebraicEvalFunc.accumulate()
            intermedEvalFunc.exec() can also be called on intermediateDB (similar to Combiner on the reduce side). Depending on the size of intermediateDB, that could be useful as well. Possibly adding a threshold on the size of intermediateDB after which we call intermed on it?
          Show
          Julien Le Dem added a comment - Thanks Jonathan Here are a few comments: in AlgebraicEvalFunc.init() There is redundancy in checking for null and having the init boolean. Maybe the whole thing should just be wrapped in "if (!init) { ... } " in AlgebraicEvalFunc.accumulate() intermedEvalFunc.exec() can also be called on intermediateDB (similar to Combiner on the reduce side). Depending on the size of intermediateDB, that could be useful as well. Possibly adding a threshold on the size of intermediateDB after which we call intermed on it?
          Hide
          Daniel Dai added a comment -

          Generally it looks good. In addition to Julien's comment, it also needs header and javadoc. We need to provide guidance for user to use both functions. From the chain algebraic->accumulate->evalfunc, once user implements a functionality, he automatically get the functionality on the right side. Also need to document from performance point of view, this is not preferred implementation since the simulation would make extra function calls.

          Show
          Daniel Dai added a comment - Generally it looks good. In addition to Julien's comment, it also needs header and javadoc. We need to provide guidance for user to use both functions. From the chain algebraic->accumulate->evalfunc, once user implements a functionality, he automatically get the functionality on the right side. Also need to document from performance point of view, this is not preferred implementation since the simulation would make extra function calls.
          Hide
          Dmitriy V. Ryaboy added a comment -

          We should benchmark the performance loss on this. I suspect it's a rounding error.

          Show
          Dmitriy V. Ryaboy added a comment - We should benchmark the performance loss on this. I suspect it's a rounding error.
          Hide
          Jonathan Coveney added a comment -

          Thanks for your comments, Julien and Daniel!

          All, please find attached the revised patch, per your notes.

          • I added comments
          • I added a basic heuristic to apply the intermediate EvalFunc in cases where applying it gives a useful reduction in size.
          • I added PigCounterHelper to Pig from ElephantBird. It's a more reasonable place to live, and it is useful. This facilitates logging to Pig from UDFs. I use this to collect stats on the combining activity when an Algebraic UDF is used as an Accumulator.

          Also, Daniel, I did some benchmarking per Dmitriy's comment, and I don't know that it's appreciably slower. On 1M bags, here is a benchmark on the accumulator piece:

          AlgSum 14.9 ============================
          AlgCount 15.9 ==============================
          Sum 13.7 =========================
          Count 13.4 =========================

          AlgSum and AlgCount are just a version of AlgebraicEvalFunc that returns the static classes from LongSum and COUNT, but in this benchmark I called accumulate. The purpose of this is because it is in using accumulate that the function calling overhead is going to be largest.

          As you can see, the falloff is minimal, so I don't know that some big disclaimer is necessary (any more than it's necessary to say that Jython UDFs are slower than Java UDFs or whatnot).

          For the accumulator eval func, there is no overhead, and a lot of people I know when implementing accumulative UDFs basically do that manually as is.

          Show
          Jonathan Coveney added a comment - Thanks for your comments, Julien and Daniel! All, please find attached the revised patch, per your notes. I added comments I added a basic heuristic to apply the intermediate EvalFunc in cases where applying it gives a useful reduction in size. I added PigCounterHelper to Pig from ElephantBird. It's a more reasonable place to live, and it is useful. This facilitates logging to Pig from UDFs. I use this to collect stats on the combining activity when an Algebraic UDF is used as an Accumulator. Also, Daniel, I did some benchmarking per Dmitriy's comment, and I don't know that it's appreciably slower. On 1M bags, here is a benchmark on the accumulator piece: AlgSum 14.9 ============================ AlgCount 15.9 ============================== Sum 13.7 ========================= Count 13.4 ========================= AlgSum and AlgCount are just a version of AlgebraicEvalFunc that returns the static classes from LongSum and COUNT, but in this benchmark I called accumulate. The purpose of this is because it is in using accumulate that the function calling overhead is going to be largest. As you can see, the falloff is minimal, so I don't know that some big disclaimer is necessary (any more than it's necessary to say that Jython UDFs are slower than Java UDFs or whatnot). For the accumulator eval func, there is no overhead, and a lot of people I know when implementing accumulative UDFs basically do that manually as is.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Jon, describe the benchmark methodology?

          Show
          Dmitriy V. Ryaboy added a comment - Jon, describe the benchmark methodology?
          Hide
          Daniel Dai added a comment -

          Hi, Jon, we don't need a big disclaimer, but it's better to mention the fact the implementation is based on simulation, user can chose to implement a native version if he wants to optimize.

          Show
          Daniel Dai added a comment - Hi, Jon, we don't need a big disclaimer, but it's better to mention the fact the implementation is based on simulation, user can chose to implement a native version if he wants to optimize.
          Hide
          Jonathan Coveney added a comment -

          I have LongSum, COUNT, AlgSum, and AlgCount. AlgSum and AlgCount are just wrappers which extend AlgebraicEvalFunc, returning the static classes from LongSum and COUNT respectiviely (the purpose being that their Algebraic implementation is identical, so you're testing the overhead of the extra function calls in the Accumulator they give you).

          I then used Caliper to run a benchmark which instantiated each as an Accumulator<Long>, and ran it on a DataBag I streamed through it.

          See the code to set up:

          @Override protected void setUp() {
              try {
                  theBag = mBagFactory.newDefaultBag();
                  for (int i = 0; i < size; i++) {
                      Tuple t = mTupleFactory.newTuple(1);
                      t.set(0, i); 
                      theBag.add(t);
                  }   
              } catch (Exception e) {
                  throw new RuntimeException("Error in setup");
              }   
          }   
          
          

          See the code to run:

          public long go(Accumulator<Long> acc) {
              try {
                  Iterator<Tuple> it = theBag.iterator();
                  while (it.hasNext()) {
                      DataBag tempBag = mBagFactory.newDefaultBag();
                      for (int j = 0; it.hasNext() && j < perAcc; j++)
                          tempBag.add(it.next());
                      Tuple t = mTupleFactory.newTuple(1);
                      t.set(0, tempBag);
                      acc.accumulate(t);
                  }   
                  return acc.getValue();
              } catch (Exception e) {
                  throw new RuntimeException("Error in go");
              }   
          }  
          

          The parameter "perAcc" is how many elements will be streamed through the accumulate function at a time, and was set to 1000. The size was set to 1000000. There were 10 trials.

          Show
          Jonathan Coveney added a comment - I have LongSum, COUNT, AlgSum, and AlgCount. AlgSum and AlgCount are just wrappers which extend AlgebraicEvalFunc, returning the static classes from LongSum and COUNT respectiviely (the purpose being that their Algebraic implementation is identical, so you're testing the overhead of the extra function calls in the Accumulator they give you). I then used Caliper to run a benchmark which instantiated each as an Accumulator<Long>, and ran it on a DataBag I streamed through it. See the code to set up: @Override protected void setUp() { try { theBag = mBagFactory.newDefaultBag(); for ( int i = 0; i < size; i++) { Tuple t = mTupleFactory.newTuple(1); t.set(0, i); theBag.add(t); } } catch (Exception e) { throw new RuntimeException( "Error in setup" ); } } See the code to run: public long go(Accumulator< Long > acc) { try { Iterator<Tuple> it = theBag.iterator(); while (it.hasNext()) { DataBag tempBag = mBagFactory.newDefaultBag(); for ( int j = 0; it.hasNext() && j < perAcc; j++) tempBag.add(it.next()); Tuple t = mTupleFactory.newTuple(1); t.set(0, tempBag); acc.accumulate(t); } return acc.getValue(); } catch (Exception e) { throw new RuntimeException( "Error in go" ); } } The parameter "perAcc" is how many elements will be streamed through the accumulate function at a time, and was set to 1000. The size was set to 1000000. There were 10 trials.
          Hide
          Jonathan Coveney added a comment -

          Sounds fair, Daniel. Anything else I should do to get a +1? Thanks again

          Show
          Jonathan Coveney added a comment - Sounds fair, Daniel. Anything else I should do to get a +1? Thanks again
          Hide
          Daniel Dai added a comment -

          +1 as long as we have proper javadoc and header

          Show
          Daniel Dai added a comment - +1 as long as we have proper javadoc and header
          Hide
          Jonathan Coveney added a comment -

          Added more javadocs, and the headers should be in.

          Show
          Jonathan Coveney added a comment - Added more javadocs, and the headers should be in.
          Hide
          Daniel Dai added a comment -

          Still missing header for HelperEvalFuncUtils.java. Otherwise +1.

          Show
          Daniel Dai added a comment - Still missing header for HelperEvalFuncUtils.java. Otherwise +1.
          Hide
          Jonathan Coveney added a comment -

          Woops, thought I had gotten them all. Here you go!

          Show
          Jonathan Coveney added a comment - Woops, thought I had gotten them all. Here you go!
          Hide
          Jonathan Coveney added a comment -

          Does this look good to y'all now?

          Show
          Jonathan Coveney added a comment - Does this look good to y'all now?
          Hide
          Daniel Dai added a comment -

          +1

          Show
          Daniel Dai added a comment - +1

            People

            • Assignee:
              Jonathan Coveney
              Reporter:
              Jonathan Coveney
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development