Pig / PIG-2651

Provide a much easier to use accumulator interface

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      This introduces a new interface, IteratingAccumulatorEvalFunc (that name is NOT final...). The cool thing about this patch is that it is built purely on top of the existing Accumulator code (well, it uses PIG-2066, but it could easily work without it). That is to say, it's an easier way to write accumulators without having to fork the Pig codebase.

      The downside is that the only way I am able to provide such a clean interface is by using a second thread. I need to explore any potential performance implications, but given that most of the easy-to-use Pig features have performance implications, I think that as long as we measure and document them, the much more usable interface is worth it. Plus, I don't think it will be too bad, as one thread does the heavy lifting while the other just ferries values in between. SUM could now be written as:

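      import java.io.IOException;
      import java.util.Iterator;

      import org.apache.pig.data.Tuple;

      public class SUM extends IteratingAccumulatorEvalFunc<Long> {
          // The iterator walks every tuple passed to the accumulator.
          public Long exec(Iterator<Tuple> it) throws IOException {
              long sum = 0;

              while (it.hasNext()) {
                  // Assumes field 0 is a non-null Long.
                  sum += (Long) it.next().get(0);
              }

              return sum;
          }
      }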

      Besides performance tests, I need to figure out how to properly test this sort of thing. I particularly welcome advice on that front.
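
      To make the second-thread idea above concrete, here is a minimal sketch of how an iterator-style exec() can sit on top of batch-style accumulate()/getValue() calls. It is not the code from the attached patches: the class name IteratingAccumulatorSketch, the queue capacity of 16, and the use of java.util.function.Function in place of an overridable exec() are all assumptions made for illustration, and it deliberately has no dependency on Pig so it can be run on its own.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical stand-in for the idea in this issue; not the patch's class. */
public class IteratingAccumulatorSketch<T, R> {
    // Sentinel object that tells the worker there is no more input.
    private static final Object END = new Object();

    // Bounded queue: the caller blocks when the worker falls behind.
    // Note that BlockingQueue implementations reject null elements.
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(16);
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Future<R> result;

    public IteratingAccumulatorSketch(Function<Iterator<T>, R> exec) {
        // The second thread runs the user's logic against a blocking iterator.
        Callable<R> task = () -> exec.apply(new QueueIterator());
        result = worker.submit(task);
    }

    /** Called once per batch of values; ferries them to the worker. */
    public void accumulate(List<T> batch) throws InterruptedException {
        for (T value : batch) {
            queue.put(value);
        }
    }

    /** Signals end of input, then waits for the worker's answer. */
    public R getValue() throws Exception {
        // Limitation of this sketch: if exec returns before draining its
        // input, put() can block once the queue is full.
        queue.put(END);
        try {
            return result.get();
        } finally {
            worker.shutdown();
        }
    }

    /** Iterator that blocks in hasNext() until a value or END arrives. */
    private final class QueueIterator implements Iterator<T> {
        private Object buffered;
        private boolean done;

        @Override
        public boolean hasNext() {
            if (done) {
                return false;
            }
            if (buffered == null) {
                try {
                    buffered = queue.take();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                }
                if (buffered == END) {
                    done = true;
                    buffered = null;
                }
            }
            return !done;
        }

        @Override
        @SuppressWarnings("unchecked")
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T value = (T) buffered;
            buffered = null;
            return value;
        }
    }

    public static void main(String[] args) throws Exception {
        IteratingAccumulatorSketch<Long, Long> sum = new IteratingAccumulatorSketch<>(it -> {
            long total = 0;
            while (it.hasNext()) {
                total += it.next();
            }
            return total;
        });
        sum.accumulate(Arrays.asList(1L, 2L, 3L)); // first batch
        sum.accumulate(Arrays.asList(4L, 5L));     // second batch
        System.out.println(sum.getValue());        // prints 15
    }
}
```

      One possible angle on the testing question: a harness like the main method above can feed the same values in differently sized batches (including none at all) and check that the result does not depend on how the input was split.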

      1. PIG-2651-2.patch
        14 kB
        Jonathan Coveney
      2. PIG-2651-1.patch
        31 kB
        Jonathan Coveney
      3. PIG-2651-0.patch
        16 kB
        Jonathan Coveney

        Issue Links

          Activity

          Jonathan Coveney created issue
          Jonathan Coveney made changes
          Field Original Value New Value
          Attachment PIG-2651-0.patch [ 12522542 ]
          Jonathan Coveney made changes
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jonathan Coveney made changes
          Link This issue incorporates PIG-2066 [ PIG-2066 ]
          Jonathan Coveney made changes
          Description Unchanged except for the code sample: `return long;` (original value) became `return sum;` (new value)
          Jonathan Coveney made changes
          Attachment PIG-2651-1.patch [ 12526464 ]
          Jonathan Coveney made changes
          Attachment PIG-2651-2.patch [ 12530459 ]
          Daniel Dai made changes
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s 0.10.1 [ 12320547 ]
          Resolution Fixed [ 1 ]
          Bill Graham made changes
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              Jonathan Coveney
            • Reporter:
              Jonathan Coveney
            • Votes:
              0
            • Watchers:
              2
