Pig / PIG-741

Add LIMIT as a statement that works in nested FOREACH

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None

    Description

      I'd like to compute the top 10 results in each group.

      The natural way to express this in Pig would be:

      A = load '...' using PigStorage() as (
          date: int,
          count: int,
          url: chararray
      );
      
      B = group A by ( date );
      
      C = foreach B {
          D = order A by count desc;
          E = limit D 10;
          generate
              FLATTEN(E);
      };
      
      dump C;
      

      Yeah, I could write a UDF / PiggyBank function to take the top n results. But since LIMIT already exists as a statement, it seems like it should also work in the nested foreach context.

      Example workaround code, using such a UDF (here called util.TOP):

      C = foreach B {
          D = order A by count desc;
          E = util.TOP(D, 10);
          generate
              FLATTEN(E);
      };
      
      dump C;
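
For reference, the per-group result that both snippets above aim for can be sketched in plain Python. This is only an illustration of the intended semantics (group by date, order by count descending, keep the first n, flatten), not of how Pig implements it. The function name `top_n_per_group` and the sample rows are hypothetical, and the order of rows with equal counts is an artifact of Python's stable sort rather than something Pig guarantees.

```python
from collections import defaultdict
from operator import itemgetter


def top_n_per_group(rows, n=10):
    """Return the n highest-count rows per date, flattened into one list.

    rows: iterable of (date, count, url) tuples, mirroring relation A.
    """
    # B = group A by (date)
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    result = []
    # Dates are sorted only to make this sketch's output deterministic.
    for date in sorted(groups):
        # D = order A by count desc
        ordered = sorted(groups[date], key=itemgetter(1), reverse=True)
        # E = limit D n; generate FLATTEN(E)
        result.extend(ordered[:n])
    return result


# Hypothetical sample data, not taken from the issue.
rows = [
    (20090401, 5, "a.example"),
    (20090401, 9, "b.example"),
    (20090401, 7, "c.example"),
    (20090402, 3, "d.example"),
]
top2 = top_n_per_group(rows, n=2)
```

With `n=2`, the lowest-count row for the first date is dropped, and the second date keeps its single row.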
      

      Attachments

        1. PIG-741.patch
          10 kB
          Alan Gates

          People

            Assignee: Alan Gates
            Reporter: David Ciemiewicz
            Votes: 0
            Watchers: 2
