Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I have a script which takes in a command line parameter.

      pig -p number=100 script.pig
      

      The script contains the following parameters:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
      

      Realistic use cases of SAMPLE require statisticians to calculate SAMPLE data on demand.

      Ideally I would like to calculate SAMPLE from within Pig script without having to run one Pig script first get it's results and another to pass the results.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
      

      Viraj

      Change this Jira to only track sampling algorithm. PIG-1926 is opened to track limit/sample taking scalar.

      This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                viraj Viraj Bhat
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated: