Pig / PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    Description

      I have a script that takes a command-line parameter:

      pig -p number=100 script.pig
      

      The script uses the parameter as follows:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
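
      Parameter substitution is textual, so 1/$number reaches the parser as 1/100 rather than as a single literal such as 0.01. One possible stopgap is to pass the fraction itself. This is a minimal sketch only; the parameter name "fraction" is hypothetical and not part of the original script:

      ```pig
      -- Invoked as: pig -p fraction=0.01 script.pig
      -- "fraction" is a hypothetical parameter name used for illustration.
      A = load '/user/viraj/test' using PigStorage() as (a, b, c);

      -- After substitution this line reads: B = SAMPLE A 0.01;
      B = SAMPLE A $fraction;

      dump B;
      ```

      This only helps when the fraction is known before the script starts; it does not cover a fraction computed from the data.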
      

      In realistic use cases, statisticians need to compute the SAMPLE fraction on demand.

      Ideally, I would like to compute the sample fraction within a single Pig script, without having to run one script first to get its result and then pass that result to a second script.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
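
      For comparison, here is a sketch of how a data-derived fraction might look, assuming a Pig release in which SAMPLE accepts a scalar expression. The aliases G and C, the field name num_rows, and the target of roughly 1000 rows are all assumptions made for illustration:

      ```pig
      AA = load '/user/viraj/test' using PigStorage() as (a, b, c);

      -- Collapse the relation to a single row holding the total row count.
      G = group AA all;
      C = foreach G generate COUNT_STAR(AA) as num_rows;

      -- C has exactly one row, so C.num_rows can be used as a scalar.
      -- The cast avoids integer division; the result is roughly 1000 rows.
      BB = SAMPLE AA (double)1000/C.num_rows;

      dump BB;
      ```

      The sample size is still approximate, because each row is kept or dropped independently.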
      

      Viraj

      This Jira has been changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.
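
      To illustrate what a different sampling algorithm could offer, the sketch below draws a sample of exactly 100 rows by ordering on a random key. It is not the mechanism this issue proposes, only a workaround expressed in existing Pig Latin; the aliases and the size of 100 are assumptions:

      ```pig
      A = load '/user/viraj/test' using PigStorage() as (a, b, c);

      -- Attach a random key to every row.
      R = foreach A generate RANDOM() as r, a, b, c;

      -- Sort by the random key and keep the first 100 rows.
      S = order R by r;
      T = limit S 100;

      -- Drop the helper key again.
      B = foreach T generate a, b, c;

      dump B;
      ```

      Unlike SAMPLE, which returns an approximate number of rows, this returns exactly 100 rows when the input has at least that many, at the cost of a full sort.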

      This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

          People

            Assignee: Unassigned
            Reporter: Viraj Bhat
