Pig
  1. Pig
  2. PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I have a script which takes in a command line parameter.

      pig -p number=100 script.pig
      

      The script contains the following parameters:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
      

      Realistic use cases of SAMPLE require statisticians to calculate SAMPLE data on demand.

      Ideally I would like to calculate SAMPLE from within Pig script without having to run one Pig script first get it's results and another to pass the results.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
      

      Viraj

      Change this Jira to only track sampling algorithm. PIG-1926 is opened to track limit/sample taking scalar.

      This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

        Issue Links

          Activity

          No work has yet been logged on this issue.

            People

            • Assignee:
              Unassigned
              Reporter:
              Viraj Bhat
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Development