Pig / PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I have a script which takes in a command line parameter.

      pig -p number=100 script.pig
      

      The script contains the following statements, which use that parameter:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
      

      Realistic use cases of SAMPLE require statisticians to calculate the sampling fraction on demand.

      Ideally I would like to calculate the SAMPLE fraction from within a Pig script, without having to run one Pig script first to get its results and then a second script to which those results are passed.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
      

      Viraj

      This Jira has been changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.
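
      For comparison, here is a minimal sketch of how the ideal use case might look once SAMPLE accepts a scalar expression, which is the feature PIG-1926 tracks. It assumes a Pig release that includes that change. The aliases G and N and the target of roughly 1000 rows are illustrative and not part of the original report. The cast to double matters because an integer division such as 1/100 would evaluate to 0.

      ```pig
      -- Sketch only: assumes a Pig release that includes the PIG-1926 change.
      -- G, N and the target of 1000 rows are hypothetical.
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      G = group AA all;
      N = foreach G generate COUNT_STAR(AA) as num_rows;
      -- Cast to double so the fraction is not truncated by integer division.
      BB = SAMPLE AA (double)1000/N.num_rows;
      dump BB;
      ```

      Because SAMPLE selects each row with the given probability, the number of rows returned is only approximately the target.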

      This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

        Issue Links

          Activity

          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3224 [ PIG-3224 ]
          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3221 [ PIG-3221 ]
          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3225 [ PIG-3225 ]
          Daniel Dai made changes -
          Description Updated the closing note from "This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011" to "This is a candidate project for Google summer of code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012"; the rest of the description is unchanged.
          Daniel Dai made changes -
          Labels gsoc2011 gsoc2012
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Daniel Dai made changes -
          Summary "SAMPLE command should accept parameters" changed to "SAMPLE command should accept parameters to specify alternative sampling algorithm"
          Description Replaced the line "Limit should has the same case." with "Change this Jira to only track sampling algorithm. PIG-1926 is opened to track limit/sample taking scalar."; the rest of the description is unchanged.
          Daniel Dai made changes -
          Labels gsoc2011
          Description Appended "Limit should has the same case." and "This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011" after the signature; the rest of the description is unchanged.
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Olga Natkovich made changes -
          Fix Version/s 0.9.0 [ 12315191 ]
          Olga Natkovich made changes -
          Field Original Value New Value
          Fix Version/s 0.9.0 [ 12315191 ]
          Viraj Bhat created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Viraj Bhat
            • Votes:
              0
              Watchers:
              2
