Pig / PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I have a script which takes in a command line parameter.

      pig -p number=100 script.pig
      

      The script contains the following statements:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
      

      In realistic use cases, statisticians need to calculate the SAMPLE argument on demand.

      Ideally, I would like to calculate the SAMPLE argument from within the Pig script, without having to run one Pig script first, get its results, and pass them to a second script.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
      

      Viraj

      This Jira has been changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.

      This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

        Issue Links

          Activity

          Vicki Fu added a comment -

          Hi, I have added this ticket to my GSoC 2013 proposal.
          My first draft is here; could you please give me some feedback?
          http://vickifu.info/?p=29

          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3224 [ PIG-3224 ]
          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3221 [ PIG-3221 ]
          Gianmarco De Francisci Morales made changes -
          Link This issue relates to PIG-3225 [ PIG-3225 ]
          Daniel Dai made changes -
          Description: updated the Google Summer of Code reference from 2011 (http://wiki.apache.org/pig/GSoc2011) to 2012 (https://cwiki.apache.org/confluence/display/PIG/GSoc2012); the rest of the description is unchanged.
          Daniel Dai made changes -
          Labels gsoc2011 gsoc2012
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Dmitriy V. Ryaboy added a comment -

          When making changes to how SAMPLE works, please keep in mind PIG-2014 (letting the optimizer push this operator around is clearly dangerous).

          Daniel Dai made changes -
          Summary: changed from "SAMPLE command should accept parameters" to "SAMPLE command should accept parameters to specify alternative sampling algorithm"
          Description: replaced "Limit should has the same case." with "Change this Jira to only track sampling algorithm. PIG-1926 is opened to track limit/sample taking scalar."; the rest of the description is unchanged.
          Daniel Dai added a comment -

          I think it is better to split this issue into two: one for scalar support, the other for the sampling algorithm.

          The first part is, yes, mostly front-end work.

          For the second part, I think we can allow SAMPLE to take an optional argument. The scope of work is still open; we need to decide which algorithm to use. AFAIK, Ciemiewicz is already working on reservoir sampling, so we may need to integrate it into our framework.

          Gianmarco De Francisci Morales added a comment -

          To support the simple use case one would simply need to allow expressions in the SAMPLE argument.
          This should mainly require changes to the front-end I assume.

          For more complex techniques like reservoir one should implement a new (physical?) operator.
          What is the exact scope/goal of the project?

          Maybe it could be split in 2 parts. Supporting sampling with variable arguments as the first part, and adding more complex techniques as a second part?

          Daniel Dai made changes -
          Labels gsoc2011
          Description: appended "Limit should has the same case. This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011"; the rest of the description is unchanged.
          Olga Natkovich made changes -
          Fix Version/s 0.10 [ 12316246 ]
          Olga Natkovich made changes -
          Fix Version/s 0.9.0 [ 12315191 ]
          David Ciemiewicz added a comment -

          An alternative might be to implement SAMPLE using Reservoir Sampling techniques. That way you never have to adjust the sampling probability: as long as N is greater than the sample size K, you'll always get exactly K elements.

          http://en.wikipedia.org/wiki/Reservoir_sampling

          Actually, to implement a scalable, parallel version of Reservoir Sampling that would work with Accumulator and Combiner interfaces, Weighted Reservoir Sampling (WRS) is required:

          http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
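
The two ideas in this comment can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not Pig code and not the implementation referred to in this thread; the function names (reservoir_sample, weighted_partial, merge_partials) are hypothetical. The first function is the classic single-pass reservoir (Algorithm R). The other two follow the key-based weighted scheme as I understand it: each item gets a random key u ** (1 / weight), each partition keeps its K largest keys, and partial results are merged by keeping the K largest keys again, which is the kind of associative step a combiner needs.

```python
import heapq
import random


def reservoir_sample(items, k, rng=random):
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(items):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def weighted_partial(weighted_items, k, rng=random):
    """Per-partition step: keep the k (key, item) pairs with the largest keys.

    weighted_items yields (item, weight) pairs with weight > 0.
    """
    keyed = ((rng.random() ** (1.0 / weight), item) for item, weight in weighted_items)
    return heapq.nlargest(k, keyed, key=lambda pair: pair[0])


def merge_partials(partials, k):
    """Combine step: the k largest keys across all partial results."""
    merged = (pair for partial in partials for pair in partial)
    return heapq.nlargest(k, merged, key=lambda pair: pair[0])
```

With all weights set to 1.0 the key is just a uniform random number, so the merged result is a uniform sample of K items across all partitions. If the total input has fewer than K items, both approaches return everything they saw.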

          Olga Natkovich made changes -
          Field Original Value New Value
          Fix Version/s 0.9.0 [ 12315191 ]
          Olga Natkovich added a comment -

          A "maybe" for 0.9

          Thejas M Nair added a comment -

          Once the first use case is supported (an expression as the SAMPLE parameter), the ideal use case will also work automatically, thanks to the 'relation as scalar' feature introduced in PIG-1434. Until this feature is available, a workaround is to use a FILTER statement with a UDF that returns true based on the probability argument.
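
The logic of that workaround can be sketched as below. This is only a Python illustration of the predicate such a UDF might implement, assuming each record is kept independently with the given probability; the names make_sample_predicate and sample_by_filter are hypothetical and are not part of Pig.

```python
import random


def make_sample_predicate(probability, rng=random):
    """Return a predicate that keeps each record independently with the given probability."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")

    def keep(_record):
        # rng.random() is in [0.0, 1.0), so probability 1.0 keeps everything
        # and probability 0.0 keeps nothing.
        return rng.random() < probability

    return keep


def sample_by_filter(records, probability, rng=random):
    """Filter records through the sampling predicate, as a FILTER statement would."""
    keep = make_sample_predicate(probability, rng)
    return [record for record in records if keep(record)]
```

For the script in the description, the probability would be 1/100 when number=100. Because the probability is an ordinary value here, it could equally be computed from an earlier aggregate. Note that this kind of sampling returns the requested fraction only on average, not an exact count.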

          Viraj Bhat created issue -

            People

            • Assignee: Unassigned
            • Reporter: Viraj Bhat
            • Votes: 0
            • Watchers: 2
