Pig / PIG-1713

SAMPLE command should accept parameters to specify alternative sampling algorithm

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I have a script which takes in a command line parameter.

      pig -p number=100 script.pig
      

      The script contains the following statements:

      A = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      B = SAMPLE A 1/$number;
      
      dump B;
      

      Realistic use cases of SAMPLE require statisticians to calculate the sampling rate on demand, from the data itself.

      Ideally, I would like to calculate the SAMPLE rate within a single Pig script, without having to run one Pig script to get its results and then a second script that takes those results as a parameter.

      Ideal use case:

      A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);
      
      ...
      ...
      
      W = group X by col1;
      
      Z = foreach Y generate AVG(X);
      
      AA = load '/user/viraj/test' using PigStorage() as (a,b,c);
      
      BB = SAMPLE AA 1/Z;
      
      dump BB;
      

      Viraj

      This Jira has been changed to track only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.

      This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

          Activity

          Vicki Fu added a comment -

          Hi, I have added this ticket to my GSoC 2013 proposal.
          My first draft is here; could you please give me some feedback?
          http://vickifu.info/?p=29

          Dmitriy V. Ryaboy added a comment -

          When making changes to how SAMPLE works, please keep in mind PIG-2014 (letting the optimizer push this operator around is clearly dangerous).

          Daniel Dai added a comment -

          I think it is better to split this issue into two: one for the scalar argument, the other for the sampling algorithm.

          The first part is, yes, mostly frontend work.

          For the second part, I think we can allow SAMPLE to take an optional argument. The scope of work is still open; we need to decide which algorithm to use. AFAIK, Ciemiewicz is already working on reservoir sampling, so we may need to integrate it into our framework.

          Gianmarco De Francisci Morales added a comment -

          To support the simple use case, one would simply need to allow expressions in the SAMPLE argument.
          I assume this should mainly require changes to the front-end.

          For more complex techniques like reservoir sampling, one should implement a new (physical?) operator.
          What is the exact scope/goal of the project?

          Maybe it could be split into two parts: supporting sampling with variable arguments as the first part, and adding more complex techniques as the second part?

          David Ciemiewicz added a comment -

          An alternative might be to implement SAMPLE using Reservoir Sampling techniques. This way you never have to adjust the sampling probability: as long as N is greater than the sample size K, you'll always get exactly K elements.

          http://en.wikipedia.org/wiki/Reservoir_sampling
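For illustration, a minimal single-pass reservoir sampler can be sketched as below. This is Python rather than Pig Latin, and it is not how Pig implements SAMPLE; the function name `reservoir_sample` and its parameters are hypothetical, chosen only to show why the result has exactly K elements whenever N is greater than K.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration only. If the stream holds fewer
    than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1000), 5, random.Random(0)))
```

Note that the sampling probability never appears as an input: only the target size K does, which is the property described above.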

          Actually, to implement a scalable, parallel version of Reservoir Sampling that would work with Accumulator and Combiner interfaces, Weighted Reservoir Sampling (WRS) is required:

          http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
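One way to see why a key-based weighted scheme parallelizes well is sketched below in Python, following my understanding of the Efraimidis-Spirakis approach: each item gets a random key u^(1/w), and the sample is the K items with the largest keys. Because "top K by key" can be computed per partition and then merged, it fits an accumulate/combine pattern. The names `wrs_partial` and `wrs_merge` are hypothetical, and the mapping onto Pig's Accumulator and Combiner interfaces is an assumption, not something this ticket specifies.

```python
import heapq
import random


def wrs_partial(items, k, rng=None, weight=lambda item: 1.0):
    """Return up to k (key, item) pairs with the largest random keys.

    Hypothetical helper: key = u ** (1 / w) with u uniform in [0, 1) and
    w > 0 the item's weight. With all weights equal to 1 this reduces to
    uniform sampling without replacement.
    """
    rng = rng or random.Random()
    keyed = ((rng.random() ** (1.0 / weight(item)), item) for item in items)
    return heapq.nlargest(k, keyed, key=lambda pair: pair[0])


def wrs_merge(partials, k):
    """Combine per-partition results by keeping the k largest keys overall."""
    merged = (pair for partial in partials for pair in partial)
    return heapq.nlargest(k, merged, key=lambda pair: pair[0])


if __name__ == "__main__":
    rng = random.Random(0)
    partitions = [range(0, 100), range(100, 200), range(200, 300)]
    partials = [wrs_partial(p, 5, rng) for p in partitions]
    print([item for _, item in wrs_merge(partials, 5)])
```

The merge step only needs the K best pairs from each partition, so each partial result stays bounded in size regardless of how many records the partition saw.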

          Olga Natkovich added a comment -

          A "maybe" for 0.9

          Thejas M Nair added a comment -

          Once the first use case is supported (an expression as the SAMPLE argument), the ideal use case will also automatically work, thanks to the 'relation as scalar' feature introduced in PIG-1434. Until this feature is available, a workaround is to use a filter statement with a UDF that returns true based on the probability argument.
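The logic of such a filter-based workaround can be sketched as follows. This is a Python illustration of the idea only, not an actual Pig UDF (a real one would be written against Pig's UDF API); `make_sample_filter` and `sample_by_filter` are hypothetical names.

```python
import random


def make_sample_filter(probability, rng=None):
    """Build a predicate that keeps each record independently with `probability`.

    Hypothetical helper for illustration; the probability can be any value
    computed at run time, for example 1 / (some aggregate of the data).
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    rng = rng or random.Random()

    def keep(_record):
        # random() is uniform in [0, 1), so this is true with the given probability.
        return rng.random() < probability

    return keep


def sample_by_filter(records, probability, rng=None):
    """Apply the predicate as a filter, the way a FILTER statement would."""
    keep = make_sample_filter(probability, rng)
    return [record for record in records if keep(record)]


if __name__ == "__main__":
    number = 100  # plays the role of $number in the script above
    print(len(sample_by_filter(range(10000), 1.0 / number, random.Random(42))))
```

Unlike reservoir sampling, this approach yields a sample whose size varies from run to run; only its expected size is fixed by the probability.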


            People

            • Assignee: Unassigned
            • Reporter: Viraj Bhat
            • Votes: 0
            • Watchers: 2
