Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Implement a bootstrap sampling option ( http://en.wikipedia.org/wiki/Bootstrap_(statistics) ) in Pig's SAMPLE operator.

          Activity

          azaroth Gianmarco De Francisci Morales added a comment - Here is an example: http://hortonworks.com/blog/bootstrap-sampling-with-apache-pig
          vickifu Vicki Fu added a comment -

          Hi, I would like to work on this ticket for GSoC 2013.
          I have already gone through Pig's sampling code. Pig currently has two samplers: RandomSampleLoader and PoissonSampleLoader. RandomSampleLoader is the basic method: it allocates a buffer of numSamples tuples, scans the input, and inserts tuples into the buffer at randomly chosen positions. PoissonSampleLoader uses the Poisson cumulative distribution function to estimate the probability that a partition has at most k samples.
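The buffer-based approach described for RandomSampleLoader reads like classic reservoir sampling. The following is a minimal Python sketch of that general technique, not Pig's actual Java code; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random

def reservoir_sample(items, num_samples, rng=None):
    """Return up to num_samples items chosen uniformly at random from an iterable.

    Hypothetical illustration of reservoir sampling; not Pig's implementation.
    """
    rng = rng or random.Random()
    buffer = []
    for i, item in enumerate(items):
        if i < num_samples:
            # Fill the buffer with the first num_samples items.
            buffer.append(item)
        else:
            # Pick a position uniformly in [0, i]; keep the item only if
            # that position falls inside the buffer.
            j = rng.randint(0, i)
            if j < num_samples:
                buffer[j] = item
    return buffer
```

Only one pass over the input is needed, and memory use is bounded by the buffer size, which is why this pattern suits inputs whose length is not known in advance.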

          Bootstrapping is a method for deriving robust estimates of standard errors and confidence intervals for estimates such as the mean, median, proportion, odds ratio, correlation coefficient, or regression coefficient. So the sampler will keep statistical information during sampling.

          Algorithm for bootstrap sampling:
          1. Construct an empirical probability distribution by placing probability 1/n on each of the n points of the sample. This is the nonparametric maximum likelihood estimate of the population distribution, w.
          2. Draw a random sample of size n with replacement from this distribution. This is a 'resample'.
          3. Calculate the statistic of interest L on the resample.
          4. Repeat steps 2 and 3 more than n times.
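The steps above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the function name `bootstrap_replicates`, its parameters, and the example data are hypothetical, and nothing here reflects how Pig would implement the operator.

```python
import random
import statistics

def bootstrap_replicates(sample, statistic, num_resamples, rng=None):
    """Return the statistic computed on num_resamples bootstrap resamples."""
    rng = rng or random.Random()
    n = len(sample)
    replicates = []
    for _ in range(num_resamples):
        # Steps 1 and 2: draw n points with replacement, each original
        # point having probability 1/n on every draw.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        # Step 3: compute the statistic of interest on the resample.
        replicates.append(statistic(resample))
    # Step 4 is the loop itself: one replicate per resample.
    return replicates

# Hypothetical example data.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
means = bootstrap_replicates(data, statistics.mean, 1000, random.Random(42))
```

The spread of `means` approximates the sampling variability of the mean of `data`.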

          For bootstrap sampling, we can currently use an R or Python script directly. But that is not a big data solution, and it depends on R and the related packages being available.

          Implementation of bootstrap sampling:
          1. Add parameters to support the new sampling method.
          2. BootStrapSampleLoader will extend RandomSampleLoader.
          3. Several statistics will be collected: standard deviation, mean, confidence interval, and so on.
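As a rough illustration of the statistics mentioned in item 3, the sketch below summarizes a list of bootstrap replicate values with their mean, standard deviation, and a simple percentile confidence interval. The function name `summarize_replicates` and the index-based percentile rule are assumptions made for this example; the proposal does not say which interval method would be used.

```python
import statistics

def summarize_replicates(replicates, confidence=0.95):
    """Summarize bootstrap replicate values (needs at least two values).

    Uses a simple percentile interval; other interval methods exist.
    """
    ordered = sorted(replicates)
    m = len(ordered)
    alpha = 1.0 - confidence
    # Truncate fractional positions to pick the interval endpoints.
    lower_index = int((alpha / 2) * (m - 1))
    upper_index = int((1 - alpha / 2) * (m - 1))
    return {
        "mean": statistics.mean(ordered),
        "stddev": statistics.stdev(ordered),  # bootstrap standard error
        "ci": (ordered[lower_index], ordered[upper_index]),
    }

# Hypothetical replicate values, standing in for real bootstrap output.
summary = summarize_replicates(list(range(101)))
```

The standard deviation of the replicates is the usual bootstrap estimate of the standard error of the statistic.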

          My plan is to implement bootstrap sampling, stratified sampling, and reservoir sampling. (I am not sure everything can be finished within the GSoC timeframe, but I can keep working on it after the summer.)

          Thanks
          Yu Fu
          PhD student in UMBC.

          References
          Davison, A. C., and D. V. Hinkley. 2006. Bootstrap Methods and their Application. Cambridge University Press.
          Shao, J., and D. Tu. 1995. The Jackknife and Bootstrap. New York: Springer.
          MAD Skills: New Analysis Practices for Big Data. http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf
          On the Choice of m in the m out of n Bootstrap and its Application to Confidence Bounds for Extreme Percentiles. http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf

          azaroth Gianmarco De Francisci Morales added a comment -

          Hi Vicky,

          Thanks for your interest in this project idea.

          Given that Pig is not a statistics-only tool, my current understanding is that we want the samples to be materialized, because they can be used, e.g., to train an ensemble classifier.
          Of course the case where we are only interested in statistics can be optimized.
          Maybe a UDF would do the trick in this latter case.
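To illustrate what "materialized" means here, the sketch below returns k complete resampled datasets rather than a single statistic, so that each one could be handed to a separate model in an ensemble. It is a small in-memory Python illustration; the name `materialize_resamples` and the example rows are hypothetical and say nothing about how Pig would store the output.

```python
import random

def materialize_resamples(rows, k, rng=None):
    """Return k bootstrap resamples, each as a full list of rows."""
    rng = rng or random.Random()
    n = len(rows)
    # Each resample draws n whole rows with replacement, so the fields
    # of a record always stay together.
    return [[rows[rng.randrange(n)] for _ in range(n)] for _ in range(k)]

# Hypothetical (feature, label) rows.
rows = [(1.0, "a"), (2.0, "b"), (3.0, "a"), (4.0, "b")]
resamples = materialize_resamples(rows, 3, random.Random(1))
```

Keeping the resamples costs k times the storage of the input, which is the trade-off against the statistics-only case.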

          vickifu Vicki Fu added a comment -

          Thank you, Gianmarco.
          The output of the sampling is k sets of resampled data. For small data, this could be done in R with a matrix as the input:
          # R sketch; note that apply(A, 2, ...) resamples each column independently
          A <- matrix(seq(1, 100), 10, 10)  # 10 x 10 input matrix
          k <- 10                           # number of bootstrap replicate sets
          replicate(k, apply(A, 2, sample, replace = TRUE))

          Yes, you are right, the statistics can be collected by a UDF.
          My plan is to implement bootstrap, reservoir, and stratified sampling, in that order, in this project.
          Please correct me if my understanding is not right.
          Thanks
          Vicky

          vickifu Vicki Fu added a comment -

          Hi Gianmarco,
          I have finished the first draft of my GSoC 2013 proposal. Would you please give me some feedback?
          http://vickifu.info/?p=29
          Thanks
          Vicky


            People

            • Assignee: Unassigned
            • Reporter: azaroth Gianmarco De Francisci Morales
            • Votes: 0
            • Watchers: 4
