Hi, I would like to work on this ticket for GSoC 2013.

I have already gone through Pig's sampling code. Pig currently has two sample loaders: RandomSampleLoader and PoissonSampleLoader. RandomSampleLoader is the basic sampling method: it allocates a buffer for numSamples tuples, scans the input, and inserts tuples into the buffer at random positions. PoissonSampleLoader uses the Poisson cumulative distribution function to estimate the probability that a partition has at most k samples.
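To make the two ideas above concrete, here is a minimal Python sketch. It is not Pig's actual code: the function names reservoir_sample and poisson_cdf are my own, and I am assuming the buffer-based sampling described above behaves like classic reservoir sampling.

```python
import math
import random


def reservoir_sample(records, num_samples, rng=None):
    """Keep a fixed-size buffer while scanning the input once.

    Illustrative only; the name and signature are hypothetical.
    """
    if rng is None:
        rng = random.Random()
    buffer = []
    for seen, record in enumerate(records, start=1):
        if len(buffer) < num_samples:
            # Fill the buffer with the first num_samples records.
            buffer.append(record)
        else:
            # Pick a slot in [0, seen); the record replaces a buffered
            # one only when the slot falls inside the buffer.
            slot = rng.randrange(seen)
            if slot < num_samples:
                buffer[slot] = record
    return buffer


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson variable with rate lam."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))
```

Every record ends up in the buffer with equal probability, and the input is read only once, which is why this style of sampling suits a loader that cannot hold the whole input in memory.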

Bootstrapping is a method for deriving robust estimates of standard errors and confidence intervals for estimates such as the mean, median, proportion, odds ratio, correlation coefficient or regression coefficient. A bootstrap sampler therefore needs to keep statistical information while it samples.

Algorithm for bootstrap sampling:

1. Construct an empirical probability distribution by placing probability 1/n on each of the n points of the sample. This is the nonparametric maximum likelihood estimate of the population distribution, w.

2. Draw a random sample of size n with replacement from that distribution. This is a ‘resample’.

3. Calculate the statistic of interest, L, on the resample.

4. Repeat steps 2 and 3 more than n times.
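The steps above can be sketched in a few lines of Python. This is only an illustration of the resampling loop under my own assumptions: the function name bootstrap_replicates, its parameters, and the sample data are made up, and the number of repetitions is left to the caller.

```python
import random
import statistics


def bootstrap_replicates(sample, statistic, num_resamples, rng=None):
    """Return one value of `statistic` per bootstrap resample.

    Hypothetical helper for illustration, not part of Pig.
    """
    if rng is None:
        rng = random.Random()
    n = len(sample)
    replicates = []
    for _ in range(num_resamples):
        # Step 2: draw n items with replacement, i.e. each item of the
        # original sample is picked with probability 1/n (step 1).
        resample = rng.choices(sample, k=n)
        # Step 3: compute the statistic of interest on the resample.
        replicates.append(statistic(resample))
    # Step 4 is the loop itself.
    return replicates


# Made-up data, used only to show the call.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]
reps = bootstrap_replicates(data, statistics.mean, 1000, random.Random(42))
```

The spread of the values in reps is what gives the bootstrap estimate of the standard error of the mean.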

For bootstrap sampling, we can currently use an R or Python script directly. But that is not a big data solution, and it also depends on R and the related packages being available.

Implementation of bootstrap sampling:

1. Add parameters to support the new sampling method.

2. BootStrapSampleLoader will extend RandomSampleLoader.

3. Several statistics will be collected: STDDEV, AVE (mean), confidence interval and so on.
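As a rough idea of what collecting these statistics could look like, here is a small Python sketch that summarizes a list of bootstrap replicates. The function name summarize_replicates and the choice of a percentile confidence interval are my assumptions; other interval types would also be possible.

```python
import math
import statistics


def summarize_replicates(replicates, confidence=0.95):
    """Mean, standard deviation and percentile interval of replicates.

    Hypothetical helper; needs at least two replicates.
    """
    ordered = sorted(replicates)
    last = len(ordered) - 1
    alpha = 1.0 - confidence
    # Percentile interval: cut alpha/2 of the replicates off each tail,
    # rounding outwards so the interval is not too narrow.
    low_index = int(math.floor((alpha / 2.0) * last))
    high_index = int(math.ceil((1.0 - alpha / 2.0) * last))
    return {
        "mean": statistics.mean(ordered),
        # The standard deviation of the replicates is the bootstrap
        # estimate of the standard error of the statistic.
        "stddev": statistics.stdev(ordered),
        "ci": (ordered[low_index], ordered[high_index]),
    }
```

Given the replicates of a statistic, the returned stddev is its estimated standard error and ci is an approximate confidence interval at the requested level.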

My plan is to implement bootstrap sampling, stratified sampling and reservoir sampling. (I am not sure everything can be finished within the GSoC timeframe, but I can keep working on it after the summer.)

Thanks

Yu Fu

PhD student at UMBC.

References

Davison, A. C., and D. V. Hinkley. 2006. Bootstrap Methods and their Application. Cambridge: Cambridge University Press.

Shao, J., and D. Tu. 1995. The Jackknife and Bootstrap. New York: Springer.

MAD Skills: New Analysis Practices for Big Data http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf

On the Choice of m in the m out of n Bootstrap and its Application to Confidence Bounds for Extreme Percentiles http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf

Here is an example: http://hortonworks.com/blog/bootstrap-sampling-with-apache-pig