## Description

Implement a map/reduce version of the one-sample t-test, which tests whether a sample of n subjects comes from a population whose mean equals a specified value.

For a large dataset, say with n in the millions of rows, one can test whether the sample (large as it is) comes from a population with the specified mean.

Input:

1) specified population mean to be tested against

2) hypothesis direction, i.e. "two.sided", "less", or "greater"

3) confidence level or alpha

4) flag to indicate paired or not paired

The procedure is as follows:

1. Use Map/Reduce to calculate the mean of the sample.

2. Use Map/Reduce to calculate the standard error of the sample mean.

3. Use Map/Reduce to calculate the t statistic

4. Estimate the degrees of freedom depending on equal sample variances
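The procedure above can be sketched in plain Python, without Hadoop or Mahout, as a minimal illustration rather than a proposed implementation. The function names (`map_partition`, `combine`, `one_sample_t`) are hypothetical. Each mapper is assumed to summarize its partition as a (count, mean, sum of squared deviations) triple, and the reducer merges triples pairwise. For the unpaired one-sample case the degrees of freedom are simply n - 1.

```python
import math
from functools import reduce


def map_partition(values):
    """Map step: summarize one partition as (count, mean, M2).

    M2 is the sum of squared deviations from the partition mean,
    accumulated with Welford's online update.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def one_sample_t(partitions, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n, mean, m2 = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    if n < 2:
        raise ValueError("need at least two observations")
    variance = m2 / (n - 1)           # unbiased sample variance
    std_err = math.sqrt(variance / n)  # standard error of the sample mean
    if std_err == 0.0:
        raise ValueError("sample variance is zero; t is undefined")
    return (mean - mu0) / std_err, n - 1
```

Because `combine` is associative, the same function could serve as a combiner and as the reducer, so only three numbers per partition would cross the network.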

Output:

1) The value of the t-statistic.

2) The p-value for the test.

3) Flag that is true if the null hypothesis can be rejected with confidence 1 - alpha; false otherwise.
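As a rough sketch of how the three outputs could be produced from a computed t statistic: the function name `decide` is hypothetical, and the p-value here uses a standard normal approximation to the t distribution. That approximation is an assumption of this sketch, reasonable only when the degrees of freedom are large (as with millions of rows); an exact test would need a Student t CDF from a statistics library.

```python
from statistics import NormalDist


def decide(t_stat, alternative="two.sided", alpha=0.05):
    """Return (t statistic, approximate p-value, reject flag).

    Assumption: the standard normal distribution stands in for the
    t distribution, which is only appropriate for large samples.
    """
    z = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":
        p_value = z.cdf(t_stat)
    elif alternative == "greater":
        p_value = 1.0 - z.cdf(t_stat)
    elif alternative == "two.sided":
        p_value = 2.0 * (1.0 - z.cdf(abs(t_stat)))
    else:
        raise ValueError("alternative must be 'two.sided', 'less' or 'greater'")
    return t_stat, p_value, p_value < alpha
```

For example, `decide(-3.0, "less", 0.05)` would set the reject flag, while the same statistic with `"greater"` would not.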

References

http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html

I am not sure that I see the value here. All you need for this calculation is the means, the squared differences and the counts.

Do we really need this in Mahout when 3 lines of Pig suffice?