# Implementation of Single Sample T-Test using Map Reduce/Mahout

## Details

• Type: New Feature
• Status: Closed
• Priority: Major
• Resolution: Fixed
• Affects Version/s: 0.1
• Fix Version/s: None
• Component/s:
• Labels:
• Environment: Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x

## Description

Implement a map/reduce version of the single sample t test to test whether a sample of n subjects comes from a population in which the mean equals a particular value.

For a large dataset, say n millions of rows, one can test whether the sample (large as it is) comes from a population with the specified mean.

Input:
1) the specified population mean to test against
2) hypothesis direction: one of "two.sided", "less", or "greater"
3) confidence level or alpha
4) flag to indicate whether the samples are paired

The procedure is as follows:
1. Use Map/Reduce to calculate the mean of the sample.
2. Use Map/Reduce to calculate the standard error of the sample mean.
3. Use Map/Reduce to calculate the t statistic.
4. Determine the degrees of freedom (n - 1 for a single sample; whether the variances are equal only matters in the two-sample case).
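
For reference, the steps above correspond to the standard one-sample formulas. The symbols are assumptions introduced here and do not appear in the original description: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $\mu_0$ the specified population mean, and $\nu$ the degrees of freedom.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\mathrm{SE} = \frac{s}{\sqrt{n}}, \qquad
t = \frac{\bar{x} - \mu_0}{\mathrm{SE}}, \qquad
\nu = n - 1
$$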

Output
1) The value of the t-statistic.
2) The p-value for the test.
3) Flag that is true if the null hypothesis can be rejected with confidence 1 - alpha; false otherwise.
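
A minimal sketch of how the inputs, procedure, and outputs above might fit together, written in plain Python rather than as a Hadoop or Mahout job. Every name here (`map_partition`, `combine`, `t_cdf`, `one_sample_t_test`) is hypothetical, the partitions are ordinary in-memory lists standing in for input splits, and the p-value comes from numerically integrating the Student t density instead of a library lookup. It illustrates the idea under those assumptions and is not the proposed implementation.

```python
import math
from functools import reduce


def map_partition(rows):
    """Mapper stand-in: reduce one input split to (count, sum, sum of squares)."""
    n, s, ss = 0, 0.0, 0.0
    for x in rows:
        n += 1
        s += x
        ss += x * x
    return (n, s, ss)


def combine(a, b):
    """Reducer stand-in: partial statistics are merged by adding them."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def t_cdf(t, df, steps=2000):
    """Student t CDF via Simpson's rule on the density (standard library only)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = abs(t) / steps
    area = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    half = area * h / 3  # integral of the density from 0 to |t|
    return 0.5 + half if t >= 0 else 0.5 - half


def one_sample_t_test(partitions, mu0, alternative="two.sided", alpha=0.05):
    """Return (t statistic, p-value, reject flag) for H0: population mean == mu0."""
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    # Sum-of-squares shortcut; it can lose precision on very large inputs.
    var = (ss - n * mean * mean) / (n - 1)
    se = math.sqrt(var / n)          # standard error of the sample mean
    t = (mean - mu0) / se
    cdf = t_cdf(t, n - 1)            # degrees of freedom = n - 1
    if alternative == "less":
        p = cdf
    elif alternative == "greater":
        p = 1 - cdf
    else:                            # "two.sided"
        p = 2 * min(cdf, 1 - cdf)
    return t, p, p < alpha
```

For example, `one_sample_t_test([[2, 4], [6, 8, 10]], 3.0)` treats the two inner lists as separate splits and returns the three outputs listed above. Because each split collapses to three numbers, only those triples would need to cross the network in a real job.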

## Activity

Dev Lakhani created issue -
Ted Dunning added a comment -

I am not sure that I see the value here. All you need for this calculation is the means, the squared differences and the counts.

Do we really need this in Mahout when 3 lines of Pig suffice?

Dev Lakhani added a comment -

I guess this was a naive attempt at creating an MR version of the Apache Commons Math statistics package. Following this implementation, the idea is to go on to extend to ANOVAs, Wilcoxon tests, Pearson correlations, Kolmogorov-Smirnov and other R-like features (but in MR).

Yup, it could be done in Pig, but it would likely need a UDF; e.g., the TTest in Commons Math defines the TDistribution for lookup of statistical values, so perhaps it's better to do the whole thing in Java. This also makes it easier to test and control/tune the MR jobs.

I was just trying to test the waters really and see if there is support for this; if so, there are plenty of basic stats tests that can be implemented for big data. This will require a bit of help from the community. If not, please feel free to close this entry.

Cheers

Sebastian Schelter added a comment -

Closing this as it hasn't been picked up for several months.

| Field | Original Value | New Value |
| --- | --- | --- |
| Status | Open [ 1 ] | Resolved [ 5 ] |
| Resolution | | Fixed [ 1 ] |
| Status | Resolved [ 5 ] | Closed [ 6 ] |

Affects Version/s 0.1 [ 12312976 ] Affects Version/s Backlog [ 12318886 ] Fix Version/s Backlog [ 12318886 ]

| Transition | Time In Source Status | Execution Times | Last Executer | Last Execution Date |
| --- | --- | --- | --- | --- |
| Open → Resolved | 324d 21h 13m | 1 | Sebastian Schelter | 11/Mar/13 15:39 |
| Resolved → Closed | 328d 16h 27m | 1 | Suneel Marthi | 03/Feb/14 08:06 |

## People

• Assignee: Unassigned
• Reporter: Dev Lakhani
• Votes: 0
• Watchers: 3

## Dates

• Created:
Updated:
Resolved:

## Time Tracking

• Estimated: 672h
• Remaining: 672h
• Logged: Not Specified