Mahout / MAHOUT-1000

Implementation of Single Sample T-Test using Map Reduce/Mahout

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1
    • Fix Version/s: None
    • Component/s: Math
    • Labels:
    • Environment:

      Linux, Mac OS, Hadoop 0.20.2, Mahout 0.x

      Description

      Implement a map/reduce version of the single sample t test to test whether a sample of n subjects comes from a population in which the mean equals a particular value.

      For a large dataset, say n in the millions of rows, one can test whether the sample (large as it is) comes from a population with the specified mean.

      Input:
      1) specified population mean to be tested against
      2) hypothesis direction, i.e. "two.sided", "less" or "greater"
      3) confidence level or alpha
      4) flag to indicate paired or not paired

      The procedure is as follows:
      1. Use Map/Reduce to calculate the mean of the sample.
      2. Use Map/Reduce to calculate the standard error of the sample mean.
      3. Use Map/Reduce to calculate the t statistic.
      4. Estimate the degrees of freedom depending on equal sample variances
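
      The first three steps can be sketched as follows. This is a minimal illustration in plain Python, not Mahout or Hadoop code: the names map_partial, combine and t_statistic and the sample data are hypothetical. It assumes each mapper summarises its input split as (count, sum, sum of squares) and that the reducer adds those summaries component-wise; for the single-sample case the degrees of freedom are taken as n - 1.

```python
from functools import reduce
from math import sqrt


def map_partial(values):
    """Mapper (hypothetical): summarise one split as (count, sum, sum of squares)."""
    n, s, ss = 0, 0.0, 0.0
    for x in values:
        n += 1
        s += x
        ss += x * x
    return n, s, ss


def combine(a, b):
    """Reducer/combiner (hypothetical): partial summaries add component-wise."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def t_statistic(summary, mu0):
    """Return (t, degrees of freedom) for H0: population mean == mu0."""
    n, s, ss = summary
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)  # unbiased sample variance
    std_error = sqrt(variance / n)               # standard error of the sample mean
    return (mean - mu0) / std_error, n - 1


# Made-up data: three "input splits" standing in for HDFS blocks.
splits = [[5.1, 4.9, 5.6], [5.8, 6.0], [5.3, 5.5, 4.8]]
summary = reduce(combine, map(map_partial, splits))
t, df = t_statistic(summary, mu0=5.0)
print(t, df)
```

      The sum-of-squares form is convenient because it merges across splits in one pass, but it can lose precision when the mean is large relative to the spread; a real job would probably merge per-split means and squared deviations instead.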

      Output
      1) The value of the t-statistic.
      2) The p-value for the test.
      3) Flag that is true if the null hypothesis can be rejected with confidence 1 - alpha; false otherwise.
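
      A sketch of how the three outputs could be derived from a t statistic, under one loudly stated assumption: with n in the millions the Student's t distribution is very close to the standard normal, so the p-value here uses the normal CDF from Python's standard library rather than an exact t distribution. For small samples an exact t CDF would be needed. The function names p_value and t_test_output are hypothetical.

```python
from statistics import NormalDist


def p_value(t, alternative="two.sided"):
    """Approximate p-value for a t statistic (normal approximation, large n only)."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative == "less":
        return z.cdf(t)
    if alternative == "greater":
        return 1.0 - z.cdf(t)
    if alternative == "two.sided":
        return 2.0 * (1.0 - z.cdf(abs(t)))
    raise ValueError("alternative must be 'two.sided', 'less' or 'greater'")


def t_test_output(t, alpha=0.05, alternative="two.sided"):
    """Return (t statistic, p-value, reject flag) as listed in the Output section."""
    p = p_value(t, alternative)
    return t, p, p < alpha


# Example call with an illustrative t statistic of 2.5.
result = t_test_output(2.5, alpha=0.05, alternative="two.sided")
print(result)
```

      The flag is true when the p-value falls below alpha, which corresponds to rejecting the null hypothesis at confidence 1 - alpha.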

      References
      http://www.basic.nwu.edu/statguidefiles/ttest_unpaired_ass_viol.html

        Activity

        Ted Dunning added a comment -

        I am not sure that I see the value here. All you need for this calculation is the means, the squared differences and the counts.

        Do we really need this in Mahout when 3 lines of Pig suffice?

        Dev Lakhani added a comment -

        I guess this was a naive attempt at creating an MR version of the Apache Commons Math statistics package. Following this implementation, the idea is to go on to ANOVAs, Wilcoxon tests, Pearson correlations, Kolmogorov-Smirnov and other R-like features (but in MR).

        Yup, it could be done in Pig, but it would likely need a UDF: for example, the TTest in Commons Math uses the TDistribution to look up statistical values, so perhaps it's better to do the whole thing in Java. That also makes it easier to test and to control/tune the MR jobs.

        I was just trying to test the waters really and see if there is support for this; if so, there are plenty of basic stats tests that can be implemented for big data. This will require a bit of help from the community. If not, please feel free to close this entry.

        Cheers

        Sebastian Schelter added a comment -

        Closing this as it hasn't been picked up for several months.


          People

          • Assignee:
            Unassigned
            Reporter:
            Dev Lakhani
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 672h
              Remaining: 672h
              Logged: Not Specified
