Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.2
    • Fix Version/s: 3.3
    • Labels:
    • Environment:

      All

      Description

      Implement the Kendall's Tau which is a measure of Association/Correlation between ranked ordinal data.

      A basic description is available at http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient however the test implementation will follow that defined by "Handbook of Parametric and Nonparametric Statistical Procedures, Fifth Edition, Page 1393 Test 30, ISBN-10: 1439858012 | ISBN-13: 978-1439858011."

      The algorithm is proposed as follows.

      Given two rankings or permutations represented by a 2D matrix; columns indicate rankings (e.g. by an individual) and row are observations of each rank. The algorithm is to calculate the total number of concordant pairs of ranks (between columns), discordant pairs of ranks (between columns) and calculate the Tau defined as

      tau= (Number of concordant - number of discordant)/(n(n-1)/2)
      where n(n-1)/2 is the total number of possible pairs of ranks.

      The method will then output the tau value between -1 and 1 where 1 signifies a "perfect" correlation between the two ranked lists.

      Where ties exist within a ranking it is marked as neither concordant nor discordant in the calculation. An optional merge sort can be used to speed up the implementation. Details are in the wiki page.

      Although this implementation is not particularly complex it would be useful to have it in a consistent format in the commons math package in addition to existing correlation tests. Kendall's Tau is used effectively in comparing ranks for products, rankings from search engines or measurements from engineering equipment.

        Activity

        Hide
        Dev Lakhani added a comment -

        Initial feedback from Phil Steitz

        I think a Kendal's Tau implementation would make a great addition to
        the correlation package (o.a.c.math3.stat.correlation). Here is how
        you can get started:

        0) Get yourself set up to build commons math and run the unit
        tests. If you are familiar with maven, this should not be too
        hard. If you have any questions or run into problems checking out
        the sources, building locally, etc., don't hesitate to ask.
        1) Look at the Spearman's implementation and the ranking classes in
        the stat.ranking package. That might give you some ideas on how to
        implement Kendal's consistently.
        2) Open a JIRA ticket with the info above and start attaching
        patches implementing the new implementation class and associated
        test class. Run "mvn site" or checkstyle standalone to make sure
        your contributed code follows the style guidelines we use.
        3) Be patient but persistent and we will get Kendall's Tau into
        commons math

        Show
        Dev Lakhani added a comment - Initial feedback from Phil Steitz I think a Kendal's Tau implementation would make a great addition to the correlation package (o.a.c.math3.stat.correlation). Here is how you can get started: 0) Get yourself set up to build commons math and run the unit tests. If you are familiar with maven, this should not be too hard. If you have any questions or run into problems checking out the sources, building locally, etc., don't hesitate to ask. 1) Look at the Spearman's implementation and the ranking classes in the stat.ranking package. That might give you some ideas on how to implement Kendal's consistently. 2) Open a JIRA ticket with the info above and start attaching patches implementing the new implementation class and associated test class. Run "mvn site" or checkstyle standalone to make sure your contributed code follows the style guidelines we use. 3) Be patient but persistent and we will get Kendall's Tau into commons math
        Hide
        Matt Adereth added a comment -

        I've attached an implementation and tests of this using the algorithm described in William R. Knight's 1966 paper "A Computer Method for Calculating Kendall's Tau with Ungrouped Data" in the Journal of the American Statistical Association.

        I've never contributed before, so apologies if there's anything I should be doing differently. I have signed and submitted the contributer agreement.

        I'm looking forward to some feedback and hopefully getting this in... thanks!

        Show
        Matt Adereth added a comment - I've attached an implementation and tests of this using the algorithm described in William R. Knight's 1966 paper "A Computer Method for Calculating Kendall's Tau with Ungrouped Data" in the Journal of the American Statistical Association. I've never contributed before, so apologies if there's anything I should be doing differently. I have signed and submitted the contributer agreement. I'm looking forward to some feedback and hopefully getting this in... thanks!
        Hide
        Matt Adereth added a comment -

        Apologies in advance for not being consistent with SpearmansCorrelation. I couldn't bring myself to make these methods non-static or to use the approach of having the constructor take the data. If this is a problem, I have no issue making the requisite change.

        Another consideration that I didn't do, but would be willing to try, would be to make a new Correlation interface that exposes the methods that should be common to Spearmans, Kendalls, and Pearsons. It would probably make sense to then have an AbstractCorrelation base class that handles the sheparding of data between Matrix, double[][], and double[], double[].

        Finally, if you do think a Correlation interface makes sense, I'd also like to propose a NonParametricCorrelation interface for Spearmans and Kendalls which would have an additional method for computing the Correlation between two List<Comparable> objects.

        Show
        Matt Adereth added a comment - Apologies in advance for not being consistent with SpearmansCorrelation. I couldn't bring myself to make these methods non-static or to use the approach of having the constructor take the data. If this is a problem, I have no issue making the requisite change. Another consideration that I didn't do, but would be willing to try, would be to make a new Correlation interface that exposes the methods that should be common to Spearmans, Kendalls, and Pearsons. It would probably make sense to then have an AbstractCorrelation base class that handles the sheparding of data between Matrix, double[][], and double[], double[]. Finally, if you do think a Correlation interface makes sense, I'd also like to propose a NonParametricCorrelation interface for Spearmans and Kendalls which would have an additional method for computing the Correlation between two List<Comparable> objects.
        Hide
        Thomas Neidhart added a comment -

        Hi Matt,

        added your patch in r1537660 with some modifications:

        • use FastMath instead of Math
        • javadoc formatting and linewidth
        • use same class interface as other correlations
        • use existing Pair class instead of ComparablePair
        • simplify the correlation method to only support double[] atm (may be extended if needed)
        • added testcases for longley and swiss fertility data sets based on our correlation testsuite for R

        Thanks a lot for your contribution!

        The other points wrt a commons base class / interface are perfectly valid and I would be very much in favor. It should be fairly easy to introduce an abstract base class Correlation for the 3 implementations that we have right now.

        btw. we prefer abstract base classes over interfaces, as they are more or less the same as interfaces, but make it possible to extend without breaking compatibility.

        We should create a separate issue for this, feel free to work on this already and provide a patch if you are interested.

        Show
        Thomas Neidhart added a comment - Hi Matt, added your patch in r1537660 with some modifications: use FastMath instead of Math javadoc formatting and linewidth use same class interface as other correlations use existing Pair class instead of ComparablePair simplify the correlation method to only support double[] atm (may be extended if needed) added testcases for longley and swiss fertility data sets based on our correlation testsuite for R Thanks a lot for your contribution! The other points wrt a commons base class / interface are perfectly valid and I would be very much in favor. It should be fairly easy to introduce an abstract base class Correlation for the 3 implementations that we have right now. btw. we prefer abstract base classes over interfaces, as they are more or less the same as interfaces, but make it possible to extend without breaking compatibility. We should create a separate issue for this, feel free to work on this already and provide a patch if you are interested.
        Hide
        Thomas Neidhart added a comment -

        Resolving issue as there were no further comments.
        For the common Correlation interface / case class please create a new issue.

        Show
        Thomas Neidhart added a comment - Resolving issue as there were no further comments. For the common Correlation interface / case class please create a new issue.
        Hide
        Luc Maisonobe added a comment -

        Closing all resolved issue now available in released 3.3 version.

        Show
        Luc Maisonobe added a comment - Closing all resolved issue now available in released 3.3 version.

          People

          • Assignee:
            Phil Steitz
            Reporter:
            Dev Lakhani
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 840h
              840h
              Remaining:
              Remaining Estimate - 840h
              840h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development