Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.0
    • Labels:
      None

      Description

      The Kolmogorov-Smirnov test (see [1]) is used to test one sample against a known probability distribution function, or to test whether two samples are drawn from the same distribution. To evaluate the test statistic, the Kolmogorov-Smirnov distribution is used. Quite good asymptotics exist for the one-sided test, but the two-sided test is more difficult.

      [1]: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
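      As a quick illustration of the statistic under test (a sketch only; the class and method names are hypothetical, not part of any patch here), the two-sample statistic D = sup |F1(x) - F2(x)| over the two empirical CDFs can be computed by a merge-style sweep over the sorted samples:

```java
import java.util.Arrays;

public class KsStatistic {

    /** Two-sample Kolmogorov-Smirnov statistic D = sup |F1(x) - F2(x)|. */
    public static double dStatistic(double[] x, double[] y) {
        double[] a = x.clone();
        double[] b = y.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        int i = 0;
        int j = 0;
        double d = 0.0;
        // Sweep the pooled sample; the empirical CDFs only jump at data points.
        while (i < a.length && j < b.length) {
            double v = Math.min(a[i], b[j]);
            while (i < a.length && a[i] <= v) { i++; }
            while (j < b.length && b[j] <= v) { j++; }
            d = Math.max(d, Math.abs((double) i / a.length - (double) j / b.length));
        }
        return d;
    }

    public static void main(String[] args) {
        // Disjoint samples: the empirical CDFs separate completely, so D = 1.0.
        double[] s1 = {0.1, 0.2, 0.3, 0.4};
        double[] s2 = {0.5, 0.6, 0.7, 0.8};
        System.out.println(dStatistic(s1, s2)); // 1.0
    }
}
```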

      1. ks-distribution.patch
        9 kB
        Luc Maisonobe
      2. MATH437-with-test-take-1
        27 kB
        Mikkel Meyer Andersen

        Activity

        Phil Steitz added a comment -

        First cut added in r1572335.

        I don't want to hold up 3.3 release for this, but pre- or post-3.3 still need to:

        1. Update user guide
        2. Add static methods to TestUtils

        Luc Maisonobe added a comment -

        Sure, we can wait for this issue to be fixed before releasing 3.3 if you think it can be done soon.

        Phil Steitz added a comment -

        I have the first part of this - the new class and deprecation of the old - just about ready to commit. I would like to slide this into 3.3 if I can have until the end of this week to finish it.

        Luc Maisonobe added a comment -

        Postponing to 4.0, as per previous comments.

        Luc Maisonobe added a comment -

        +1 to postpone to 4.0.

        I have no opinion about the usefulness of the distribution itself, so if you think it would be worth removing this part and having only the test, this would be fine with me.

        Phil Steitz added a comment -

        I think we should bump this to 4.0 or at least 3.3. It was probably a mistake to put K-S in the distribution package. The K-S distribution itself is of little practical use (to my knowledge at least); I have never seen it used for anything but performing K-S tests. It is tricky enough to compute the distribution function itself with any kind of numerical stability, as the comments above and the literature around K-S tests confirm. Computing moments is, as the reference where Luc (resourcefully!) found test data states, "intractable." I think it may be best to steer clear of this and focus on just getting a good implementation of the test itself, which should move to .inference. I would prefer to do a little more research, though, to decide how best to set up the API and implementation for the test. It could be that we would be better off not using the cdfs in the current impl, instead using a beta approximation to compute p-values as in [1]. Note also that, since the discussion above and the initial implementation, Simard has published [2] with some empirical findings on how the various K-S approximation methods perform.

        So to summarize, I think the first step is to agree on the K-S test API. Then deprecate the class in .distribution and move the test class to .inference.

        [1] http://www.ism.ac.jp/editsec/aism/pdf/054_3_0577.pdf
        [2] http://www.jstatsoft.org/v39/i11/paper

        Phil Steitz added a comment -

        Pushing this to the top of my [math] list. The patch makes great use of our numerics, but I have to think there are better ways to compute these things. Also, we need to separate the test class, which should go into .stat.inference, from the distribution class. If this is the last bug holding up 3.2, feel free to bump to 4.0 or 3.3.

        Luc Maisonobe added a comment -

        Would the attached ks-distribution.patch be an acceptable way to implement the real distribution interface and solve the issue?

        Note that the implementation is really basic numerical computation using the cumulative probability as the reference. There are surely much better ways to do that. In particular, there seem to be exact formulas to compute the first moments (see the comments in the code and in the test case).

        I did not set up the test case to implement the abstract test shared by all distributions, but it would need to be done too.

        If the patch is not acceptable, could we push this issue to 4.0?
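        As a generic sketch of that kind of computation (illustrative only; the class and method names are hypothetical, not from the patch), a monotone cumulative probability function can be inverted numerically by bisection against the CDF as the reference:

```java
import java.util.function.DoubleUnaryOperator;

public class CdfInverter {

    /**
     * Invert a monotone CDF numerically by bisection: find x in [lo, hi]
     * with cdf(x) = p. This is the "basic numerical computation using the
     * cumulative probability as the reference" idea, as a standalone sketch.
     */
    public static double inverseCumulativeProbability(
            DoubleUnaryOperator cdf, double p, double lo, double hi) {
        for (int iter = 0; iter < 200; iter++) { // bisect to machine precision
            double mid = 0.5 * (lo + hi);
            if (cdf.applyAsDouble(mid) < p) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        // Example with a known CDF: uniform on [0, 1], where F(x) = x.
        double x = inverseCumulativeProbability(t -> t, 0.25, 0.0, 1.0);
        System.out.println(x); // ≈ 0.25
    }
}
```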

        Phil Steitz added a comment -

        The implementation class is in the distribution package, but does not implement a distribution interface. We should either implement the missing methods or move this class (probably renamed) to .stat.inference. In either case, we should implement methods making it easy to set up and execute K-S tests.

        Mikkel Meyer Andersen added a comment -

        Fixed in revision 1083716. The impl class now mentions {@link KolmogorovSmirnovDistribution}, if that was what you meant, Phil? The power functionality from MATH-435 is now used and the private methods have been removed. Note that, as before, the tests are generated using src/test/R/KolmogorovSmirnovDistributionTestCases.R.

        Phil Steitz added a comment -

        +1 to commit as is, adding some algorithm notes to the class javadoc and the MATH-435 power impl. I am ambivalent on whether or not to "fix" the error in Marsaglia's code that is apparently included in R. Having the verification tests is good, though, so I would leave them as is in the patch, since the Marsaglia C impl can be seen as a reference in this case. I can see the other side of the argument here, though, and would be fine with just going with the fixed code, suitably documented. What do others think about this?

        It looks like you forgot to add the references to the class javadoc for the impl class.

        Per comment on MATH-435, I think we should add the matrix power impl there and use it here.

        Mikkel Meyer Andersen added a comment -

        A proposal based mainly on the method by Marsaglia et al., optimised for extreme values as described by Simard et al. This method delivers results consistent with R on the test cases tried.

        Mikkel Meyer Andersen added a comment -

        In the past months, I've communicated with both Richard Simard and George Marsaglia regarding a small disagreement between the theory in Marsaglia's article and the actual implementation; namely, the article requires 0 <= h < 1, but the code allows 0 < h <= 1. I wrote to Marsaglia about this, and his answer was:

        The Kolmogorov distribution comes from a piecewise polynomial in h with knots at 1/2n, 2/2n,...,(2n-1)/2n, with each segment assumed to start with h=0. Although I emphasized that 0<= h <1 in the article, I overlooked the need for ensuring that in the C code, and apparently so did my colleagues. Sorry about that.

        This means that his code has to be changed slightly to ensure that 0 <= h < 1. Simard argues that this shouldn't matter because the K-S distribution is continuous, but if one wants to correct it, one should

        Instead of taking the floor(n*d + 1) and making this correction for h = 1, take the ceiling (n*d).

        I would prefer using ceiling(n*d) instead of the originally (wrongly) proposed floor(n*d + 1), despite the continuity argument. So my plan is to do this (I still have my implementation, which seems to work quite well). The only problem is that R seems to use Marsaglia's code, and I don't have access to e.g. Mathematica, which should implement several algorithms, so I might run into problems when I have to perform tests.
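        The correction can be sketched as follows (illustrative code, not the patch itself; the class name is hypothetical). With k = ceiling(n*d) and h = k - n*d, we get 0 <= h < 1 as the article requires, whereas k = floor(n*d + 1) yields 0 < h <= 1:

```java
public class MarsagliaH {

    /**
     * Compute h for Marsaglia's K(n, d) using k = ceiling(n*d), so that
     * h = k - n*d lies in [0, 1). The original C code effectively used
     * k = floor(n*d + 1), which gives h in (0, 1] instead.
     */
    public static double h(int n, double d) {
        int k = (int) Math.ceil(n * d); // ceiling(n*d), not floor(n*d) + 1
        return k - n * d;               // h in [0, 1)
    }

    public static void main(String[] args) {
        // When n*d lands exactly on a knot, ceiling gives h = 0 (as required),
        // while floor(n*d + 1) would give h = 1.
        System.out.println(h(200, 0.02));     // n*d = 4 exactly -> h = 0.0
        System.out.println(h(200, 0.031111)); // n*d = 6.2222   -> h ≈ 0.7778
    }
}
```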

        Richard Simard added a comment -

        If you used the same x in all 3 cases, I believe there is a bug in your exact and non-exact codes, because you get only 2 decimal digits of precision.

        Mikkel Meyer Andersen added a comment -

        Richard,
        Thanks for your knowledgeable comment. To quote you:

        "The argument x that you used in the Simard-L'écuyer program is not the same that you used for the other two programs."

        I'm not sure what you mean by that? Which argument should I then use to expect the same result?

        Richard Simard added a comment -

        http://www.mail-archive.com/issues@commons.apache.org/msg15829.html

        F(n, x) = F(200, 0.031111):
        Lecuyer (2.0 ms.) = 0.012916146481628863
        KolmogorovSmirnovDistribution exact (51902.0 ms.) = 0.012149763742041911
        KolmogorovSmirnovDistribution !exact (9.0 ms.) = 0.012149763742041922

        The argument x that you used in the Simard-L'écuyer program is not the same that you used for the other two programs. Of course you then get
        very different results. If I compute exactly in Mathematica, I obtain

        F(200, 0.031111) = 0.0129161464816289

        which is very different than your exact results above and agrees well with our program.

        =================================================
        Richard Simard <simardr@iro.umontreal.ca>
        Laboratoire de simulation et d'optimisation
        Université de Montréal, IRO

        Mikkel Meyer Andersen added a comment -

        The last part of roundedK can be replaced with

            double pFrac = Hpower.getEntry(k - 2, k - 2);
            for (int i = 1; i <= n; ++i) {
                pFrac *= (double) i / (double) n;
            }
            return pFrac;

        to get even better running time and still precise results:

            F(n, x) = F(200, 0.02):
            Lecuyer (3.0 ms.) = 5.151982014280042E-6
            KolmogorovSmirnovDistribution exact (760.0 ms.) = 5.15198201428005E-6
            KolmogorovSmirnovDistribution !exact (16.0 ms.) = 5.151982014280049E-6
            -------------------------
            F(n, x) = F(200, 0.031111):
            Lecuyer (2.0 ms.) = 0.012916146481628863
            KolmogorovSmirnovDistribution exact (51902.0 ms.) = 0.012149763742041911
            KolmogorovSmirnovDistribution !exact (9.0 ms.) = 0.012149763742041922
            -------------------------
            F(n, x) = F(200, 0.04):
            Lecuyer (0.0 ms.) = 0.1067121882956352
            KolmogorovSmirnovDistribution exact (5903.0 ms.) = 0.10671370113626812
            KolmogorovSmirnovDistribution !exact (6.0 ms.) = 0.10671370113626813
            -------------------------
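        For context, that pFrac loop computes the scaling factor n!/n^n; interleaving the multiplications keeps intermediate values in range, since n! alone overflows a double for n > 170 while the ratio stays tiny. A self-contained sketch (hypothetical class name):

```java
public class FactorialScale {

    /**
     * Compute n!/n^n by interleaving multiplications: each step multiplies
     * by i/n <= 1, so the running product never overflows, even though n!
     * by itself exceeds Double.MAX_VALUE for n > 170.
     */
    public static double factorialOverNtoN(int n) {
        double p = 1.0;
        for (int i = 1; i <= n; i++) {
            p *= (double) i / (double) n;
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(factorialOverNtoN(3));   // 3!/3^3 = 6/27 ≈ 0.2222
        System.out.println(factorialOverNtoN(200)); // finite and positive, no overflow
    }
}
```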

        Mikkel Meyer Andersen added a comment -

        An error has been corrected. The results are now consistent with those of R and of the implementation by Simard and L'Ecuyer.

        The implementation can be made faster, but then values of n larger than about 140 cannot be handled.

        Right now two efficient power functions are implemented because no matrix interface dictating add/multiply etc. exists.

        Mikkel Meyer Andersen added a comment -

        Implementation of evaluating the distribution of the two-sided test statistic.


          People

          • Assignee: Phil Steitz
          • Reporter: Mikkel Meyer Andersen
          • Votes: 0
          • Watchers: 3


              Time Tracking

              Estimated: 0.25h
              Remaining: 0.25h
              Logged: Not Specified
