Details

Type: New Feature

Status: Reopened

Priority: Minor

Resolution: Unresolved

Affects Version/s: None

Fix Version/s: 4.0

Labels: None
Description
The Kolmogorov-Smirnov test (see [1]) is used to test one sample against a known probability distribution function, or to test whether two samples are drawn from the same distribution. To evaluate the test statistic, the Kolmogorov-Smirnov distribution is used. Quite good asymptotics exist for the one-sided test, but the two-sided test is more difficult.
[1]: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
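For reference, the one-sample statistic D_n = sup_x |F_n(x) - F(x)| can be computed directly from the order statistics as D_n = max_i max(i/n - F(x_(i)), F(x_(i)) - (i-1)/n). A minimal sketch (the class name and the uniform reference CDF are illustrative assumptions, not part of the proposed patch):

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class KsStatisticSketch {

    /**
     * Computes D_n = sup_x |F_n(x) - F(x)| for a sample against a
     * reference CDF, using the order-statistic formula
     * D_n = max_i max(i/n - F(x_(i)), F(x_(i)) - (i-1)/n).
     */
    static double ksStatistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] x = sample.clone();
        Arrays.sort(x);
        int n = x.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(x[i]);
            // Empirical CDF jumps from i/n to (i+1)/n at x[i].
            d = Math.max(d, Math.max((i + 1.0) / n - f, f - (double) i / n));
        }
        return d;
    }

    public static void main(String[] args) {
        // Test against the uniform(0,1) CDF, F(x) = x.
        double[] sample = {0.1, 0.2, 0.5, 0.7, 0.9};
        System.out.println(ksStatistic(sample, t -> t));  // prints 0.2
    }
}
```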

 ksdistribution.patch
 9 kB
 Luc Maisonobe

 MATH-437-with-test-take-1
 27 kB
 Mikkel Meyer Andersen
Activity
Sure, we can wait for this issue to be fixed before releasing 3.3 if you think it can be done soon.
I have the first part of this (the new class and deprecation of the old) just about ready to commit. I would like to slide this into 3.3 if I can have until the end of this week to finish it.
Postponing to 4.0, as per previous comments.
+1 to postpone to 4.0.
I have no opinion about the usefulness of the distribution itself, so if you think it would be worth removing this part and having only the test, this would be fine with me.
I think we should bump this to 4.0 or at least 3.3. It was probably a mistake to put KS in the distribution package. The KS distribution itself is of little practical usefulness (to my knowledge at least). I have never seen it used for anything but performing KS tests. It is tricky enough to compute the distribution function itself with any kind of numerical stability, as the comments above and the literature around KS tests confirm. Computing moments is, as the reference where Luc (resourcefully!) found test data states, "intractable." I think it may be best to steer clear of this and focus on just getting a good implementation of the test itself, which should move to .inference. I would prefer to do a little more research, though, to decide how best to set up the API and implementation for the test. It could be that we would be better off not using the CDFs in the current impl, instead using a beta approximation to compute p-values as in [1]. Note also that since the discussion above and the initial implementation, Simard has published [2] with some empirical findings on how the various KS approximation methods perform.
So to summarize, I think the first step is to agree on the KS test API. Then deprecate the class in .distribution and move the test class to .inference.
[1] http://www.ism.ac.jp/editsec/aism/pdf/054_3_0577.pdf
[2] http://www.jstatsoft.org/v39/i11/paper
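As a concrete starting point for the .inference test API discussed above, here is a minimal sketch of the two-sample statistic D_{n,m} = sup_x |F_n(x) - G_m(x)|, computed by a merge walk over both sorted samples. The class and method names are my own illustration, not a committed API:

```java
import java.util.Arrays;

public class TwoSampleKsSketch {

    /**
     * Computes the two-sample KS statistic D_{n,m} = sup_x |F_n(x) - G_m(x)|
     * by walking both sorted samples and tracking the gap between the two
     * empirical CDFs at each distinct data value.
     */
    static double twoSampleD(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        double d = 0.0;
        while (i < n && j < m) {
            double z = Math.min(sx[i], sy[j]);
            // Advance past all points equal to z in either sample (handles ties).
            while (i < n && sx[i] <= z) i++;
            while (j < m && sy[j] <= z) j++;
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }

    public static void main(String[] args) {
        // Disjoint samples: the empirical CDFs separate completely, so D = 1.
        System.out.println(twoSampleD(new double[]{1, 2, 3, 4},
                                      new double[]{5, 6, 7, 8}));  // prints 1.0
    }
}
```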
Pushing this to the top of my [math] list. The patch makes great use of our numerics, but I have to think there are better ways to compute these things. Also, we need to separate the test class, which should go into .stat.inference, from the distribution class. If this is the last bug holding up 3.2, feel free to bump to 4.0 or 3.3.
Would the attached ksdistribution.patch be an acceptable way to implement the real distribution interface and solve the issue?
Note that the implementation is really basic numerical computation using the cumulative probability as the reference. There are surely much better ways to do this. In particular, there seem to be exact formulas to compute the first moments (see the comments in the code and in the test case).
I did not set up the test case to implement the abstract test shared by all distributions, but it would need to be done too.
If the patch is not acceptable, could we push this issue to 4.0?
The implementation class is in the distribution package, but does not implement a distribution interface. We should either implement the missing methods or move this class (probably renamed) to .stat.inference. In either case, we should implement methods making it easy to set up and execute KS tests.
Fixed in revision 1083716. The impl class now mentions {@link KolmogorovSmirnovDistribution}, if that was what you meant, Phil? The power functionality from MATH-435 is now used and the private methods have been removed. Note that, as before, the tests are made using src/test/R/KolmogorovSmirnovDistributionTestCases.R.
+1 to commit as is, adding some algorithm notes to the class javadoc and the MATH-435 power impl. I am ambivalent on whether or not to "fix" the error in Marsaglia's code that is apparently included in R. Having the verification tests is good, though, so I would leave them as is in the patch, since the Marsaglia C impl can be seen as a reference in this case. I can see the other side of the argument here, though, and would be fine with just going with the fixed code, suitably documented. What do others think about this?
It looks like you forgot to add the references to the class javadoc for the impl class.
Per the comment on MATH-435, I think we should add the matrix power impl there and use it here.
A proposal based mainly on the method by Marsaglia et al., optimised for extreme values as described by Simard et al. This method delivers results consistent with R on the test cases tried.
In the past months, I've communicated with both Richard Simard and George Marsaglia regarding a small disagreement between the theory in Marsaglia's article and the actual implementation; namely, the article requires 0 <= h < 1, but the code allows 0 < h <= 1. I wrote to Marsaglia about this, and his answer was:
The Kolmogorov distribution comes from a piecewise polynomial in h with knots at 1/2n, 2/2n, ..., (2n-1)/2n, with each segment assumed to start with h = 0. Although I emphasized that 0 <= h < 1 in the article, I overlooked the need for ensuring that in the C code, and apparently so did my colleagues. Sorry about that.
This means that his code has to be changed slightly to ensure that 0 <= h < 1. Simard argues that this shouldn't matter because the KS distribution is continuous, but if one wants to correct it, one should, instead of taking floor(n*d + 1) and making the correction for h = 1, take the ceiling ceil(n*d).
I would prefer using ceil(n*d) instead of the originally (wrongly) proposed floor(n*d + 1), despite the continuity argument. So my plan is to do this (I still have my implementation, which seems to work quite okay). The only problem is that R seems to use Marsaglia's code, and I don't have access to e.g. Mathematica, which should implement several algorithms, so I might run into problems when I have to perform tests.
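The proposed fix amounts to decomposing n*d as k - h with k = ceil(n*d), which guarantees 0 <= h < 1 even when n*d is integral. A small sketch of just that decomposition (the variable names follow Marsaglia's article; the class name is an illustration):

```java
public class MarsagliaKH {

    /**
     * Returns {k, h} such that n*d = k - h with k = ceil(n*d),
     * so that 0 <= h < 1 holds even when n*d is an integer.
     * The originally proposed floor(n*d) + 1 would yield h = 1 in that case.
     */
    static double[] decompose(int n, double d) {
        double nd = n * d;
        int k = (int) Math.ceil(nd);   // corrected: ceil(n*d), not floor(n*d) + 1
        double h = k - nd;             // 0 <= h < 1
        return new double[] {k, h};
    }

    public static void main(String[] args) {
        // n*d = 200 * 0.02 = 4 is integral here: ceil gives k = 4, h = 0,
        // whereas floor(n*d) + 1 would give k = 5, h = 1.
        double[] kh = decompose(200, 0.02);
        System.out.println("k = " + kh[0] + ", h = " + kh[1]);
    }
}
```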
If you used the same x in all 3 cases, I believe there is a bug in your exact and not-exact codes because you get only 2 decimal digits of precision.
Richard,
Thanks for your knowledgeable comment. To quote you:
"The argument x that you used in the SimardL'écuyer program is not the same that you used for the other two programs."
I'm not sure what you mean by that? Which argument should I then use to expect the same result?
http://www.mail-archive.com/issues@commons.apache.org/msg15829.html
F(n, x) = F(200, 0.031111):
Lecuyer (2.0 ms.) = 0.012916146481628863
KolmogorovSmirnovDistribution exact (51902.0 ms.) = 0.012149763742041911
KolmogorovSmirnovDistribution !exact (9.0 ms.) = 0.012149763742041922
The argument x that you used in the Simard-L'Ecuyer program is not the same that you used for the other two programs. Of course you then get very different results. If I compute exactly in Mathematica, I obtain
F(200, 0.031111) = 0.0129161464816289
which is very different than your exact results above and agrees well with our program.
=================================================
Richard Simard <simardr@iro.umontreal.ca>
Laboratoire de simulation et d'optimisation
Université de Montréal, IRO
The last part of roundedK can be replaced with
{{
double pFrac = Hpower.getEntry(k - 2, k - 2);
for (int i = 1; i <= n; ++i) {
    pFrac *= (double) i / (double) n;
}
return pFrac;
}}
to get even better running time and still precise results:
{{
F(n, x) = F(200, 0.02):
Lecuyer (3.0 ms.) = 5.151982014280042E-6
KolmogorovSmirnovDistribution exact (760.0 ms.) = 5.15198201428005E-6
KolmogorovSmirnovDistribution !exact (16.0 ms.) = 5.151982014280049E-6

F(n, x) = F(200, 0.031111):
Lecuyer (2.0 ms.) = 0.012916146481628863
KolmogorovSmirnovDistribution exact (51902.0 ms.) = 0.012149763742041911
KolmogorovSmirnovDistribution !exact (9.0 ms.) = 0.012149763742041922

F(n, x) = F(200, 0.04):
Lecuyer (0.0 ms.) = 0.1067121882956352
KolmogorovSmirnovDistribution exact (5903.0 ms.) = 0.10671370113626812
KolmogorovSmirnovDistribution !exact (6.0 ms.) = 0.10671370113626813

}}
An error was corrected. The results are now consistent with those of R and with the implementation by Simard and L'Ecuyer.
The implementation can be made faster, but then values of n larger than approximately 140 cannot be handled.
Right now two efficient power functions are implemented, because no matrix interface providing add/multiply etc. exists.
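For context, both power functions amount to exponentiation by squaring on a square matrix, which needs only O(log p) multiplications; once a shared matrix power impl lands (per MATH-435), both could delegate to it. A self-contained sketch under that assumption (the class and method names are illustrative, not the MATH-435 API):

```java
public class MatrixPowerSketch {

    /** Multiplies two n x n matrices with the classic triple loop. */
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i][k];
                for (int j = 0; j < n; j++) {
                    c[i][j] += aik * b[k][j];
                }
            }
        }
        return c;
    }

    /** Computes m^p by exponentiation by squaring: O(log p) multiplications. */
    static double[][] power(double[][] m, int p) {
        int n = m.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            result[i][i] = 1.0;        // start from the identity matrix
        }
        double[][] base = m;
        while (p > 0) {
            if ((p & 1) == 1) {        // multiply in base for each set bit of p
                result = multiply(result, base);
            }
            base = multiply(base, base);
            p >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Fibonacci matrix: [[1,1],[1,0]]^10 = [[F(11),F(10)],[F(10),F(9)]].
        double[][] m10 = power(new double[][]{{1, 1}, {1, 0}}, 10);
        System.out.println((long) m10[0][0]);  // prints 89
    }
}
```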
Implementation of evaluating distribution of two sided test statistics.
First cut added in r1572335.
I don't want to hold up the 3.3 release for this, but pre- or post-3.3 we still need to:
1. Update user guide
2. Add static methods to TestUtils