[STATISTICS-65] Reimplement the Kolmogorov-Smirnov Test - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.1
Component/s: inference
Labels:
None

difficulty-level:
Easy

Description

The inference module contains the Kolmogorov-Smirnov (KS) test. The two-sample test computes the D statistic as the supremum (effectively the maximum) difference between the empirical CDF of two samples:

D  = sup | F(x) - G(x) |  = max(D+, D-)
D+ = sup   F(x) - G(x)
D- = sup   G(x) - F(x)

Two-sided tests use D, one sided use D+ (greater than) or D- (less than).

The implementation in CM contains computation of the D value and p-values for D. Computation of D involves sorting the combined dataset of the two-samples (x_n and y_m) and moving D down 1/n if x is observed, otherwise moving up by 1/m if y is observed. The resulting D is the maximum deviation from 0, with the result in [0, 1]. (D+ is above 0; D- is below 0). This procedure does not work in the presence of ties as you do not know whether to adjust up or down, so the entire tied region is skipped and the cumulative effect on D is used.

The computation of the p-value is adjusted based on the presence of ties in the data. Currently if ties are present the exact p-value is noted to be invalid. The method then randomly adjusts the input values to eliminate ties so that a D value can be created and the exact p-value computed for the random sample D of all possible D. This behaviour occurs without a user choice. Adjustment of data uses small steps (a few ULP) in order to minimise any change to the ranking of x and y. But depending on the data a change to the ranking can occur, and it can occur outside the tied region as all data is subject to random jitter, not just the tied values. Effectively the method is randomly adjusting the input data and hoping that it will not change non-tied regions just to generate a random path through the tied region and a single sample for D.

Note the following observations:

Ties do not make the p-value computation incorrect; they make the D value computation a single sample of possible D values. The p-value computation is still exact but it is using an input D with a value that is potentially lower than it could be, depending on how ties are resolved.
Ties within the same sample will not change the result
Ties between the two samples may change the result
The distribution of D is discrete
The distribution of D in the absence of ties has PMF(D=0) = 0. If there are no ties you either go up or down. This will move away from 0 at least 1 step.

What is missing from the implementation is a detection of whether the ties matter. Any tied region can be traversed a number of different ways. If none of these possible combinations change the D value then the tied region is located where it does not matter, e.g.

[             4, 5, 6, 7, 8]
[ 0, 1, 2, 3,       6      ]

This square data n=m takes 4x, 2y before the single tie then 2y. If the tie resolves as (1x, 1y) or (1y, 1x) then the maximum D is unchanged.

A region of a and b ties for x and y will have binom(a+b, b) possible combinations. However the largest change to the D value is the two most extreme paths (ax, by) or (by, ax). Since the entire tied region is traversed in a single step this is trivial to compute as the number of steps in x and y is known. The implementation must then only track the D value when tied regions are skipped (current behaviour), or the most extreme D value through any tied regions. If the resulting max D is the same then none of the tied regions has an effect on D then the D value is unique among all possible D and the p-value is exact. If it is possible to traverse the tied regions and create a more extreme D then the p-value is not exact.

I suggest updating the implementation to:

Compute D
Compute the most extreme D value generated through tied regions
Return a result that indicates if the p-value is an estimate. Optionally the most extreme D statistic can be returned.

Note that the most extreme D value is informative if for example it is much larger than D. This will occur when the tied regions are long. The most extreme case is input of matching data where D=0 but the extreme D value is 1.

In the case where the tied regions can change the D value there are at least two options:

Use the bootstrap sampling method in the current CM implementation. This generates n random samples (sx_n and sy_m) from the combined distribution of x and y and D values for each. The p-value is the number of times the sample D value for sx_n and sy_m was larger than the D for x and y. This only applies if the combined sample is representative of the shared distribution so is for large sample sizes.
Walk n random paths through tied regions to create a distribution for D. Generate the average p value for all D in the sample. Note that if the number of the paths though the tied regions is small it is possible to enumerate them. The combinations of binom(a+b, b) can be efficiently enumerated using the Combinations class from the Numbers combinatorics module.

The option to generate an estimate for P using a sampled distribution for D should not be performed in the main KS test routine.

Attachments

Issue Links

is a child of

STATISTICS-62 Port o.a.c.math.stat.inference to a commons-statistics-inference module

Closed

Reimplement the Kolmogorov-Smirnov Test

Details

Description

Attachments

Issue Links

Activity

People

Dates