Details

Type: Bug

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: 2.0

Fix Version/s: 2.1

Labels:None

Environment:
Java 1.6 on Vista
Description
I have been comparing LoessInterpolator.smooth output with the loessFit output from R (Rproject.org, probably the most widely used loess implementation) and have had strangely different numbers. I have created a small set to test the difference and something seems to be wrong with the smooth method but I do no know what and I do not understand the code.
Example 1
xinput:  1.5  3.0  6  8  12  13  22  24  28  31 
yinput:  3.1  6.1  3.1  2.1  1.4  5.1  5.1  6.1  7.1  7.2 
Output LoessInterpolator.smooth():  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
Output from loessFit() from R:  3.191178027520974  3.0407201231474037  2.7089538903778636  2.7450823274490297  4.388011000549519  4.60078952381848  5.2988217587114805  5.867536388457898  6.7797794777879705  7.444888598397342 
Example 2 (same xvalues, yvalues just floored)
xinput:  1.5  3.0  6  8  12  13  22  24  28  31 
yinput:  3  6  3  2  1  5  5  6  7  7 
Output LoessInterpolator.smooth():  3  6  3  2  0.9999999999999005  5.0000000000001705  5  5.999999999999972  7  6.999999999999967 
Output from loessFit() from R:  3.091423927353068  2.9411521572524237  2.60967950675505  2.7421759322272248  4.382996912300442  4.646774316632562  5.225153658563424  5.768301917477015  6.637079139313073  7.270482144410326 
As you see the output is practically the replicated yinput.
At this point this funtionality is critical for us but I could not find any other suitable javaimplementation. Help. Maybe this strange behaviour gives someone a clue?

 math296test.patch
 2 kB
 Luc Maisonobe

 math296.patch
 13 kB
 Eugene Kirpichov

 MATH296.2.patch
 7 kB
 Eugene Kirpichov
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
My settings are 0.3 and 4 (these are used as defaults in R).
If I use the standard values I still get:
3.100000000000003  6.099999999999999  3.0999999999999996  NaN  NaN  NaN  NaN  NaN  7.1  7.85 
Here is my test code for a quick look:
public void testLoessInterpolator() throws MathException{
dataX = new double[10];
dataX[0] = 1.5;
dataX[1] = 3.0;
dataX[2] = 6;
dataX[3] = 8;
dataX[4] = 12;
dataX[5] = 13;
dataX[6] = 22;
dataX[7] = 24;
dataX[8] = 28;
dataX[9] = 31;
dataY = new double[10];
dataY[0] = 3.1;
dataY[1] = 6.1;
dataY[2] = 3.1;
dataY[3] = 2.1;
dataY[4] = 1.4;
dataY[5] = 5.1;
dataY[6] = 5.1;
dataY[7] = 6.1;
dataY[8] = 7.1;
dataY[9] = 7.2;
LoessInterpolator li = new LoessInterpolator();
double[] ly = li.smooth(dataX, dataY);
for (int i = 0; i < ly.length; i++)
{ System.out.println(ly[i]); }
}
Is there any obvious error which I do not see? Is it possible (although highely unlikely) that some java floating point operations are broken on Vista64 (which I am using)?
Can you show your results?
OK. I reproduce exactly your results on a Linux computer with openJDK 1.6 on an AMD64 processor.
I'll have a look at this issue.
The math296test.patch is a set of two test cases that reproduce the problem.
Running the first test shows that at the end of iteration 0, the weights for points at indices 4 and 5 are set to 0. At iteration 1, when computing point i=4, the bandwidth is such that ileft=3 and iright=4. For k=3, we get w=0 because dist=4 and denom=1/4. For k=4 and k=5, we get w=0 because robustness[k]=0. So all w are 0, sumWeights=0 and since it is used in some divisions, NaN appears.
I think their should be some protection against this case somewhere. I'll ask the author of the original contribution about this.
Thanks for reporting the problem, Romeo. Actually, I did not compare the results of my implementation with those of R's: I purely implemented the algorithm described in the paper, and checked that the results are sensible. Unfortunately I will most probably not have time to debug the issue until weekend, but I can provide explanations to you about the code in case the issue is so critical for you that you urge to fix it yourself
Eugene, that is great news. I already tried to fix it but did not get very far.
First of all, please ignore the sample values that are supposed to be from loessFit in the example above, they are not. I have reviewed my code and I can not reproduce those values. Sorry for that. When I try the examples again I get hardly any changes except for the NaN values (perhaps due to the fact that the values are not very close together?). The testcases should be deleted/changed to reflect this I have compiled a more appropriate testcase:
xval  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0 
yval  0.47  0.48  0.55  0.56  0.08  0.04  0.07  0.07  0.56  0.46  0.56  0.52  3.03  3.08  3.09  3.04  3.54  3.46  3.36  3.35 
result  0.46076343001050907  0.49997135715509317  0.5429801876574236  0.3191000440696382  0.17824135321363765  0.03999999999999973  0.0699999999999999  0.21301261447145203  0.3260295715150817  0.45999999999999996  0.560000000000002  1.4961752602921834  2.1317826106042093  3.0799999999999965  3.089999999999987  0.6229499113916326  0.9970283360414918  3.450011252797152  3.3907474604481567  3.336774229999947 
loessFit  0.4609008580907899  0.4985267031275769  0.5408291873645029  0.308077939113221  0.1754074466314782  0.0419929152622151  0.07205683451999362  0.19661654144034335  0.31073415047540565  0.44615854234427305  0.5567051673879394  1.4972970376332615  2.1330233520442317  3.08  3.0900000000000043  0.6208045742447532  0.9823842010251099  3.449395357814414  3.389793487936696  3.3359749840089385 
weight(see below)  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0 
loessFit with weights  0.4780117093745368  0.4924876444111407  0.48398021996388496  0.3201446090891406  0.17880078999161195  0.003305611690841502  0.08829472923805592  0.20902549447499932  0.3267558674057103  0.45542226638134214  0.518369970805548  0.5365908808969353  1.492628917733731  2.1151751796193703  3.09  3.0400000000000005  2.9899999999999998  0.15500000000000005  1.7524999999999986  3.3499999999999996 
The results do not differ very much and there are no NaN, if the values are close together. Can this be explained by a slightly different algorithm in R?
For easy copy and paste:
double[] yval=
;
double[] xval=
;
double[] weights =
;
LoessInterpolator li = new LoessInterpolator(0.3,4);
result = li.smooth(xval, yval);
Another related thing. A very common use of the Rfunction is with weights (that is how we use it as well):
loessFit(yValues, xValues, weights)
where weights are the values 0 or 1, I dont know if some are using weights between 0 and 1
If a spot has weight 0 it still gets loessfitted but does not affect his neighbours. This way you can specifiy more relevant points and have the complete data set interpolate around these.
In your implementation you use robustnessWeights and set them initially to 1. Do I understand this correctly that these are actually the same weights as the ones mentioned above?
Would it be possible to add a function with a third parameter and have the weights provided by the user? I tried just passing them trough as a parameter instead of initializing with 1 but it did not do the trick.
Romeo,
Here is a patch that implements the weighted loess fit (the results agree reasonably with R's implementation).
Also, the patch fixes some numeric instability by introducing an 'accuracy' parameter.
Luc,
Could you please review the patch and probably apply it?
solved in subversion repository as of r818942.
patch applied with slight changes
thanks for the report and thanks for the patch
Eugene, Luc, thanks. Great news. I have been testing this with some bigger reallife datasets (~800 points) and the results seem to be very close to R).
There is one more problem though but I think I have already found the fix. You can still get NaNresults.
The reason is that sumWeights can become 0 and there are a couple of divisions with sumWeights. This multiplies to the degree that the complete result set is an array of NaN.
I think (but am not sure) that this happens if you have a long series of 0weights which is bigger then the amount of points you get through the bandwidth.
I have just added the following lines after these sumWeights divisions:
after: double meanXSquared = sumXSquared / sumWeights;
if(sumWeights==0)
{ meanX=0; meanY=0; meanXY=0; meanXSquared=0; }That did the trick for me and the results seem accurate. I will test more during the next week with even more datasets. I hope I do not make any fundamental thinking error.
If you think this is correct please add it to the code.
Romeo, I am not sure about whether this behavior should be considered correct.
It essentially means that there is no data inside an approximation window at all, so why should we make the data appear to be zero?
How about a variant "express the approximation window in terms of the number of points with nonzero weights"?
Sorry, maybe I was a little unclear in my explanation.
Actually there is data, there are just no weights in the approximation window. Lets say we have 300 point and the first 100 are weighted 0, the others are weighted 1, they are still valid data points. Being weighted 0 does not mean the data point can be deleted, it just means it does not affect his neighbours in terms of normalization (that is how the concept of normalization weights was explained to me)
What happens currently is the first ones become NaN and their neighbours become NaN to the point that all points become NaN. This shouldn't happen.
But do not misunderstand me, I am just assuming that the reason is because of the bandwidth as my dataset has this structure, I am not really sure about it.
I have done some debugging on the code but there are really a lot of loops so I could not pinpoint when and why exactly this is starting to happen as it is not right from the beginning.
Romeo, have I understood you correctly that you'd like the data points with zero weights to not influence the approximation coefficients at all, but to still have computed approximation values for them?
Yes. I think that is correct. No influence on the approximation coefficients but still getting normalized (getting approximation values).
Here is an except from the userguide for the Rlimma package, which uses the loessFit function for some operations:
Any spot quality weights found in RG will be used in the normalization by default. This means for example that spots with zero weight (flagged out) will not influence the normalization of other spots. The use of spot quality weights will not however result in any spots being removed from the data object. Even spots with zero weight will be normalized and will appear in the output object, such spots will simply not have any influence on the other spots.
However I am not sure how to handle the bandwidth. Example: if you have a bandwidth of 0.3, how do you compute the relevant points?
1. Find the 30% in the complete set and the use only the weighted ones inside this set
2. Look at all weighted ones and use 30% of them
To me the first one sounds like the logical one but I am not sure.
Sorry for the delay again: I just forgot about the issue. Don't hesitate to ping me, I am not a very organized person.
Adding zeroweighted points must not change the result in nonzeroweighted points and the approximation coefficients.
Thus, zeroweighted points must not participate in computing the bandwidth, and generally in computing anything at all: that is the second approach.
However, seems like R takes the first one, because I've implemented the second one and results disagreed with R on the testcase above.
Are you OK with having a more logical implementation that disagrees with R?
Eugene's reasoning sounds correct to me. I am inclined to say we document carefully and resolve this issue. Any objections?
I agree, but would like to know what R produces when a large (read larger than bandwidth) chunk of zero weights occurs.
Coming back to this issue.
Eugene, as you said you implemented the second approach and people agree with your reasoning,
could you provide a patch using this second approach so we can apply it ?
Here is the patch.
second patch applied as of r903440
thanks for the patch
What are your settings for bandwidth and robustness iterations ?
I have tried your examples and the first one doesn't lead to NaN with the default settings for me.