Commons Math / MATH-296

LoessInterpolator.smooth() not working correctly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.1
    • Labels:
      None
    • Environment:

      Java 1.6 on Vista

      Description

      I have been comparing the LoessInterpolator.smooth output with the loessFit output from R (R-project.org, probably the most widely used loess implementation) and have been getting strangely different numbers. I have created a small set to test the difference, and something seems to be wrong with the smooth method, but I do not know what and I do not understand the code.
      Example 1

      x-input: 1.5 3.0 6 8 12 13 22 24 28 31
      y-input: 3.1 6.1 3.1 2.1 1.4 5.1 5.1 6.1 7.1 7.2
      Output LoessInterpolator.smooth(): NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
      Output from loessFit() from R: 3.191178027520974 3.0407201231474037 2.7089538903778636 2.7450823274490297 4.388011000549519 4.60078952381848 5.2988217587114805 5.867536388457898 6.7797794777879705 7.444888598397342

      Example 2 (same x-values, y-values just floored)

      x-input: 1.5 3.0 6 8 12 13 22 24 28 31
      y-input: 3 6 3 2 1 5 5 6 7 7
      Output LoessInterpolator.smooth(): 3 6 3 2 0.9999999999999005 5.0000000000001705 5 5.999999999999972 7 6.999999999999967
      Output from loessFit() from R: 3.091423927353068 2.9411521572524237 2.60967950675505 2.7421759322272248 4.382996912300442 4.646774316632562 5.225153658563424 5.768301917477015 6.637079139313073 7.270482144410326

      As you can see, the output is practically just the replicated y-input.
      At this point this functionality is critical for us, but I could not find any other suitable Java implementation. Help. Maybe this strange behaviour gives someone a clue?
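
      For reference, a minimal sketch of the kind of call that produces the Example 1 output. The bandwidth of 0.3 and 4 robustness iterations are the settings mentioned in the comments below; the class name is only for illustration:

      import org.apache.commons.math.MathException;
      import org.apache.commons.math.analysis.interpolation.LoessInterpolator;

      public class Math296Repro {
          public static void main(String[] args) throws MathException {
              double[] x = { 1.5, 3.0, 6, 8, 12, 13, 22, 24, 28, 31 };
              double[] y = { 3.1, 6.1, 3.1, 2.1, 1.4, 5.1, 5.1, 6.1, 7.1, 7.2 };

              // bandwidth 0.3, 4 robustness iterations (the R defaults mentioned below)
              LoessInterpolator li = new LoessInterpolator(0.3, 4);
              double[] smoothed = li.smooth(x, y);   // comes back as all NaN on 2.0

              for (double v : smoothed) {
                  System.out.println(v);
              }
          }
      }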

      1. math-296-test.patch
        2 kB
        Luc Maisonobe
      2. math-296.patch
        13 kB
        Eugene Kirpichov
      3. MATH-296.2.patch
        7 kB
        Eugene Kirpichov

        Activity

        Luc Maisonobe added a comment -

        Second patch applied as of r903440.
        Thanks for the patch.

        Eugene Kirpichov added a comment -

        Here is the patch.

        Luc Maisonobe added a comment -

        Coming back to this issue.
        Eugene, as you said you implemented the second approach and people agree with your reasoning,
        could you provide a patch using this second approach so we can apply it?

        Luc Maisonobe added a comment -

        I agree, but I would like to know what R produces when a large (read: larger than the bandwidth) chunk of zero weights occurs.

        Phil Steitz added a comment -

        Eugene's reasoning sounds correct to me. I am inclined to say we document carefully and resolve this issue. Any objections?

        Eugene Kirpichov added a comment -

        Sorry for the delay again: I just forgot about the issue. Don't hesitate to ping me, I am not a very organized person.

        Adding zero-weighted points must not change the result in non-zero-weighted points and the approximation coefficients.
        Thus, zero-weighted points must not participate in computing the bandwidth, and generally in computing anything at all: that is the second approach.

        However, it seems R takes the first one, because I've implemented the second one and the results disagreed with R on the test case above.

        Are you OK with having a more logical implementation that disagrees with R?

        Romeo P added a comment - edited

        Yes. I think that is correct. No influence on the approximation coefficients but still getting normalized (getting approximation values).

        Here is an excerpt from the user guide for the R limma package, which uses the loessFit function for some operations:

        Any spot quality weights found in RG will be used in the normalization by default. This means for example that spots with zero weight (flagged out) will not influence the normalization of other spots. The use of spot quality weights will not however result in any spots being removed from the data object. Even spots with zero weight will be normalized and will appear in the output object, such spots will simply not have any influence on the other spots.

        However I am not sure how to handle the bandwidth. Example: if you have a bandwidth of 0.3, how do you compute the relevant points?
        1. Find the 30% in the complete set and the use only the weighted ones inside this set
        2. Look at all weighted ones and use 30% of them

        To me the first one sounds like the logical one, but I am not sure (see the sketch below).
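
        To make the two readings concrete, here is a rough sketch of how the number of points in the smoothing window could be chosen under each of them (hypothetical helper methods, not taken from the LoessInterpolator source):

        // 1. bandwidth is a share of ALL points; zero-weighted points stay in the
        //    window but contribute nothing to the local fit
        static int windowSizeOption1(double bandwidth, double[] weights) {
            return (int) Math.ceil(bandwidth * weights.length);
        }

        // 2. bandwidth is a share of the non-zero-weighted points only, so the window
        //    always holds the same proportion of points that actually carry weight
        static int windowSizeOption2(double bandwidth, double[] weights) {
            int nonZero = 0;
            for (double w : weights) {
                if (w != 0) {
                    nonZero++;
                }
            }
            return (int) Math.ceil(bandwidth * nonZero);
        }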

        Eugene Kirpichov added a comment -

        Romeo, have I understood you correctly that you'd like the data points with zero weights to not influence the approximation coefficients at all, but to still have computed approximation values for them?

        Romeo P added a comment - edited

        Sorry, maybe I was a little unclear in my explanation.
        Actually there is data, there are just no weights in the approximation window. Let's say we have 300 points and the first 100 are weighted 0 while the others are weighted 1; they are still valid data points. Being weighted 0 does not mean the data point can be deleted, it just means it does not affect its neighbours in terms of normalization (that is how the concept of normalization weights was explained to me).
        What happens currently is that the first ones become NaN, and their neighbours become NaN, to the point that all points become NaN. This shouldn't happen.
        But do not misunderstand me: I am just assuming the reason is the bandwidth because my dataset has this structure; I am not really sure about it.
        I have done some debugging on the code, but there are really a lot of loops, so I could not pinpoint when and why exactly this starts to happen, as it is not right from the beginning.

        Eugene Kirpichov added a comment -

        Romeo, I am not sure whether this behavior should be considered correct.
        It essentially means that there is no data inside an approximation window at all, so why should we make the data appear to be zero?
        How about a variant "express the approximation window in terms of the number of points with non-zero weights"?

        Romeo P added a comment -

        Eugene, Luc, thanks. Great news. I have been testing this with some bigger real-life datasets (~800 points) and the results seem to be very close to R's.
        There is one more problem though, but I think I have already found the fix. You can still get NaN results.
        The reason is that sumWeights can become 0 and there are a couple of divisions by sumWeights. This multiplies to the degree that the complete result set is an array of NaN.
        I think (but am not sure) that this happens if you have a long series of 0-weights that is bigger than the number of points you get through the bandwidth.

        I have just added the following guard after these sumWeights divisions, i.e. after "double meanXSquared = sumXSquared / sumWeights;":

        if (sumWeights == 0) {
            meanX = 0;
            meanY = 0;
            meanXY = 0;
            meanXSquared = 0;
        }

        That did the trick for me and the results seem accurate. I will test more during the next week with even more datasets. I hope I am not making any fundamental thinking error.

        If you think this is correct, please add it to the code.

        Luc Maisonobe added a comment -

        Solved in the subversion repository as of r818942.
        Patch applied with slight changes.
        Thanks for the report and thanks for the patch.

        Eugene Kirpichov added a comment -

        Romeo,
        Here is a patch that implements the weighted loess fit (the results agree reasonably with R's implementation).
        Also, the patch fixes some numeric instability by introducing an 'accuracy' parameter.

        Luc,
        Could you please review the patch and possibly apply it?
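
        For Romeo's use case, the call would then look roughly like this, using the xval/yval/weights from his test case further down. The three-argument smooth overload and the extra accuracy constructor argument are assumed from the patch description above; check the patch itself for the exact signatures, and the accuracy value here is only illustrative:

        double[] xval    = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
                             1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0 };
        double[] yval    = { 0.47, 0.48, 0.55, 0.56, -0.08, -0.04, -0.07, -0.07, -0.56, -0.46,
                             -0.56, -0.52, -3.03, -3.08, -3.09, -3.04, 3.54, 3.46, 3.36, 3.35 };
        double[] weights = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1 };

        // bandwidth, robustness iterations, and the new accuracy parameter
        LoessInterpolator li = new LoessInterpolator(0.3, 4, 1e-12);
        double[] fitted = li.smooth(xval, yval, weights);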

        Romeo P added a comment -

        Eugene, that is great news. I already tried to fix it but did not get very far.
        First of all, please ignore the sample values that are supposed to be from loessFit in the example above; they are not. I have reviewed my code and I cannot reproduce those values. Sorry for that. When I try the examples again I get hardly any changes except for the NaN values (perhaps because the values are not very close together?). The test cases should be deleted/changed to reflect this. I have compiled a more appropriate test case:

        xval 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
        yval 0.47 0.48 0.55 0.56 -0.08 -0.04 -0.07 -0.07 -0.56 -0.46 -0.56 -0.52 -3.03 -3.08 -3.09 -3.04 3.54 3.46 3.36 3.35
        result 0.46076343001050907 0.49997135715509317 0.5429801876574236 0.3191000440696382 0.17824135321363765 -0.03999999999999973 -0.0699999999999999 -0.21301261447145203 -0.3260295715150817 -0.45999999999999996 -0.560000000000002 -1.4961752602921834 -2.1317826106042093 -3.0799999999999965 -3.089999999999987 -0.6229499113916326 0.9970283360414918 3.450011252797152 3.3907474604481567 3.336774229999947
        loessFit 0.4609008580907899 0.4985267031275769 0.5408291873645029 0.308077939113221 0.1754074466314782 -0.0419929152622151 -0.07205683451999362 -0.19661654144034335 -0.31073415047540565 -0.44615854234427305 -0.5567051673879394 -1.4972970376332615 -2.1330233520442317 -3.08 -3.0900000000000043 -0.6208045742447532 0.9823842010251099 3.449395357814414 3.389793487936696 3.3359749840089385
        weight(see below) 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 1.0
        loessFit with weights 0.4780117093745368 0.4924876444111407 0.48398021996388496 0.3201446090891406 0.17880078999161195 -0.003305611690841502 -0.08829472923805592 -0.20902549447499932 -0.3267558674057103 -0.45542226638134214 -0.518369970805548 -0.5365908808969353 -1.492628917733731 -2.1151751796193703 -3.09 -3.0400000000000005 -2.9899999999999998 0.15500000000000005 1.7524999999999986 3.3499999999999996

        The results do not differ very much and there are no NaN values if the input values are close together. Can this be explained by a slightly different algorithm in R?

        For easy copy and paste:

        double[] yval = { 0.47, 0.48, 0.55, 0.56, -0.08, -0.04, -0.07, -0.07, -0.56, -0.46,
                          -0.56, -0.52, -3.03, -3.08, -3.09, -3.04, 3.54, 3.46, 3.36, 3.35 };
        double[] xval = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
                          1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0 };
        double[] weights = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1 };
        LoessInterpolator li = new LoessInterpolator(0.3, 4);
        double[] result = li.smooth(xval, yval);

        Another related thing. A very common use of the R function is with weights (that is how we use it as well):
        loessFit(yValues, xValues, weights)
        where the weights are the values 0 or 1; I don't know whether anyone uses weights between 0 and 1.
        If a spot has weight 0 it still gets loess-fitted but does not affect its neighbours. This way you can specify more relevant points and have the complete data set interpolate around these.

        In your implementation you use robustnessWeights and set them initially to 1. Do I understand correctly that these are actually the same weights as the ones mentioned above?
        Would it be possible to add a function with a third parameter and have the weights provided by the user? I tried just passing them through as a parameter instead of initializing with 1, but it did not do the trick.

        Eugene Kirpichov added a comment -

        Thanks for reporting the problem, Romeo. Actually, I did not compare the results of my implementation with those of R: I simply implemented the algorithm described in the paper and checked that the results are sensible. Unfortunately I will most probably not have time to debug the issue until the weekend, but I can provide explanations about the code in case the issue is so critical for you that you need to fix it yourself.

        Luc Maisonobe added a comment -

        The math-296-test.patch is a set of two test cases that reproduce the problem.

        Running the first test shows that at the end of iteration 0, the weights for points at indices 4 and 5 are set to 0. At iteration 1, when computing point i=4, the bandwidth is such that ileft=3 and iright=4. For k=3, we get w=0 because dist=4 and denom=1/4. For k=4 and k=5, we get w=0 because robustness[k]=0. So all w are 0, sumWeights=0 and since it is used in some divisions, NaN appears.

        I think there should be some protection against this case somewhere. I'll ask the author of the original contribution about this.
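
        To make the w=0 case explicit: the tricube kernel used by loess vanishes as soon as the scaled distance reaches 1, so dist = 4 with denom = 1/4 gives tricube(4 * 0.25) = tricube(1) = 0. A simplified sketch of the kernel, not the exact Commons Math code:

        // tricube(x) = (1 - |x|^3)^3 for |x| < 1, and 0 otherwise
        static double tricube(double x) {
            double ax = Math.abs(x);
            if (ax >= 1) {
                return 0; // everything at or beyond the scaled bandwidth gets weight 0
            }
            double tmp = 1 - ax * ax * ax;
            return tmp * tmp * tmp;
        }

        // With robustness[4] = robustness[5] = 0 as well, every w in the window is 0,
        // so sumWeights = 0 and the subsequent divisions produce NaN.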

        Luc Maisonobe added a comment -

        OK. I reproduce exactly your results on a Linux computer with openJDK 1.6 on an AMD64 processor.
        I'll have a look at this issue.

        Romeo P added a comment -

        My settings are 0.3 and 4 (these are used as defaults in R).
        If I use the standard values I still get:

        3.100000000000003 6.099999999999999 3.0999999999999996 NaN NaN NaN NaN NaN 7.1 7.85

        Here is my test code for a quick look:

        public void testLoessInterpolator() throws MathException {
            double[] dataX = new double[10];
            dataX[0] = 1.5;
            dataX[1] = 3.0;
            dataX[2] = 6;
            dataX[3] = 8;
            dataX[4] = 12;
            dataX[5] = 13;
            dataX[6] = 22;
            dataX[7] = 24;
            dataX[8] = 28;
            dataX[9] = 31;

            double[] dataY = new double[10];
            dataY[0] = 3.1;
            dataY[1] = 6.1;
            dataY[2] = 3.1;
            dataY[3] = 2.1;
            dataY[4] = 1.4;
            dataY[5] = 5.1;
            dataY[6] = 5.1;
            dataY[7] = 6.1;
            dataY[8] = 7.1;
            dataY[9] = 7.2;

            LoessInterpolator li = new LoessInterpolator();
            double[] ly = li.smooth(dataX, dataY);

            for (int i = 0; i < ly.length; i++) {
                System.out.println(ly[i]);
            }
        }

        Is there any obvious error which I do not see? Is it possible (although highly unlikely) that some Java floating-point operations are broken on Vista 64 (which I am using)?
        Can you show your results?

        Luc Maisonobe added a comment -

        What are your settings for bandwidth and robustness iterations?
        I have tried your examples and the first one doesn't lead to NaN with the default settings for me.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Romeo P
          • Votes:
            0
          • Watchers:
            2
