MAHOUT-975: Bug in Gradient Machine - Computation of the gradient

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.7
    • Fix Version/s: None
    • Component/s: Classification
    • Labels: None

      Description

      The initialisation used to compute the gradient-descent weight updates for the output units appears to be wrong:

      In the comment: "dy / dw is just w since y = x' * w + b."
      This is wrong: dy/dw is x (ignoring the indices). The same initialisation is used in the code.

      Check using neural-network terminology:

      The gradient machine is a specialised version of a multi-layer perceptron (MLP).
      In an MLP, the gradient for computing the "weight change" for the output units is:

      dE / dw_ij = dE / dz_i * dz_i / dw_ij with z_i = sum_j (w_ij * a_j)
      here: i is an index over the output layer; j an index over the hidden layer
      (d stands for the partial derivative)

      here: z_i = a_i (no squashing in the output layer),
      and the special loss (cost function) is E = 1 - a_g + a_b = 1 - z_g + z_b
      with
      g: index of the output unit with target value +1 (positive class)
      b: index of a random output unit with target value 0

      =>

      dE / dw_gj = dE/dz_g * dz_g/dw_gj = -1 * a_j (a_j: activity of hidden unit j)
      dE / dw_bj = dE/dz_b * dz_b/dw_bj = +1 * a_j

      This is the same result the comment would give if it were correct:
      dy/dw = x (where x is the activation of the hidden unit), multiplied by -1 for the weights
      to the output unit with target value +1.
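
      For illustration only, a minimal plain-Java sketch of this corrected update (not the Mahout
      code; the array names, the good/bad indices g and b, and the learning rate eta are assumptions):

      /** Minimal sketch: gradient step on the hidden-to-output weights for E = 1 - z_g + z_b. */
      final class RankingUpdateSketch {
        // w[i][j]: hidden-to-output weights, hidden[j]: hidden activations,
        // g/b: indices of the "good" and "bad" output units, eta: assumed learning rate.
        static void updateOutputWeights(double[][] w, double[] hidden, int g, int b, double eta) {
          for (int j = 0; j < hidden.length; j++) {
            w[g][j] -= eta * (-hidden[j]); // dE/dw_gj = -a_j, so the good weight grows
            w[b][j] -= eta * (+hidden[j]); // dE/dw_bj = +a_j, so the bad weight shrinks
          }
        }
      }

      Note that the update uses the hidden activations a_j (the "x" above), not the weights themselves.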

      ------------

      In neural network implementations it is common to check the gradient numerically as a test of
      the implementation. This can be done by:
      dE/dw_ij = (E(w_ij + epsilon) - E(w_ij - epsilon)) / (2 * epsilon)
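
      A minimal sketch of such a central-difference check (plain Java, not tied to the Mahout API;
      the loss function passed in by the caller is an assumption):

      import java.util.function.ToDoubleFunction;

      /** Minimal sketch of a central-difference gradient check on a single weight. */
      final class GradientCheckSketch {
        // loss: evaluates E for a given weight vector on a fixed example (assumed supplied by the caller).
        static double numericalGradient(ToDoubleFunction<double[]> loss, double[] w, int index, double eps) {
          double saved = w[index];
          w[index] = saved + eps;
          double ePlus = loss.applyAsDouble(w);
          w[index] = saved - eps;
          double eMinus = loss.applyAsDouble(w);
          w[index] = saved; // restore the original weight
          return (ePlus - eMinus) / (2 * eps);
        }
      }

      The result can be compared with the analytic dE/dw_ij; with eps around 1e-4 in double precision,
      the relative difference should be very small for a correct implementation.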

      1. GradientMachine.patch
        3 kB
        Christian Herta
      2. MAHOUT-975.patch
        3 kB
        Suneel Marthi
      3. GradientMachine2.java
        12 kB
        Ted Dunning

          Activity

          Suneel Marthi added a comment -

          The new MLP implementation renders this JIRA obsolete.

          Ted Dunning added a comment -

          The ability to pass in a collection of good labels is important to the generality of the Gradient Machine. The limitation to a single category is a limitation of the current Mahout classification API.

          Yexi Jiang added a comment -

          The size of goodLabels in updateRanking is always 1, so it seems there is no need to use a loop.
          Also, the existing test case does not pass; an ArrayIndexOutOfBoundsException is thrown.

          Ted Dunning added a comment -

          1) The GradientMachine is a special case of MultiLayerPerceptron (MLP) that contains only 1 hidden layer. Is it necessary to have it if the MultiLayerPerceptron is in the plan?

          The Gradient Machine differs in a few ways.

          First, training is training to rank rather than training to regress. This decreases the number of output weights that need to be updated and allows several clever uses for multi-class problems.

          2) The hiddenToOutput seems incorrect. The squashing (activation) function should also be applied to the output layer (see [1][2][3][4]). Therefore, the range of the output for each node (neuron) in the output layer is (0, 1) if the sigmoid function is used, or (-1, 1) if the tanh function is used.

          The gradient machine is normally used in such a way that only the ranking of the outputs matters. Applying the sigmoid to the output layer doesn't change that ranking, so it is commonly not done.

          3) There are several problems with the training method. In updateRanking, I don't know which weight update strategy is used; it claims to be back-propagation, but it is not implemented that way.
          3.1) It seems that only part of the output weights are updated (the weights associated with the good output node and the weights associated with the worst output node; again, this is OK for a two-class problem).
          For back-propagation, all the weights between the last hidden layer and the output layer should be updated. So, did the original designer intentionally design it like that, and can they guarantee its correctness?
          In back-propagation, the delta of each node should be calculated first, and the weight of each node adjusted based on the corresponding delta. However, in the implemented code,
          3.2) The GradientMachine (and the MLP) can actually also be used for regression and prediction. The 'train' method of OnlineLearner restricts its power.

          The learning algorithm appears to be nearly correct. This algorithm is learning to rank rather than regressing. The training method as it stands limits the algorithm to use with 1 of n classification problems, but the internal code can handle a variety of other problems.

          4) The corresponding test case is not enough to test the correctness of the implementation.

          Very true and needs correction.

          5) If all the previous problems have been fixed, it is time to consider the necessity of a map-reduce version of the algorithm.

          Yes. Also true.

          Yexi Jiang added a comment -

          There are multiple problems (not only bugs) with the GradientMachine (based on Ted's revised version). If there is no time to pay attention to this issue now, please ignore it until next week (when 0.8 is released).

          1) The GradientMachine is a special case of MultiLayerPerceptron (MLP) that contains only 1 hidden layer. Is it necessary to have it if the MultiLayerPerceptron is in the plan?

          2) The hiddenToOutput seems incorrect. The squashing (activation) function should also be applied to the output layer (see [1][2][3][4]). Therefore, the range of the output for each node (neuron) in the output layer is (0, 1) if the sigmoid function is used, or (-1, 1) if the tanh function is used.

          3) There are several problems with the training method. In updateRanking, I don't know which weight update strategy is used; it claims to be back-propagation, but it is not implemented that way.

          3.1) It seems that only part of the output weights are updated (the weights associated with the good output node and the weights associated with the worst output node; again, this is OK for a two-class problem).
          For back-propagation, all the weights between the last hidden layer and the output layer should be updated. So, did the original designer intentionally design it like that, and can they guarantee its correctness?

          In back-propagation, the delta of each node should be calculated first, and the weight of each node adjusted based on the corresponding delta (a minimal sketch of this standard step follows after the references below). However, in the implemented code,

          3.2) The GradientMachine (and the MLP) can actually also be used for regression and prediction. The 'train' method of OnlineLearner restricts its power.

          4) The corresponding test case is not enough to test the correctness of the implementation.

          5) If all the previous problems have been fixed, it is time to consider the necessity of a map-reduce version of the algorithm.

          References:
          [1] Tom Mitchell. Machine Learning. Chapter 4.
          [2] Jiawei Han. Data Mining: Concepts and Techniques. Chapter 6.
          [3] Stanford Unsupervised Feature Learning and Deep Learning tutorial. http://ufldl.stanford.edu/wiki/index.php/Neural_Networks. Section "Neural Network".
          [4] Christopher Bishop. Neural Networks for Pattern Recognition. Chapter 4.
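
          For comparison, a minimal plain-Java sketch (illustrative only, not the Mahout code; all names
          here are assumed) of the standard output-layer back-propagation step referred to in 3.1, in which
          every hidden-to-output weight is updated:

          /** Minimal sketch of a back-propagation step for a sigmoid output layer (illustrative only). */
          final class BackpropSketch {
            // hidden[j]: activations of the last hidden layer, w[i][j]: hidden-to-output weights,
            // outputs[i]: a_i = sigmoid(z_i), dLossDOutput[i]: dE/da_i, eta: assumed learning rate.
            static void outputLayerStep(double[][] w, double[] hidden, double[] outputs,
                                        double[] dLossDOutput, double eta) {
              for (int i = 0; i < w.length; i++) {
                // delta_i = dE/da_i * a_i * (1 - a_i)   (sigmoid derivative)
                double delta = dLossDOutput[i] * outputs[i] * (1 - outputs[i]);
                for (int j = 0; j < hidden.length; j++) {
                  w[i][j] -= eta * delta * hidden[j];  // every output weight gets updated
                }
              }
            }
          }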

          Ted Dunning added a comment -

          If you see a significant, low-effort improvement, feel free. Otherwise, let's focus on stuff that needs doing.

          Yexi Jiang added a comment -

          Suneel Marthi Do I still need to work on this?

          Suneel Marthi added a comment -

          +1

          Agree about the cleanup and deferring this to backlog.

          Ted Dunning added a comment -

          How important is this to 0.8?

          The code is currently broken. Fixing it is going to take some days of effort. Nobody is using this right now.

          Why not put it into Backlog?

          Ted Dunning added a comment -

          Initial cleanup of code.

          Suneel Marthi added a comment (edited) -

          Ted Dunning Please do. I had 10 minutes to skim through this code and clean up the patch this morning; I agree with you that all of the Vector[] can be replaced by a matrix.

          Ted Dunning added a comment -

          I have taken a swipe at this.

          The base code is really ugly (it uses arrays of vectors where matrices should be used).

          It is also very clumsy in various computations, requiring many passes over the weights to do a single operation.

          All of this is independent of the mathematical errors noted earlier.

          I can upload a copy of my revised code if you like.

          Suneel Marthi added a comment -

          Uploading cleaned up patch.

          Yexi Jiang added a comment -

          Suneel Marthi OK, I will work directly on the latest version of the code.

          Suneel Marthi added a comment -

          Yexi Jiang Could you clean up the patch? This is an old patch and could be out of sync with the present codebase. I see the misspelling you mention, and also see that hiddenActivation has been misspelled as hiddenActivations (in the patch).

          Yexi Jiang added a comment (edited) -

          Suneel Marthi When I apply this patch, the source code cannot be compiled. One of the errors is that hiddenActivations cannot be resolved. Another error is that the class Functions.NEGATE is misspelled as Function.NEGATE.

          Suneel Marthi added a comment -

          Yexi Jiang Yes, this week is the deadline. I looked at the patch and it seemed simple, but it would definitely help to have a different set of eyes looking at it.

          Yexi Jiang added a comment -

          Suneel Marthi Sure, I'd like to try it. Is the deadline end of this week?

          Suneel Marthi added a comment -

          Yexi Jiang Would you like to take a stab at this patch and JIRA? This is needed for the Mahout 0.8 release. I liked your critique on MAHOUT-976, which I found interesting.

          Ted Dunning added a comment -

          Later in the week, yes.

          So far, I have been a helpless bystander, watching all this activity while incapable of responding or helping until now.

          Grant Ingersoll added a comment -

          Ted Dunning Any chance this is getting in this week?

          Sebastian Schelter added a comment -

          Ted Dunning, can you have a look at this?

          Hector Yee added a comment -

          If MLP is a more general implementation, why don't you just delete the Gradient Machine?

          Christian Herta added a comment -

          Sorry for the singular/plural problem. I thought I had compiled the code before I produced the patch. I was not familiar with producing patches for Mahout.

          I only studied the code theoretically, because I was interested in whether there was a multi-layer perceptron (MLP) implementation in Mahout. But I am quite sure that the calculation of the gradient is not correct (see the bug description).

          Because the gradient machine is a very specialised version of an MLP, I started a more general implementation: MAHOUT-976. The SGD version of the MLP seems to be quite stable, but at the moment I only use it to learn an autoencoder.

          I am going on holiday for 2 weeks without a computer. Afterwards I will compare the bug fix of the Gradient Machine against the original version on training/test data.

          Lance Norskog added a comment -

          The newest patch does not compile against the trunk. There is a singular/plural problem with one of the variables.

          I have tested this with the SGD classification example/bin/classify-20newsgroups.sh. The total accuracy dropped from 71% to 62%. The SGD example for Apache emails (subset of commons vs. cocoon) does not work well, so I can't evaluate it with that.

          Can you suggest a public dataset where this change works better than the trunk?

          Christian Herta added a comment -

          There is a patch attached to MAHOUT-975.

          The code has changed only a little. I have also restructured it and added some comments.

          Christian Herta added a comment -

          Patch for Mahout-975


            People

            • Assignee: Ted Dunning
            • Reporter: Christian Herta
            • Votes: 0
            • Watchers: 6
