Details

      Description

      Implement a multilayer perceptron

      • via matrix multiplication
      • learning by backpropagation, implementing tricks from Yann LeCun et al.: "Efficient Backprop"
      • arbitrary number of hidden layers (including 0 - just the linear model)
      • connections between adjacent layers only
      • different cost and activation functions (a different activation function in each layer)
      • test of backprop by gradient checking (see the sketch after this list)
      • normalization of the inputs (storable) as part of the model
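
      A minimal sketch of such a gradient check (the class and method names below are hypothetical, not part of the patch): the analytic gradient from backprop is compared element-wise against a central finite difference of the cost.

      ------------------------------
      import org.apache.mahout.math.Vector;

      public final class GradientCheck {
        // Cost of the network on some fixed training data, as a function of the weights.
        public interface Cost {
          double at(Vector weights);
        }

        public static void check(Cost cost, Vector weights, Vector analyticGradient) {
          double eps = 1.0e-4;
          for (int i = 0; i < weights.size(); i++) {
            double w = weights.get(i);
            weights.set(i, w + eps);
            double costPlus = cost.at(weights);
            weights.set(i, w - eps);
            double costMinus = cost.at(weights);
            weights.set(i, w);  // restore the original weight
            double numeric = (costPlus - costMinus) / (2.0 * eps);
            if (Math.abs(numeric - analyticGradient.get(i)) > 1.0e-6) {
              throw new AssertionError("backprop gradient differs at weight " + i);
            }
          }
        }
      }
      ------------------------------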

      First:

      • implementation "stocastic gradient descent" like gradient machine
      • simple gradient descent incl. momentum

      Later (new jira issues):

      • Distributed Batch learning (see below)
      • "Stacked (Denoising) Autoencoder" - Feature Learning
      • advanced cost minimization such as 2nd-order methods, conjugate gradient, etc.

      Distribution of learning can be done as follows (batch learning):
      1. Partitioning the data into x chunks
      2. Learning the weight changes as matrices in each chunk
      3. Combining the matrices and updating the weights - then back to 2
      Maybe this procedure can also be run on random subsets of the chunks (distributed quasi-online learning).
      Batch learning with the delta-bar-delta heuristic for adapting the learning rates.
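
      A rough sketch of the chunk-and-combine step above, including a simple momentum update (DataChunk and gradientOnChunk() are placeholders for the partitioned training data and the per-chunk backprop step, not real Mahout APIs):

      ------------------------------
      import java.util.List;
      import org.apache.mahout.math.Matrix;

      public final class BatchTrainSketch {
        static Matrix batchTrain(Matrix weights, List<DataChunk> chunks,
                                 double learningRate, double momentum, int epochs) {
          Matrix velocity = null;  // momentum term
          for (int epoch = 0; epoch < epochs; epoch++) {
            Matrix batchGradient = null;
            for (DataChunk chunk : chunks) {                 // 1. data partitioned into chunks
              Matrix g = gradientOnChunk(weights, chunk);    // 2. weight changes per chunk
              batchGradient = (batchGradient == null) ? g : batchGradient.plus(g);
            }
            // 3. combine the matrices and update the weights, then back to step 2
            Matrix step = batchGradient.times(learningRate / chunks.size());
            velocity = (velocity == null) ? step.times(-1.0)
                                          : velocity.times(momentum).minus(step);
            weights = weights.plus(velocity);
          }
          return weights;
        }
      }
      ------------------------------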

      Attachments

      1. MAHOUT-976.patch (19 kB, Christian Herta)
      2. MAHOUT-976.patch (27 kB, Christian Herta)
      3. MAHOUT-976.patch (30 kB, Christian Herta)
      4. MAHOUT-976.patch (37 kB, Christian Herta)


          Activity

          Ted Dunning added a comment -

          Seems like a dupe to me. Yexi has incorporated the good bits.

          Suneel Marthi added a comment -

          Can this be marked as Duplicate of M-1265 since the code for M-1265 was committed to trunk?

          Yexi Jiang added a comment -

          MAHOUT-1265 is actually a new implementation of the MLP based on Ted's comments. For example, users can freely configure each layer by setting the number of neurons and the squashing function. Users can also choose the cost function and set parameters like the learning rate, momentum weight and so on.

          Suneel Marthi added a comment -

          Ted Dunning There is another JIRA (see MAHOUT-1265) for a Multilayer Perceptron with Yexi's patch. Can this be marked as 'Duplicate' and closed?

          Ted Dunning added a comment -

          Your critique sounds interesting so far.

          I haven't heard a good plan for parallelism in a map-reduce style.

          And you are free to build some code. You don't need any approval to develop it.

          Yexi Jiang added a comment -

          No feedback?

          Yexi Jiang added a comment -

          Hi,

          I read the source code from the patch files (all four versions) and have the following questions.

          1) It seems that the source code has not fully implemented the distributed MLP.

          Based on my understanding, the algorithm designer intends to make the implemented MLP generic enough that it can be used in both the single-machine and the distributed scenario.

          For the single-machine scenario, the user can easily reuse the algorithm by writing code similar to the test cases. But for the distributed version, the user has to implement a mapper to load the training data, then create an MLP instance inside the mapper and train it with the incoming data. Moreover, the user has to come up with a solution for merging the weight updates of all the mapper instances, which is not trivial.

          Therefore, it seems that the current implementation does no more than a single-machine version of the MLP.

          2) The dimension of the target Vector fed to trainOnline is always 1. This is because 'actual' is always an integer, and there is no post-processing to turn it into a per-class target vector.

          The following is the call sequence.
          train -> trainOnline -> getDerivativeOfTheCostWithoutRegularization -> getOutputDeltas -> AbstractVector.assign(Vector v, DoubleDoubleFunction f)

          The assign method checks whether v.size() equals this.size(). In the MLP scenario, that means checking whether the size of the output layer equals the size of the class label vector.

          And the following is the related code.
          ------------------------------
          public void train(long trackingKey, String groupKey, int actual, Vector instance) {
            // training with one pattern
            Vector target = new DenseVector(1);
            target.setQuick(0, (double) actual);
            trainOnline(instance, target);
          }
          ------------------------------

          The reason it passes the test cases is that the tests create the MLP with an output layer of size 1.

          So I'm wondering whether the argument list of train should be changed, or whether the argument 'actual' should be transformed internally (a possible transformation is sketched below).
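
          For illustration, one possible internal transformation could look like this (sketch only, not the patch's code; it assumes the output layer has numCategories() units):

          ------------------------------
          public void train(long trackingKey, String groupKey, int actual, Vector instance) {
            // expand the integer label into a one-hot target vector
            // whose size matches the output layer
            Vector target = new DenseVector(numCategories());
            target.setQuick(actual, 1.0);
            trainOnline(instance, target);
          }
          ------------------------------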

          I have implemented a BSP-based distributed MLP, and the code has already been committed to the Apache Hama machine learning package. The BSP version is not difficult to adapt to the MapReduce framework. If it is OK, I can change my existing code and contribute it to Mahout.

          Robin Anil added a comment -

          I see a few System.out.println() calls; please remove those. Also use the Mahout Eclipse code formatter to format the files. Christian Herta, will you be able to work on these quickly? I am pushing it off the 0.8 list. If you can work on it, please update it and we will review it.

          Robin Anil added a comment -

          Marking Ted as the assignee, who is the best reviewer for this.

          Sebastian Schelter added a comment -

          What's the status on this? Can we include it for 0.8?

          Christian Herta added a comment -

          Until now I have implemented only autoencoder code that is very specific to multilayer perceptrons: the weight changes for the "sparse autoencoder" constraint.
          I agree that it's a good idea to have a general autoencoder framework, e.g. for stacking autoencoders and for training autoencoders.
          When I have some time I will look at how to bring the RBM and MLP autoencoders together.

          Dirk Weißenborn added a comment (edited) -

          You can also take a look at the training itself in this patch, since it is actually also a batch learning algorithm. I also implemented a non-map/reduce-based approach using multiple threads. I think you can save a lot of time by reusing already-tested code, since it is pretty similar to this task.

          Dirk Weißenborn added a comment -

          I saw that you are also planning to implement autoencoders! If I understood them right, they work exactly like RBMs, except that they are not stochastically driven. The RBM implementation I mentioned already provides this functionality (through a method called setProbabilitiesAsActivation() in the Layer interface), and it is only necessary to write a training algorithm. I think it would be easier to take a look there before implementing everything again (MAHOUT-968).

          Christian Herta added a comment -
          • tests added (incl. gradient checking)
          • bug fixed
          • experimental sparse autoencoder (subclass of mlp - not tested)
          • minor improvements
          Christian Herta added a comment -
          • momentum term included
          • read-write mlp (completely untested)
          Christian Herta added a comment -

          Hello Ted,

          thanks for the hints. I will have a look at the algorithms.

          First, I would suggest implementing a simple distributed batch learning method. This can serve as a baseline for other procedures and algorithms.

          Before implementing a more sophisticated solution (state of the art in distributed learning for MLPs) I need to study the literature in detail.

          Further literature as a starting point:

          • "On Optimization Methods for Deep Learning" www.icml-2011.org/papers/210_icmlpaper.pdf
          • "Deep Learning via Hessian-free Optimization" www.icml2010.org/papers/458.pdf

          Cheers, Christian

          Ted Dunning added a comment -

          Also, John has had very good results in Vowpal Wabbit with an allreduce operation in his learning system. The way that this works is that he launches a map-only learning task which reads inputs repeatedly and propagates the gradient vector every pass over the data using an all-reduce operation. All reduce applies an associative aggregation to a data structure in a tree structure imposed on the network. The result of the aggregation is passed back down the tree to all nodes.

          This allows fast iteration of learning and could also speed up our k-means code massively. Typically, this improves speeds by about 2 orders of magnitude because the horrid costs of Hadoop job starts go away.

          Would you be interested in experimenting with this in your parallel implementation here?
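
          A toy, in-memory illustration of the allreduce idea (not Vowpal Wabbit's or Hadoop's actual API): partial sums flow up a binary tree laid out in an array, and the root's total is what every node would receive on the way back down.

          ------------------------------
          static double[] allReduce(double[][] localGradients) {
            int n = localGradients.length;
            double[][] partial = new double[n][];
            for (int i = 0; i < n; i++) {
              partial[i] = localGradients[i].clone();   // each node's local gradient
            }
            // reduce phase: child i adds its partial sum into parent (i - 1) / 2
            for (int i = n - 1; i > 0; i--) {
              int parent = (i - 1) / 2;
              for (int j = 0; j < partial[parent].length; j++) {
                partial[parent][j] += partial[i][j];
              }
            }
            // broadcast phase: the root's combined gradient is passed back down to all nodes
            return partial[0];
          }
          ------------------------------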

          Ted Dunning added a comment -

          Christian,

          Does confidence-weighted learning help with MLPs?

          Also, should we move forward with an L-BFGS implementation? I have heard from John Langford that it is very useful to start with stochastic gradient descent (aka back-prop) and use L-BFGS to finish off the learning. That same approach should work reasonably well with MLPs as well, although it may take a bit longer to get into the region where L-BFGS wins.

          Christian Herta added a comment (edited) -

          New patch available:

          • basic implementation
          • simple tests for propagation and learning by backprop
          Christian Herta added a comment (edited) -

          Patch MAHOUT-976: incomplete and completely untested; it should only compile.

          Christian Herta added a comment (edited) -

          The implementation of public Vector classifyFull(Vector r, Vector instance) in AbstractVectorClassifier assumes that the probabilities of the n elements of the output vector sum to 1. This is only valid if there are n mutually exclusive classes, e.g. for target vectors like (0 0 1 0), (0 0 0 1), (1 0 0 0), ...

          The other possibility is that there are n (here 4) independent target classes, e.g.: (1 0 0 1), (0 0 0 0), (0 1 1 1), (1 1 1 1), (0 0 1 0).
          In that case the method "Vector classify(..)" and the implementation "public Vector classifyFull(Vector r, Vector instance)" of AbstractVectorClassifier make no sense. Therefore using "Vector classify(..)" should throw an exception, and "Vector classifyFull" must be overridden.

          P.S.: Depending on the "flag", the cost function and the activation function for the output units will be set so that the outputs can be interpreted as probabilities, see e.g. C. Bishop: "Pattern Recognition and Machine Learning", chapter 5.2. This also simplifies the implementation, because the natural pairing between cost and activation function yields "y - t" for the output deltas (see the sketch below).
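
          For illustration (not the patch itself): with such a pairing (e.g. softmax outputs with cross-entropy for mutually exclusive classes, or logistic outputs with cross-entropy for independent classes) the output-layer deltas reduce to a simple difference, with no derivative of the squashing function needed at the output:

          ------------------------------
          // y: network output vector, t: target vector
          Vector outputDeltas = y.minus(t);   // delta = y - t
          ------------------------------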

          Ted Dunning added a comment -

          I would leave classify as it stands. If the user wants that sort of function, it can be very convenient. As you point out, classifyFull is probably more useful for lots of applications.

          Why do you need the flag? Why not just let the user decide how to use their model?

          Christian Herta added a comment -

          The AbstractVectorClassifier method classify(..) assumes that in general there are n mutually exclusive classes. This is also the standard behaviour of the convenience function classifyFull(..). For a multilayer perceptron this is not necessarily the case. In the current work-in-progress implementation this is configured in the constructor of the MLP by a boolean "mutuallyExclusiveClasses".

          I could override classifyFull and throw an UnsupportedOperationException if classify is used with "mutuallyExclusiveClasses = false". But I assume that would be confusing for the user.

          Is there a better solution?
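
          For illustration, the option described above might look like this (sketch only, not committed code; it assumes classifyFull is overridden by the MLP):

          ------------------------------
          @Override
          public Vector classify(Vector instance) {
            if (!mutuallyExclusiveClasses) {
              throw new UnsupportedOperationException(
                  "classify() assumes mutually exclusive classes; use classifyFull() instead");
            }
            // mutually exclusive case: per the AbstractVectorClassifier contract,
            // classify() returns the scores for categories 1..n-1
            return classifyFull(instance).viewPart(1, numCategories() - 1);
          }
          ------------------------------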

          Dirk Weißenborn added a comment -

          I already implemented backpropagation for special multilayer ANNs (deep Boltzmann machines). You can still use the layer implementations, which should (hopefully) provide everything you need for backprop, see MAHOUT-968. The backprop there is still naive, because it is not the most important part of the training, but it would still be nice to have an optimized backprop.

          Viktor Gal added a comment -

          Although it's not the same (but again an NN) and afaik the learning is sequential, it's worth checking out the restricted Boltzmann machine implementation that has just been submitted to MAHOUT-968.


            People

            • Assignee: Suneel Marthi
            • Reporter: Christian Herta
            • Votes: 0
            • Watchers: 9


                Time Tracking

                Original Estimate: 80h
                Remaining Estimate: 80h
                Time Spent: Not Specified
