MAHOUT-375

[GSOC] Restricted Boltzmann Machines in Apache Mahout

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Proposal Title: Restricted Boltzmann Machines in Apache Mahout (addresses issue MAHOUT-329)

      Student Name: Sisir Koppaka

      Student E-mail: sisir.koppaka@gmail.com

      Organization/Project:
      Assigned Mentor:

      Abstract
      This is a proposal to implement Restricted Boltzmann Machines in Apache Mahout as a part of Google Summer of Code 2010. The demo for the code would be built on the Netflix dataset.

      1 Introduction
      The Grand Prize solution to the Netflix Prize offered several new lessons in applying traditional machine learning techniques to very large-scale datasets. The most significant of these were the impact of temporal models, the remarkable contribution of RBMs to the overall model, and the great success of ensemble models in achieving superior predictions. The present proposal seeks to implement a conditional factored RBM [4] in Apache Mahout as a project under Google Summer of Code 2010.

      2 Background
      The Netflix dataset takes the form of a sparse N x M matrix of the ratings that N users assign to M movies. Matrix decompositions, such as variants of the Singular Value Decomposition (SVD), form the first family of methods applied. This has also motivated several recent works in applied mathematics relevant to the Netflix Prize, including [1, 2]. A second family of techniques comprises k-nearest-neighbour approaches (user-user and movie-movie) using different distance measures such as Pearson correlation and cosine similarity. The third set of techniques, which offers arguably the most divergent predictions and hence the greatest improvement in prediction RMSE, is the RBM and its variants.

      [4] describes the algorithm that the author proposes to implement this summer in Apache Mahout. Parallelization can be achieved by updating all the hidden units in parallel, followed by the visible units, owing to the conditional independence of the hidden units given a visible binary indicator matrix. Rather than implementing a naive RBM, the conditional factored RBM is chosen for its useful combination of effectiveness and speed. Minor variations could, in any case, be developed later with little difficulty.
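A minimal sketch of this alternating update, assuming a plain binary RBM (the class and method names here are illustrative, not Mahout's actual API): because the hidden units are conditionally independent given the visible layer, every hidden probability can be computed in parallel, and the visible units are then updated the same way in the opposite direction.

```java
import java.util.stream.IntStream;

// Illustrative sketch (not Mahout code): parallel hidden-unit update in a
// binary RBM, exploiting conditional independence of the hidden units.
public class RbmSketch {
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i * W[i][j]), computed for all j in parallel. */
    static double[] hiddenProbabilities(int[] visible, double[][] weights, double[] hiddenBias) {
        return IntStream.range(0, hiddenBias.length).parallel()
            .mapToDouble(j -> {
                double activation = hiddenBias[j];
                for (int i = 0; i < visible.length; i++) {
                    activation += visible[i] * weights[i][j];
                }
                return sigmoid(activation);
            })
            .toArray();
    }
}
```

A symmetric `visibleProbabilities` pass over the hidden sample would complete one step of alternating Gibbs sampling; the factored, conditional variant of [4] replaces the dense weight matrix with a low-rank product but keeps this same independence structure.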

      The training data set consists of nearly 100 million ratings from 480,000 users on 17,770 movie titles. As part of the training data, Netflix also provides validation data (called the probe set), containing nearly 1.4 million ratings. In addition to the training and validation data, Netflix provides a test set containing 2.8 million user/movie pairs (called the qualifying set) whose ratings were previously withheld, but have since been released following the conclusion of the Prize.

      3 Milestones

      3.1 April 26-May 24
      Community Bonding Period. Certain boilerplate code for the Netflix dataset exists at org.apache.mahout.cf.taste.example.netflix. However, this code is non-distributed and unrelated to Hadoop. Parts of it, such as the file read-in based on the Netflix format, will be reused to match the processed Netflix dataset file linked below.

      Test some of the already-implemented Mahout algorithms, such as SVD or k-Means, on the whole dataset to make sure everything works as advertised, and make a note of the testing time. If the testing time is very large, make a 10% training set and use the 10% probe set, which already exists as a standardized Netflix Prize community contribution. This is only so that iterations can be faster and a multi-node Hadoop installation need not always be required. Work on the map-reduce version of the RBM and evaluate whether parallelization beyond the alternating computation of hidden and visible units can be implemented. Get the community's approval for the map-reduce version of the RBM, and then proceed.

      3.2 May 24-July 12 Pre-midterm
      Implementation time! Write code, test code, rewrite code.
      Should have working code with decent predictions by end of this segment.

      Design details
      The RBM code would live at org.apache.mahout.classifier.rbm. A Classify.java would need to be written to support the RBM, similar to the one in discriminative. An equivalent of BayesFileFormatter.java would not be required because of the pre-written Netflix read-in code mentioned above. ConfusionMatrix.java, ResultAnalyzer.java and ClassifyResult.java would be reused as-is from discriminative.
      The algorithm package would contain the actual conditional factored RBM algorithm, and common would contain the code shared between files in algorithm. mapreduce.rbm would contain the driver, mapper and reducer for the parallelized updating of the hidden-unit layer followed by the visible units, with appropriately refactored code placed in mapreduce.common.

      The algorithm would be implemented generically, and the demo would be on the Netflix dataset.

      3.3 July 12-July 31 Post-midterm

      If testing was done on the 10% set, run multiple times on the whole dataset and ensure the results match. Test a two-method ensemble of SVD (already in Mahout) and the RBM, and confirm that the RBM contributes a complementary, near-orthogonal dimension to the problem. Make sure all unit tests are in place.
      Test on the Netflix dataset linked above and prepare for the demo.
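The two-method ensemble test above could, in minimal form, look like the following (an illustrative sketch, not project code): blend the SVD and RBM predictions with a convex weight and score the blend by RMSE against held-out ratings. If the two models' errors are nearly orthogonal, some weight alpha in (0, 1) beats either model alone.

```java
// Illustrative sketch (not Mahout code): a two-method ensemble as a convex
// blend of SVD and RBM predictions, scored by RMSE on held-out ratings.
public class BlendSketch {
    /** Element-wise convex combination: alpha * svd + (1 - alpha) * rbm. */
    static double[] blend(double[] svdPred, double[] rbmPred, double alpha) {
        double[] out = new double[svdPred.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = alpha * svdPred[i] + (1.0 - alpha) * rbmPred[i];
        }
        return out;
    }

    /** Root-mean-square error of predictions against true ratings. */
    static double rmse(double[] pred, double[] truth) {
        double sum = 0.0;
        for (int i = 0; i < pred.length; i++) {
            double d = pred[i] - truth[i];
            sum += d * d;
        }
        return Math.sqrt(sum / pred.length);
    }
}
```

Sweeping alpha over a grid on the probe set and checking whether the best blend beats both single-model RMSEs is a cheap way to verify the "perpendicular dimension" claim.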

      3.4 July 31-August 16 Pencils down
      Refactor, tune and clean code. Final demo done. Write documentation and add a wiki page.

      4 About Myself
      I'm a 19-year-old student from the beautiful, sea-hugging coastal city of Visakhapatnam in Andhra Pradesh, South India. In 2006, I was one of only 110 students in the country to be awarded a scholarship from the Indian Institute of Science and the Department of Science and Technology, Government of India, under the KVPY programme, and I attended their summer camp that year.

      I interned in the Prediction Algorithms Lab at GE Research, Bangalore, last summer. I worked on a custom toolkit in C++ that implemented various Netflix algorithms and used data parallelization for some of the more lengthy algorithms, like user-user kNN. Our team stood at rank 409 at the end of the two-month internship, when the Grand Prize was awarded.

      I have also published independent work in GECCO 2010 [3]. GECCO is ACM SIGEVO's annual conference, and is ranked 11th out of 701 international conferences in AI, ML, Robotics, and HCI based on its impact factor.

      I have also contributed code to FFmpeg, and was part of my Hall of Residence's award-winning Java-based Open Soft project team, whose work we have now open-sourced. I am also an avid quizzer and have won several prestigious quizzes during my school days, and I secured 4th rank in the regional qualifiers for the Indian National Mathematics Olympiad.

      References
      [1] E. J. Candes and T. Tao. The Power of Convex Relaxation: Near-Optimal Matrix Completion. arXiv, 2009.
      [2] R. H. Keshavan, A. Montanari, and S. Oh. Matrix Completion from Noisy Entries. arXiv, 2009.
      [3] S. Koppaka and A. R. Hota. Superior Exploration-Exploitation Balance with Quantum-Inspired Hadamard Walks. Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO '10), to appear, 2010.
      [4] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines for Collaborative Filtering. Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007.

      Open Source Java Project - http://blunet.googlecode.com
      FFmpeg patches
      http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
      http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd

      1. MAHOUT-375.diff
        26 kB
        Sisir Koppaka
      2. MAHOUT-375.patch
        78 kB
        Sisir Koppaka
      3. MAHOUT-375.patch
        74 kB
        Sisir Koppaka
      4. MAHOUT-375.patch
        68 kB
        Sisir Koppaka
      5. MAHOUT-375.patch
        27 kB
        Jake Mannix


          Activity

          Sisir Koppaka added a comment - - edited

          Moved the proposal to JIRA. I've processed the Netflix dataset in the format Sean suggested on mahout-dev and put it up at https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv (click to view in browser/download). The format is [movieID, userID, rating]. [The complete file is 1.5GB, so view in browser unless you need the whole file! :)]

          During the proposal consideration period, I am implementing a version of the RBM that is not conditional, not factored, and not yet parallelized. I will submit it tomorrow after testing. Since testing on my machine alone is currently rather time-consuming on the Netflix dataset, are there any datasets you could suggest for quicker testing of the RBM, at least for now? If the test dataset has some RBM results that I can compare with, that would really help with the testing.
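The read-in for the processed file above could, in minimal form, look like this (a hypothetical sketch; `NetflixCsvSketch` is my own name, not existing Mahout code): each line of the CSV is one "movieID,userID,rating" triple.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual Mahout read-in code): parse the
// processed file's "movieID,userID,rating" lines into int triples.
public class NetflixCsvSketch {
    static List<int[]> parse(Iterable<String> lines) {
        List<int[]> ratings = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split(",");
            ratings.add(new int[] {
                Integer.parseInt(parts[0].trim()), // movieID
                Integer.parseInt(parts[1].trim()), // userID
                Integer.parseInt(parts[2].trim())  // rating (1-5)
            });
        }
        return ratings;
    }
}
```

In the real implementation this would stream from HDFS rather than materialize a list, since the full file is around 1.5 GB.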

          Sisir Koppaka added a comment -

          There is an implementation of a pure RBM here. However, the pure RBM uses the data in a format based on a bitmasking method, which is not what we would do finally. At this point, I am not 100% sure which DataModels and Mahout-specific data structures to use here to make sure things scale. I hope to sort that out by next week.

          Jake Mannix added a comment -

          Looks like the patch you submitted, Sisir, was probably made off a git branch, and you didn't diff with the --no-prefix option, because it put the source in "b/core/src/..." instead of "core/src/...". This patch should apply cleanly to current trunk.

          It doesn't compile, however: there are lots of missing bitmask constants like USER_MOVIEMASK, which are marked "TODO replace", and the RBMRecommender is in even worse shape.

          Good to see code, though, in any shape!

          Jake Mannix added a comment -

          By the way, Sisir, you need to make sure to click the "Grant Apache License" button, so that your code can actually be used by others without violating your copyright. Now that I notice you didn't click it, I don't think my having clicked it on a trivial change to your patch is valid. Please resubmit, granting license to the ASF. Thanks!

          Robin Anil added a comment -

          And yes, the code needs to be formatted to conform with the checkstyle rules. It's easier to start with good code style than to have to mass-fix the code later. It's as easy as importing the Eclipse formatter (available on the wiki) into Eclipse and pressing Ctrl+Shift+F. You can also run mvn checkstyle:checkstyle to check the code for any flaws. But don't get hung up on removing all errors; just make sure it's 90-95% OK.

          Sisir Koppaka added a comment -

          Attached the current version of the implementation. I am not sure about a few things; I will clear them up on dev@m.a.o and then update again. The latest version of the implementation auto-updates at http://github.com/sisirkoppaka/mahout-rbm/compare/trunk...rbm

          Sisir Koppaka added a comment -

          Updated with RBMStateWritable. http://github.com/sisirkoppaka/mahout-rbm/commit/5309923d92a311a0590a35c6e164462b71306707
          Sisir Koppaka added a comment -

          Compiles cleanly. It doesn't do everything it's supposed to yet, due to misplaced function calls. Fixing that, and I will put it up after testing in a few hours.

          Grant Ingersoll added a comment -

          I'm not able to get this last patch to compile. I think this is due to some of the changes in the command line processing.

          Grant Ingersoll added a comment -

          I'm not an expert in RBMs, but they are classifiers, right? So, would it make sense to think of it as being based in the classifier packages and then also having a recommender layer on top of it?
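Grant's classifier-plus-recommender-layer idea could be sketched as follows (all names here are hypothetical, not Mahout APIs): the RBM acts as a classifier producing a distribution over the five rating classes, and a thin recommender layer turns that distribution into a predicted rating by taking its expectation.

```java
// Illustrative sketch (hypothetical names): an RBM-as-classifier producing
// P(rating = r | user, movie) for r = 1..5, with a recommender layer on top.
public class RbmRecommenderLayerSketch {
    /** The classifier view: a distribution over the five rating classes. */
    interface RatingClassifier {
        double[] classProbabilities(long userId, long movieId);
    }

    /** The recommender layer: expected rating under the class distribution. */
    static double estimatePreference(RatingClassifier classifier, long userId, long movieId) {
        double[] p = classifier.classProbabilities(userId, movieId);
        double expected = 0.0;
        for (int r = 1; r <= p.length; r++) {
            expected += r * p[r - 1];
        }
        return expected;
    }
}
```

This split would let the RBM live in the classifier packages while the recommender layer adapts it to the Taste-style estimate-preference interface.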

          Grant Ingersoll added a comment -

          Sisir,

          Can you put up a final patch and a status for this?

          Sean Owen added a comment -

          Yes, happy to commit your final patch. I think we need that to finish GSoC. And we'd need it soon if it's going into 0.4.

          Sean Owen added a comment -

          No plans for this to be completed at the moment; GSoC project did not complete.

          Dirk Weißenborn added a comment - - edited

          I am interested in writing a patch for this problem, but I've got a few questions about Mahout:
          I am not quite sure where to put some of the important classes.
          Isn't there a package containing models that can be used throughout Mahout? Neural networks (e.g. RBMs in different compositions), for example, have more applications than just classification, so it seems a little redundant to write classes for these basic models in each package.
          Take, for example, a class Neuron (and all its subclasses), which should be used by every neural network implementation: where is the proper place for this class?
          And if I wanted to use a class from a classification package in a clustering class, for example, how should that be done properly?

          Grant Ingersoll added a comment -

          Hi Dirk,

          I think it makes sense to have some common areas, but perhaps you don't need to worry about it just yet unless you see places where it can be immediately used. I don't think we have any other neural network implementations at the moment (well, simple perceptron, I guess). I guess I would start with what you think is best and then we can iterate. It's often easier to discuss a concrete patch.

          Dirk Weißenborn added a comment -

          Ok,
          in that case I am going to give it a try as a recommender system at first, as it was intended, and later on I could extend it to a classifier. Anyway, I'll just start playing around with Mahout and this problem and see what I can do.


            People

            • Assignee: Grant Ingersoll
            • Reporter: Sisir Koppaka
            • Votes: 1
            • Watchers: 5
