Mahout / MAHOUT-627

Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

    Details

      Description

      Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

      Student Name: Dhruv Kumar

      Student E-mail: dkumar@ecs.umass.edu

      Organization/Project: Apache Mahout

      Assigned Mentor:

      Proposal Abstract:

      The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its guarantee of finding a locally maximal Maximum Likelihood Estimate in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot scale to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework for enhanced model fitting over large data sets.

      Detailed Description:

      Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation, combined with their ability to discover latent domain knowledge, has made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging.

      An HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to a state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.
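
      To make the parameter tuple concrete, the sketch below holds (A, B, Pi) in plain Java arrays. It is purely illustrative and is not Mahout's existing HmmModel class.

      /** Illustrative container for the HMM parameters Theta = (A, B, Pi); not Mahout's HmmModel. */
      public class HmmParameters {
        final double[][] a;  // |S| x |S| transition matrix, each row sums to 1
        final double[][] b;  // |S| x |O| emission matrix, each row sums to 1
        final double[] pi;   // length-|S| initial state distribution, sums to 1

        HmmParameters(double[][] a, double[][] b, double[] pi) {
          this.a = a;
          this.b = b;
          this.pi = pi;
        }

        /** a_(i,j): probability of transitioning from hidden state i to hidden state j. */
        double transition(int i, int j) { return a[i][j]; }

        /** b_(s,o): probability that hidden state s emits observable symbol o. */
        double emission(int s, int o) { return b[s][o]; }

        /** pi_(s): probability that the system starts in hidden state s. */
        double start(int s) { return pi[s]; }
      }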

      Rabiner [1] defined three main problems for HMMs:

      1. Evaluation: Given the complete model (S, O, Theta) and an observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so-called Forward algorithm.

      2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the most likely hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.

      3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

      In general, the quality of HMM training can be improved by employing large training vectors, but currently Mahout only supports sequential versions of HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch training methods, the Baum-Welch algorithm is the more accurate and the better candidate for a parallel implementation, for two reasons:
      (1) BW is numerically stable and guarantees discovery of a locally maximal Maximum Likelihood Estimate (MLE) of the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
      (2) BW belongs to the general class of Expectation-Maximization (EM) algorithms, which fit naturally into the Map-Reduce framework [2], as shown by the existing Map-Reduce implementation of k-means in Mahout.

      Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed version. Since the distributed BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and beta factors in each iteration, much of the existing HMM training code will be reused. The parallel implementation of the BW algorithm on Map-Reduce is elaborated at length in [3], which treats it as a specific case of the Expectation-Maximization algorithm, and that treatment will be followed in this project.

      The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i the model parameter tuple Theta^(i) is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user-specified convergence condition expressed as a fixpoint, or when the number of iterations exceeds a user-defined value.
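
      A minimal sketch of this iteration control, assuming the model parameters are flattened into a single double[] and that runSingleEmIteration() stands in for one complete MapReduce E+M pass (both simplifications are made only for illustration):

      public abstract class EmDriverSketch {
        /** Placeholder for one MapReduce pass that computes Theta^(i+1) from Theta^(i). */
        abstract double[] runSingleEmIteration(double[] theta);

        double[] train(double[] theta0, int maxIterations, double convergenceDelta) {
          double[] theta = theta0;                           // Theta^(0)
          for (int i = 0; i < maxIterations; i++) {
            double[] next = runSingleEmIteration(theta);     // Theta^(i+1)
            double change = 0.0;
            for (int k = 0; k < theta.length; k++) {
              change = Math.max(change, Math.abs(next[k] - theta[k]));
            }
            theta = next;
            if (change < convergenceDelta) {                 // user-specified fixpoint condition
              break;
            }
          }
          return theta;
        }
      }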

      Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map (a sketch of the per-sequence count accumulation follows the list):
      (1) Expected number of times a hidden state is reached (Pi).
      (2) Number of times each observable symbol is generated by each hidden state (B).
      (3) Number of transitions between each pair of states in the hidden state space (A).
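
      The sketch below shows the per-sequence accumulation of these three groups of expected counts, assuming the alpha (forward) and beta (backward) matrices have already been produced by the existing sequential implementations; the plain-array model and the method name are placeholders for illustration, not the patch's actual Writable types.

      public final class ExpectedCountsSketch {
        public static void accumulate(int[] obs, double[][] a, double[][] b,
                                      double[][] alpha, double[][] beta,
                                      double[] expectedPi, double[][] expectedA, double[][] expectedB) {
          int numObservations = obs.length;
          int numStates = a.length;

          // P(O | Theta): total probability of the sequence from the final forward variables.
          double seqProb = 0.0;
          for (int i = 0; i < numStates; i++) {
            seqProb += alpha[numObservations - 1][i];
          }

          for (int t = 0; t < numObservations; t++) {
            for (int i = 0; i < numStates; i++) {
              // gamma_t(i): posterior probability of being in hidden state i at time t.
              double gamma = alpha[t][i] * beta[t][i] / seqProb;
              if (t == 0) {
                expectedPi[i] += gamma;                  // key group (1): initial hidden-state counts
              }
              expectedB[i][obs[t]] += gamma;             // key group (2): emission counts
              if (t < numObservations - 1) {
                for (int j = 0; j < numStates; j++) {
                  // xi_t(i, j): posterior probability of an i -> j transition at time t.
                  double xi = alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / seqProb;
                  expectedA[i][j] += xi;                 // key group (3): transition counts
                }
              }
            }
          }
        }
      }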

      The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are obtained by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.
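
      For one key, the relative-frequency re-estimation amounts to the normalization sketched below; this shows only the arithmetic, while the actual reducer additionally merges the per-mapper hash-maps and writes the result to HDFS.

      public final class MaximizationSketch {
        /** Re-estimates one distribution (a row of A or B, or the Pi vector) by relative frequency. */
        public static double[] normalize(double[] expectedCounts) {
          double total = 0.0;
          for (double count : expectedCounts) {
            total += count;
          }
          double[] probabilities = new double[expectedCounts.length];
          for (int k = 0; k < expectedCounts.length; k++) {
            // Relative frequency of each event with respect to the total expected count for this key.
            probabilities[k] = total > 0.0 ? expectedCounts[k] / total : 0.0;
          }
          return probabilities;
        }
      }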

      The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is also an EM algorithm, particular attention will be paid to its code at each step for possible reuse.

      A list of milestones, associated deliverables, and high-level implementation details is given below.

      Time-line: April 26 - Aug 15.

      Milestones:

      April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.

      May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

      June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

      July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.

      July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

      July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

      Additional Information:

      I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner.
      I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning.

      References:

      [1] A tutorial on hidden Markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

      [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

      [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

      [4] http://flexcrfs.sourceforge.net/#Case_Study

      [5] http://www.rcsb.org/pdb/home/home.do

      [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

      [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.

      1. MAHOUT-627.patch
        96 kB
        Dhruv Kumar
      2. ASF.LICENSE.NOT.GRANTED--screenshot.png
        288 kB
        Chenghao Liu
      3. MAHOUT-627.patch
        105 kB
        Grant Ingersoll
      4. MAHOUT-627.patch
        2 kB
        Grant Ingersoll
      5. MAHOUT-627.patch
        97 kB
        Dhruv Kumar
      6. MAHOUT-627.patch
        95 kB
        Grant Ingersoll
      7. MAHOUT-627.patch
        95 kB
        Dhruv Kumar
      8. MAHOUT-627.patch
        68 kB
        Dhruv Kumar
      9. MAHOUT-627.patch
        68 kB
        Dhruv Kumar
      10. MAHOUT-627.patch
        60 kB
        Dhruv Kumar
      11. MAHOUT-627.patch
        50 kB
        Dhruv Kumar
      12. MAHOUT-627.patch
        43 kB
        Dhruv Kumar

        Activity

        Sebastian Schelter made changes -
        Fix Version/s Backlog [ 12318886 ]
        Sebastian Schelter made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Suneel Marthi added a comment -

        Moving this to Backlog per email from Grant.

        Suneel Marthi made changes -
        Fix Version/s Backlog [ 12318886 ]
        Fix Version/s 0.9 [ 12324577 ]
        Suneel Marthi added a comment -

        Hi Dhruv,

        Will this be ready for 0.9 (tentatively mid-November)? It would be great if you could address Grant's comments above.

        Grant Ingersoll added a comment -

        Dhruv,

        Any chance this can get done?

        Grant Ingersoll made changes -
        Fix Version/s 0.9 [ 12324577 ]
        Fix Version/s 0.8 [ 12320153 ]
        Grant Ingersoll added a comment -

        Hi Dhruv,

        Thanks for the response. We are trying to get 0.8 in the next week or two. Any help on a short example as well as updating the code to trunk would be awesome.

        Thanks,
        Grant

        Dhruv Kumar added a comment -

        Hi Grant,

        As I understand it, the only blocker for this issue is a small, self-contained example which users can run in a reasonable amount of time and see the results. The parts-of-speech tagger example which I originally adapted for this trainer can take hours to converge, and sometimes it fails with arithmetic underflow due to an unusually large set of observed states (the observed states are the words of the corpus in the POS tagger's model).

        When is 0.8 due? I can chip away at this issue for the next few days in the evenings and hunt for a short example from the book mentioned above. It should take at least a week or two before I can sign off on it from my side.

        There are also unit tests with the trainer which demonstrate that it works--the results of Map Reduce based training are identical to the ones obtained in the sequential version.

        Grant Ingersoll added a comment -

        Dhruv, can you update by chance? Otherwise, will remove this from 0.8.

        Grant Ingersoll added a comment -

        Cool, thanks, Dhruv. Any luck on the examples?

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12544947 ]
        Dhruv Kumar added a comment -

        Bringing up to date with trunk by accounting for the new driver.classes.default.props file

        Dhruv Kumar added a comment -

        Hi Tim,

        Thanks a lot for trying out my patch and providing these examples. Please let me know if you need any help or clarification about the API. Like I've mentioned above, I need a good example to demonstrate the capability so I'll look at your link to see if it fits the need here.

        Tim Schultz added a comment -

        Hi all - hoping I can revive this discussion

        I'm not yet well-versed in Mahout, especially as applied to HMMs, but I find the idea quite interesting - especially since I have massive amounts of data that would be ideal for it.

        I'm more familiar with R, and am knowledgeable about some example datasets that can be used for testing (below). I've applied Dhruv's patch and am currently rebuilding Mahout. I will see if I can get some of these examples working on my local Hadoop instance, but there will be a slight learning curve.

        http://134.76.173.220/hmm-with-r/data/

        README: http://134.76.173.220/hmm-with-r/data/00_README.txt

        Thoughts?

        Hope this helps.

        -Tim

        Grant Ingersoll added a comment -

        Does anyone else have a recommendation for a HMM training example I can use?

        Perhaps look at other HMM packages?

        Dhruv Kumar added a comment -

        Sorry for being MIA for a while. I have relocated to SF and was extremely busy coming up to speed with my new job. That being said, I do want to work on this, maintain it and make sure that this feature makes it to Mahout's trunk.

        This example is not entirely suitable for demonstrating the MR version of HMM training. The state space is very large which causes underflow very easily.

        I'm searching for a good example for this feature.

        Does anyone else have a recommendation for a HMM training example I can use?

        Chenghao Liu made changes -
        Attachment screenshot.png [ 12526870 ]
        Grant Ingersoll made changes -
        Fix Version/s 0.8 [ 12320153 ]
        Fix Version/s 0.7 [ 12319261 ]
        Grant Ingersoll added a comment -

        Still think this is useful, but we need a fix for the example. Marking for 0.8

        Grant Ingersoll added a comment -

        Dhruv, any luck on the example issue?

        Grant Ingersoll made changes -
        Fix Version/s 0.7 [ 12319261 ]
        Fix Version/s 0.6 [ 12316364 ]
        Grant Ingersoll added a comment -

        Marking this as 0.7, as much as I would love to get it in for 0.6.

        Grant Ingersoll added a comment -

        Once I get past the example issue, I think this is ready to go for the most part.

        Grant Ingersoll made changes -
        Attachment MAHOUT-627.patch [ 12507850 ]
        Grant Ingersoll added a comment -

        Converts the trainer to be an AbstractJob, brings up to trunk.

        Grant Ingersoll added a comment -

        I'm getting

        1/12/18 17:21:02 WARN mapred.LocalJobRunner: job_local_0010
        java.lang.IllegalArgumentException: The output state probability from hidden state 0 to output state 2 is negative
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.mahout.classifier.sequencelearning.hmm.HmmUtils.validate(HmmUtils.java:187)
        at org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchMapper.setup(BaumWelchMapper.java:111)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
        11/12/18 17:21:03 INFO mapred.JobClient: map 0% reduce 0%
        11/12/18 17:21:03 INFO mapred.JobClient: Job complete: job_local_0010
        11/12/18 17:21:03 INFO mapred.JobClient: Counters: 0
        Exception in thread "main" java.lang.InterruptedException: Baum-Welch Iteration failed processing tmp/output/model-9
        at org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchDriver.runIteration(BaumWelchDriver.java:315)
        at org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchDriver.runBaumWelchMR(BaumWelchDriver.java:253)
        at org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BWPosTagger.trainModel(BWPosTagger.java:293)
        at org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BWPosTagger.main(BWPosTagger.java:364)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)

        when running the example. Otherwise, all tests pass.

        Grant Ingersoll made changes -
        Attachment MAHOUT-627.patch [ 12506598 ]
        Grant Ingersoll added a comment -

        Brings up to trunk

        Suneel Marthi added a comment -

        While reviewing the code in BaumWelchTrainer.java, I noticed that we have a bunch of System.out.println() statements. The code needs some cleanup to replace these with SLF4J logger calls.

        Grant Ingersoll added a comment -

        I'm going to look to commit this soon after ApacheCon (or perhaps during)

        Dhruv Kumar added a comment -

        Hi Grant,

        Sorry I was caught up with the job interviews and turning in the graduation documents.

        Here is the patch with the changes which you wanted:

        1. Created a new shell script which automatically downloads the training and test sets. If the sets are already present, it skips the download.

        2. Modified the POS tagger example code to avoid the download and accept the training and test sets via command line arguments. These arguments are passed by the script in #1.

        Please let me know if you need further changes to make it commit ready!

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12493862 ]
        Grant Ingersoll added a comment -

        Dhruv, any progress on the last pieces here? I'd like to see this get committed relatively soon.

        Dhruv Kumar added a comment -

        Hi Grant,

        Thanks for the feedback and for fixing the code style.

        I will create a script for the example and have it download the test data only if it is not already present.

        In your testing, if you come across any corner case that my testing has missed, please let me know. I can add a test for it and refactor the code to eliminate the bug. I am traveling until Saturday for a job interview in Seattle, but I should be able to roll out the patch soon after that!

        Grant Ingersoll made changes -
        Attachment MAHOUT-627.patch [ 12491473 ]
        Grant Ingersoll added a comment -

        Some minor changes to move the packaging around to be a bit more consistent w/ the rest of Mahout (I think). Also some minor other tweaks in style.

        Dhruv, for the example, I think it would be good to have a shell script to run just like the other examples. Also, it should probably move the downloading of the test/train data out to that script (and only do it if it isn't already there.)

        I am still reviewing the algorithm itself, but it looks pretty good and seems consistent with our sequential implementation.

        Dhruv Kumar added a comment - - edited

        In its current form, one should follow these steps to use the trainer (a rough sketch of preparing the input follows the list):

        1) Encode the hidden and emitted state tags to integer ids and store them as MapWritable SequenceFiles. The MapWritableCache.Save() method comes in handy here.
        2) Convert the input sequence to the integer ids as described by the emitted states map in step 1. Wrap the input sequence as a VectorWritable.
        3) Store the VectorWritable obtained in step 2 as a sequence file containing the key as an arbitrary LongWritable and the value as the integer sequence. This forms the input to the trainer.
        4) (Optional) Use the BaumWelchUtils.BuildHmmModelFromDistributions() method to store an initial model with given distributions. The model will be stored as a SequenceFile containing MapWritables as the distributions.
        5) Invoke the trainer via the command line or through the API by calling the driver's run() method. Users can specify whether to use the log-scaled variant by setting logScaled to true, and they can specify the convergence delta, the maximum number of iterations, etc. If step 4 is omitted, the users must ask the program to create a random initial model by setting buildRandom to true. This starts the iterative training using Maximum Likelihood Estimation.
        6) At the end, as the result of the training, an HmmModel is stored as a MapWritable with probability distributions encoded as DoubleWritables. The utility method BaumWelchUtils.CreateHmmModel(Path) can be used to decode the result and obtain the HmmModel.

        Design Discussion

        The design uses MapWritables and SequenceFiles to freely convert between the legacy HmmModel and a serializable variant which also encodes the probability distributions. This design choice had the following advantages:

        1. I could leverage a lot of existing functionality of the legacy sequential HMM code by writing utility methods to encode and decode (the BaumWelchUtils class was made for this purpose).

        2. The users ultimately get the legacy HmmModel at the end of step 6; they can then use it to decode the test sequence using the HmmAlgorithms.decodeSequence() method. They also have at their disposal all the other methods provided by the legacy code.

        3. Since the trained model is persisted in a SequenceFile, one can store these models for future reference and use the BaumWelchUtils.CreateHmmModel(Path) later to decode it and compare with other trained models (possibly with different initial seed values).

        Dhruv Kumar added a comment -

        First MapReduce based open-source Baum-Welch HMM Trainer!

        I have attached the Patch for inclusion into the trunk, keeping in line with the "firm pencils down date."

        It is complete with all the deliverables as listed in the project's timeline: unit tests, documentation, a POS example. Specifically, the following improvements have taken place in the last 5 days:

        1. Created a new log-scaled training variant by refactoring the mapper, combiner, reducer and driver classes (and added a unit test for it). The option for log scaling can be invoked via the command line and causes the trainer to operate in log space between mapper -> combiner -> reducer. This should provide numerical stability for extremely long sequences or large state spaces (or both); see the illustrative log-space snippet after this list.

        2. Added a scalable Map-Reduce based Parts Of Speech tagger which uses the log scaled training.

        3. (Minor) Changed the input format from IntArrayWritable to Mahout's VectorWritable.
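
        To illustrate why log space helps (this generic log-sum-exp snippet is not the patch's actual code): multiplying many small probabilities underflows in linear space, while the equivalent sums remain representable in log space.

        public final class LogSpaceSketch {
          /** Returns log(exp(logA) + exp(logB)) without leaving log space. */
          public static double logSum(double logA, double logB) {
            if (Double.isInfinite(logA) && logA < 0) {
              return logB;  // adding probability 0
            }
            if (Double.isInfinite(logB) && logB < 0) {
              return logA;
            }
            double max = Math.max(logA, logB);
            return max + Math.log(Math.exp(logA - max) + Math.exp(logB - max));
          }

          public static void main(String[] args) {
            double linear = 1.0;
            double logSpace = 0.0;
            for (int i = 0; i < 10000; i++) {
              linear *= 0.01;                // underflows to 0.0 long before the loop ends
              logSpace += Math.log(0.01);    // stays finite in log space
            }
            System.out.println(linear);      // 0.0
            System.out.println(logSpace);    // about -46051.7
          }
        }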

        It will be awesome to get some feedback on the code, functionality, design etc.

        I'm eager to keep improving the trainer, fixing any bugs as they arise, and making it more useful for users!

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12491218 ]
        Dhruv Kumar added a comment -

        Hi Grant,

        I have uploaded the first candidate patch for this issue's resolution and it will be great to get some feedback on it from you and the dev community. It contains:

        1. Complete individual unit tests for the mapper, combiner, and reducer to verify accurate summarization, normalization, and probability matrix and vector lengths.
        2. Unit tests for the overall trainer. The trained model's probability values are validated against the sequential HMM implementation of Mahout (which in turn used the R and Matlab HMM packages for validation).
        3. Documentation for each of the 8 classes under the new classifier.sequencelearning.baumwelchmapreduce package.
        4. Command line BaumWelch MapReduce training utilities--can be invoked using "bin/mahout hmmBaumWelchMapReduce <args>". The driver.classes.props file was modified for the same.

        On my system, which is an aging Pentium 4, the unit tests for baumwelchmapreduce took 57 seconds to complete.

        Please let me know what you think and where things can be improved. I will be refactoring this based on yours and others feedback until the firm pencils down date next week on Monday 22nd.

        Thank you.
        Dhruv

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12490564 ]
        Dhruv Kumar made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Grant Ingersoll added a comment -

        Hey Dhruv, nearing pencils down, how are we doing?

        Dhruv Kumar added a comment -

        Hi Grant,

        I am finishing up some documentation and a few tests. The unit testing has led to a lot of refactoring in the BaumWelchMapper and the BaumWelchUtils classes, which was somewhat expected. I should be able to wrap this up before the pencils down deadline though, with an example of POS tagging to follow.

        Grant Ingersoll added a comment -

        Dhruv,

        Any luck yet on the unit tests?

        Dhruv Kumar added a comment -

        Uploaded a new patch with refactorings and miscellaneous improvements. This concludes the chain's implementation and testing with manual inputs. The trainer works and provides a scalable variant of Baum-Welch. The next phase of the project will entail more testing of the chain via unit tests and implementation of the log-scaled variant.

        Next patch will contain more documentation and unit tests for some of the methods of the trainer.

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12486400 ]
        Dhruv Kumar added a comment -

        Uploaded a new patch after a week's worth of testing:

        • Bug fixes for a few corner cases
        • Refactoring of the BaumWelchUtils and BaumWelchMapper classes.
        • Added verbose loggers for debugging.
        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12486067 ]
        Dhruv Kumar added a comment -

        Thanks Sergey.

        I ended up creating a serialized version of HmmModel too. However, for the time being I'm not going to use it, per my design which relies heavily on MapWritables for storing the distributions with a large key space equal to 2|S| + 1, where |S| is the number of hidden states.

        The sequential command line utils are convenient for validating the results against the MapReduce variant so they are definitely useful in my case.

        Sergey Bartunov added a comment -

        Hey, Dhruv. I submitted some code in https://issues.apache.org/jira/browse/MAHOUT-734 which contains an HmmModel serialization utility and command-line tools for the sequential HMM functionality; it could be integrated into your code.

        Dhruv Kumar added a comment -

        Uploaded a new patch:

        1. Added a new method for building a random HmmModel, wrapping the rows of the distribution matrices as MapWritable types, and writing them to the specified directory in Sequence File format.

        2. Updated common/DefaultOptionCreator for the new option in #1. Also added an option for the user to specify the directory containing a pre-written HmmModel object (as a Sequence File containing MapWritables).

        3. Updated the driver class to accommodate #1 and #2.

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12484498 ]
        Dhruv Kumar added a comment -

        Hi Grant,

        Since my last update, I have finished the first round of implementation of the end-to-end functionality, resulting in 7 new classes under the classifier.sequencelearning.baumwelchmapreduce package. The attached patch contains the following:

        1. BaumWelchDriver, BaumWelchMapper, BaumWelchCombiner and BaumWelchReducer.
        2. MapWritableCache, a general class to load MapWritable files from the HDFS.
        3. BaumWelchUtils, a utility class for constructing the legacy HmmModel objects from a given HDFS directory containing the probability distributions (emission, transition and initial) as MapWritable types, stored as Sequence Files.
        4. BaumWelchModel, a serializable version of HmmModel.

        In addition to these classes, the patch also adds support for command line training of HMM using the BaumWelch MapReduce variant by modifying common/commandline/DefaultOptionCreator.java and the src/conf/driver.classes.props files.

        In order to maximize parallelization, a total of 2|S| + 1 keys are emitted, where |S| is the number of hidden states. There are |S| keys, one for each hidden state, with corresponding values containing expected transition counts as a MapWritable object. Similarly, there are |S| keys and corresponding MapWritable values encoding the emission probability distribution. Finally, there is one key with a MapWritable value encoding the initial probability distribution vector. The large key space permits the use of a larger number of reducers, with each key being processed separately by one reducer in the best case.

        The input and output formats are of Sequence File type. The keys for input are LongWritable and the values are ArrayWritable containing int[] observations, where each int in the observed sequence is a mapping of the training set's tokens defined by the user. The reducers write the distributions as Sequence Files with keys of type Text and values as MapWritable. I have extensively re-used the current sequential variant of HMM training, and a lot of my design decisions w.r.t. input and output types were guided by the legacy code's API.

        Now that the driver-mapper-combiner-reducer chain's preliminary implementation is complete, the rest of the time will be actively spent in testing, debugging and refinement of the new trainer's features. In particular, I'm looking at alternative types to ArrayWritable for wrapping the observation sequence given to the mappers. For text mining, a simple mapping which encodes tokens in the text corpus as integer states could be performed, as expected by the current design. However, I'm not sure if this approach is the best w.r.t. scalability, or whether it is at all applicable to domains other than Information Retrieval that require scalable HMM training. I'm aware that a lot of other algorithms in Mahout require the input in the form of Vectors packed into a Sequence File, and it will be useful to get feedback on this issue.

        Dhruv Kumar made changes -
        Attachment MAHOUT-627.patch [ 12483899 ]
        Dhruv Kumar made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Dhruv Kumar made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Grant Ingersoll added a comment -

        Hi Dhruv,

        How goes progress on this?

        Sean Owen made changes -
        Assignee Grant Ingersoll [ gsingers ]
        Fix Version/s 0.6 [ 12316364 ]
        Affects Version/s 0.5 [ 12315255 ]
        Dhruv Kumar made changes -
Description [proposal text updated]
        Dhruv Kumar made changes -
Description [proposal text updated]
        Dhruv Kumar made changes -
        Description Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework to allow training at a large scale for enhanced model fitting.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map-Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

        Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map:
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).

        The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration.

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.

        May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

        July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.

        July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

        July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Provide an example demonstrating the new parallel BW algorithm on the parts-of-speech tagging data set also used by the sequential BW [4]. Tidy up code, fix loose ends and finish the wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. As part of my Master's research under the guidance of Prof. Wayne Burleson, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project, for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map-Reduce (extending Mahout) to compute it in a scalable, distributed manner.
        I owe much to the open source community, as all my research experiments have only been possible due to freely available Linux distributions, performance analyzers, scripting languages and associated documentation. Since joining the Apache Mahout developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will also allow me to contribute within my modest means to the overall spirit of open source programming and machine learning.

        References:

        [1] A tutorial on hidden Markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study

        [5] http://www.rcsb.org/pdb/home/home.do

        [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

        [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
        Dhruv Kumar added a comment - Updated the abstract.
        Dhruv Kumar made changes -
        Description Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the Baum-Welch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

        Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map:
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).

        The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration.

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.

        May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

        July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.

        July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

        July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner.
        I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming.

        References:

        [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study

        [5] http://www.rcsb.org/pdb/home/home.do

        [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

        [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision
        Processes. In ICML, 2006.
        Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and as an effect, constraints generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework to allow training at a large scale for enhanced model fitting.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map-Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration: Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization step is handled by the reducers. Starting from an initial guess Theta^(0), in each iteration i the parameter tuple Theta^(i) is input to the algorithm and the result Theta^(i+1) is fed to iteration i+1. Iteration stops when a user-specified convergence condition (a fixed point of the parameters) is reached or when the number of iterations exceeds a user-defined maximum.
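
        A minimal sketch of this control loop is given below. Here runIteration() is a hypothetical placeholder that would configure and submit one Hadoop job (Expectation in the mappers and combiner, Maximization in the reducers) and read the re-estimated parameters back from HDFS; neither the class nor the method names are part of the actual Mahout API.

        // Illustrative only: driver-side iteration for Baum-Welch on Map-Reduce.
        public class BaumWelchDriverSketch {

          /** Plain holder for the parameter tuple (A, B, Pi). */
          static class Theta {
            double[][] a;   // |S| x |S| transition probabilities
            double[][] b;   // |S| x |O| emission probabilities
            double[] pi;    // |S| initial state probabilities
          }

          /** Placeholder: would submit one Hadoop job and read Theta^(i+1) back from HDFS. */
          static Theta runIteration(Theta current) {
            return current;
          }

          /** Largest absolute change in any parameter, used for the fixed-point test. */
          static double maxAbsDifference(Theta x, Theta y) {
            double d = 0.0;
            for (int i = 0; i < x.pi.length; i++) {
              d = Math.max(d, Math.abs(x.pi[i] - y.pi[i]));
              for (int j = 0; j < x.a[i].length; j++) {
                d = Math.max(d, Math.abs(x.a[i][j] - y.a[i][j]));
              }
              for (int k = 0; k < x.b[i].length; k++) {
                d = Math.max(d, Math.abs(x.b[i][k] - y.b[i][k]));
              }
            }
            return d;
          }

          static Theta train(Theta initial, int maxIterations, double epsilon) {
            Theta theta = initial;                      // Theta^(0)
            for (int i = 0; i < maxIterations; i++) {
              Theta next = runIteration(theta);         // Theta^(i+1)
              boolean converged = maxAbsDifference(next, theta) < epsilon;
              theta = next;
              if (converged) {
                break;                                  // user-specified fixed point reached
              }
            }
            return theta;
          }
        }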

        The Expectation step computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm; the reducers finish the Expectation step by aggregating these expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols. For each training instance, a mapper emits the same set of keys, corresponding to the following three optimization problems to be solved during Maximization, with the associated expected counts collected in a hash-map as the value (a mapper sketch follows the list below):
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).
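
        To make the mapper's contract concrete, one possible shape of the class is sketched below. The sketch is illustrative only: the expected-count computation from the alpha and beta factors is omitted (in the real implementation it would come from the existing forward and backward routines in HmmAlgorithms), and the class and key names are assumptions made for the sketch rather than the final design.

        import java.io.IOException;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        /** Illustrative only: Expectation-step mapper emitting expected counts per training instance. */
        public class BaumWelchMapperSketch
            extends Mapper<LongWritable, Text, Text, MapWritable> {

          private static final Text KEY_PI = new Text("PI");
          private static final Text KEY_A = new Text("A");
          private static final Text KEY_B = new Text("B");

          @Override
          protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
            // One training instance: a whitespace-separated string of observed symbols.
            String[] observations = line.toString().trim().split("\\s+");

            // Expected counts for this instance, keyed by event:
            //   piCounts: "i"   -> expected count of the sequence starting in state i
            //   aCounts:  "i,j" -> expected count of transitions from state i to state j
            //   bCounts:  "j,o" -> expected count of state j emitting symbol o
            MapWritable piCounts = new MapWritable();
            MapWritable aCounts = new MapWritable();
            MapWritable bCounts = new MapWritable();

            // Omitted in this sketch: run the forward and backward passes over
            // 'observations' and fill the three maps with the expected counts.
            aCounts.put(new Text("0,1"), new DoubleWritable(0.0)); // entry shape only

            context.write(KEY_PI, piCounts);
            context.write(KEY_A, aCounts);
            context.write(KEY_B, bCounts);
          }
        }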

        The M step computes the updated Theta^(i+1) from the values generated during the E step. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics needed to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are obtained by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.
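
        In terms of the expected counts listed above, the relative-frequency updates are the usual Baum-Welch re-estimates:

        updated pi_(i)  = (expected number of training sequences starting in state i) / (total number of training sequences)
        updated a_(i,j) = (expected number of transitions from state i to state j) / (expected number of transitions out of state i)
        updated b_(j,o) = (expected number of times state j emits symbol o) / (expected number of times state j is visited)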

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner, with the Expectation and Maximization steps split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverables and high-level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and deepen my knowledge of Hadoop and Mahout internals. Become thoroughly familiar with the classes in the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.

        May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() and partially complete the Expectation step.

        July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. The mid-term review also falls in this period.
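
        One possible (again purely illustrative) shape of the reducer is sketched below: it sums the expected-count maps received for one of the keys and normalizes them into probabilities for the next iteration. For brevity the sketch normalizes over all events under a key; the real class would normalize per source state for A and per hidden state for B, as in the update formulas given earlier. The class name is an assumption for the sketch, not the final design.

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Reducer;

        /** Illustrative only: sums expected counts for one key and normalizes them. */
        public class BaumWelchReducerSketch
            extends Reducer<Text, MapWritable, Text, MapWritable> {

          @Override
          protected void reduce(Text key, Iterable<MapWritable> counts, Context context)
              throws IOException, InterruptedException {
            // 1. Aggregate: sum the expected counts across all training instances.
            Map<String, Double> total = new HashMap<>();
            for (MapWritable partial : counts) {
              for (Map.Entry<Writable, Writable> e : partial.entrySet()) {
                double c = ((DoubleWritable) e.getValue()).get();
                total.merge(e.getKey().toString(), c, Double::sum);
              }
            }

            // 2. Normalize into relative frequencies (global normalization shown for brevity).
            double sum = 0.0;
            for (double c : total.values()) {
              sum += c;
            }
            MapWritable parameters = new MapWritable();
            for (Map.Entry<String, Double> e : total.entrySet()) {
              double p = sum > 0.0 ? e.getValue() / sum : 0.0;
              parameters.put(new Text(e.getKey()), new DoubleWritable(p));
            }
            context.write(key, parameters);   // written to HDFS for the next iteration
          }
        }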

        July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by pre-aggregating, for each of the three keys, the values destined for the Maximization stage in the reducers. Investigate the possibility of reusing code from the KMeansCombiner class.
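
        A matching combiner sketch, again illustrative only: it performs just the associative part of the aggregation (summing partial counts) and leaves all normalization to the reducer, since Hadoop may run a combiner zero or more times for any given map output.

        import java.io.IOException;
        import java.util.Map;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Reducer;

        /** Illustrative only: sums partial expected-count maps; no normalization here. */
        public class BaumWelchCombinerSketch
            extends Reducer<Text, MapWritable, Text, MapWritable> {

          @Override
          protected void reduce(Text key, Iterable<MapWritable> partials, Context context)
              throws IOException, InterruptedException {
            MapWritable sum = new MapWritable();
            for (MapWritable partial : partials) {
              for (Map.Entry<Writable, Writable> e : partial.entrySet()) {
                double add = ((DoubleWritable) e.getValue()).get();
                DoubleWritable running = (DoubleWritable) sum.get(e.getKey());
                if (running == null) {
                  sum.put(new Text(e.getKey().toString()), new DoubleWritable(add));
                } else {
                  running.set(running.get() + add);
                }
              }
            }
            context.write(key, sum);
          }
        }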

        July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project, for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map-Reduce (extending Mahout) to solve it in a scalable, distributed manner.
        I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning.

        References:

        [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study

        [5] http://www.rcsb.org/pdb/home/home.do

        [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

        [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
        Dhruv Kumar added a comment -

        Updated JIRA proposal: added references, fixed one typo, made the task list more clear. Updated on GSOC.

        Dhruv Kumar made changes -
        Description Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the Baum-Welch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors, but Mahout currently only supports sequential versions of HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

        Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map (a sketch of this key/value layout follows the list):
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).
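
        To make the emitted structure concrete, the minimal sketch below (plain Hadoop Writables only; the method names and the pi_/a_/b_ entry naming are illustrative assumptions, not existing Mahout code) shows one way a single training instance's expected counts could be packed into hash-maps before being emitted under the three keys.

        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;

        public class ExpectedCountEmission {

          /** Packs Pi counts: expected number of times each hidden state starts a sequence. */
          static MapWritable initialStateCounts(double[] gamma0) {
            MapWritable counts = new MapWritable();
            for (int s = 0; s < gamma0.length; s++) {
              counts.put(new Text("pi_" + s), new DoubleWritable(gamma0[s]));
            }
            return counts;
          }

          /** Packs A counts: expected transitions a_(i,j) summed over the observation sequence. */
          static MapWritable transitionCounts(double[][] xiSum) {
            MapWritable counts = new MapWritable();
            for (int i = 0; i < xiSum.length; i++) {
              for (int j = 0; j < xiSum[i].length; j++) {
                counts.put(new Text("a_" + i + "_" + j), new DoubleWritable(xiSum[i][j]));
              }
            }
            return counts;
          }

          /** Packs B counts: expected emissions b_(s,o) summed over the observation sequence. */
          static MapWritable emissionCounts(double[][] gammaBySymbol) {
            MapWritable counts = new MapWritable();
            for (int s = 0; s < gammaBySymbol.length; s++) {
              for (int o = 0; o < gammaBySymbol[s].length; o++) {
                counts.put(new Text("b_" + s + "_" + o), new DoubleWritable(gammaBySymbol[s][o]));
              }
            }
            return counts;
          }
        }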

        The M step computes the updated Theta^(i+1) from the values generated during the E step. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and fed to the mappers in the next iteration.
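
        As a concrete illustration of the relative-frequency update, the short sketch below (plain Java, independent of Hadoop and Mahout; class and method names are illustrative) normalizes one row of expected counts into the corresponding stochastic row of A, B or Pi for the next iteration.

        public class RelativeFrequency {

          /** Normalizes one row of expected counts into a probability distribution. */
          static double[] normalize(double[] expectedCounts) {
            double total = 0.0;
            for (double c : expectedCounts) {
              total += c;
            }
            double[] probabilities = new double[expectedCounts.length];
            for (int i = 0; i < expectedCounts.length; i++) {
              // Guard against an all-zero row (e.g. a state never visited in this split).
              probabilities[i] = total > 0.0 ? expectedCounts[i] / total : 1.0 / expectedCounts.length;
            }
            return probabilities;
          }

          public static void main(String[] args) {
            // Example: expected transition counts out of one hidden state.
            double[] row = {4.0, 1.0, 5.0};
            for (double p : normalize(row)) {
              System.out.printf("%.2f ", p);   // prints 0.40 0.10 0.50
            }
          }
        }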

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner, with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverables and high-level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): This is the pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the existing Mahout code.

        May 23 - June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.
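
        A rough sketch of the iterative structure the HmmDriver could follow is given below: one Hadoop job per Baum-Welch iteration, stopping on a maximum iteration count or a convergence test. Apart from the plain Hadoop API, the names used here (HmmDriverSketch, the hmm.model.path property, the modelConverged() helper) are assumptions for illustration rather than existing Mahout code.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class HmmDriverSketch {

          public static void run(Configuration conf, Path input, Path workDir,
                                 int maxIterations, double epsilon) throws Exception {
            Path previousModel = new Path(workDir, "model-0");
            for (int i = 1; i <= maxIterations; i++) {
              Path nextModel = new Path(workDir, "model-" + i);
              // The mapper's setup() reads the previous iteration's parameters from here.
              conf.set("hmm.model.path", previousModel.toString());

              Job job = new Job(conf, "baum-welch-iteration-" + i);
              job.setJarByClass(HmmDriverSketch.class);
              job.setMapperClass(HmmMapper.class);
              job.setCombinerClass(HmmCombiner.class);
              job.setReducerClass(HmmReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(MapWritable.class);
              FileInputFormat.addInputPath(job, input);
              FileOutputFormat.setOutputPath(job, nextModel);
              if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("Baum-Welch iteration " + i + " failed");
              }

              // Hypothetical convergence test: compare the parameters written to nextModel
              // with previousModel and stop once the change falls below epsilon.
              if (modelConverged(conf, previousModel, nextModel, epsilon)) {
                break;
              }
              previousModel = nextModel;
            }
          }

          private static boolean modelConverged(Configuration conf, Path oldModel,
                                                Path newModel, double epsilon) {
            // Placeholder: a real check would read both parameter sets back from HDFS.
            return false;
          }
        }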

        June 3 - July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.
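
        A possible skeleton for HmmMapper is sketched below, assuming each input value carries one observation sequence as whitespace-separated symbol ids. The model loading and the forward-backward (alpha/beta) computation are left as placeholder comments, since the real implementation would reuse HmmModel and HmmAlgorithms; the hmm.model.path property and the PI/A/B output keys are assumptions carried over from the driver sketch above.

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class HmmMapper extends Mapper<LongWritable, Text, Text, MapWritable> {

          private static final Text PI_KEY = new Text("PI");
          private static final Text A_KEY = new Text("A");
          private static final Text B_KEY = new Text("B");

          @Override
          protected void setup(Context context) throws IOException, InterruptedException {
            String modelPath = context.getConfiguration().get("hmm.model.path");
            // Placeholder: deserialize the model parameters (A, B, Pi) written by the
            // previous iteration from modelPath on HDFS.
          }

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String[] tokens = value.toString().trim().split("\\s+");
            int[] observations = new int[tokens.length];
            for (int t = 0; t < tokens.length; t++) {
              observations[t] = Integer.parseInt(tokens[t]);
            }

            // Placeholder for the E step on this training instance: run the forward and
            // backward passes (alpha, beta) and accumulate expected initial-state,
            // transition and emission counts into the three hash-maps below.
            MapWritable piCounts = new MapWritable();
            MapWritable transitionCounts = new MapWritable();
            MapWritable emissionCounts = new MapWritable();

            context.write(PI_KEY, piCounts);
            context.write(A_KEY, transitionCounts);
            context.write(B_KEY, emissionCounts);
          }
        }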

        July 1 - July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.
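
        A possible skeleton for HmmReducer follows. It finishes the Expectation step by summing, entry by entry, the expected-count hash-maps received for a key; the per-row normalization into the next iteration's parameters is only indicated by a comment, and the key copying guards against Hadoop's reuse of Writable instances. As with the other sketches, this is illustrative rather than existing Mahout code.

        import java.io.IOException;
        import java.util.Map;

        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.MapWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Reducer;

        public class HmmReducer extends Reducer<Text, MapWritable, Text, MapWritable> {

          @Override
          protected void reduce(Text key, Iterable<MapWritable> values, Context context)
              throws IOException, InterruptedException {
            MapWritable summed = new MapWritable();
            for (MapWritable partial : values) {
              for (Map.Entry<Writable, Writable> entry : partial.entrySet()) {
                // Copy the entry key: Hadoop may reuse Writable instances between calls.
                Text countKey = new Text(entry.getKey().toString());
                double count = ((DoubleWritable) entry.getValue()).get();
                DoubleWritable soFar = (DoubleWritable) summed.get(countKey);
                summed.put(countKey, new DoubleWritable((soFar == null ? 0.0 : soFar.get()) + count));
              }
            }
            // Maximization: normalize each row of the summed counts (relative frequency,
            // as in the earlier sketch) before the updated model is written to HDFS.
            context.write(key, summed);
          }
        }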

        July 15 - July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.
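
        Because the entry-wise summing of expected counts is associative and commutative, the combiner could simply reuse the reducer's aggregation logic on each map task's local output. The minimal sketch below assumes the HmmReducer skeleton above only sums (normalization deferred to a later step), so its output types match the reducer's input types; it is registered in the driver via job.setCombinerClass(HmmCombiner.class).

        // Illustrative only: reuses the reduce() of the HmmReducer sketch above as a
        // local, per-map-task partial sum of the expected counts.
        public class HmmCombiner extends HmmReducer {
          // No additional code needed: the inherited reduce() performs the entry-wise
          // sum, and its (Text, MapWritable) output matches the reducer's input types.
        }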

        July 29 - August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class, I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system using MDPs.
        I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming.

        References:

        [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study
        Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the Baum-Welch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors, but Mahout currently only supports sequential versions of HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

        Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map:
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).

        The M step computes the updated Theta^(i+1) from the values generated during the E step. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and fed to the mappers in the next iteration.

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner, with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverables and high-level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): This is the pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the existing Mahout code.

        May 23 - June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

        July 1 - July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.

        July 15 - July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

        July 29 - August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project, for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to compute the optimal policy in a scalable, distributed manner.
        I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming.

        References:

        [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study

        [5] http://www.rcsb.org/pdb/home/home.do

        [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.

        [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision
        Processes. In ICML, 2006.
        Dhruv Kumar added a comment -

        Submitted to the GSoC 2011 website under the Apache Software Foundation.

        Dhruv Kumar added a comment -

        Updated JIRA: added gsoc labels, edited the JIRA title to match project's title.

        Dhruv Kumar made changes -
        Summary: Parallelization of Baum-Welch Algorithm for HMM Training -> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
        Issue Type: New Feature [ 2 ] -> Task [ 3 ]
        Priority: Minor [ 4 ] -> Major [ 3 ]
        Labels: gsoc gsoc2011 mahout-gsoc-11
        Dhruv Kumar added a comment -

        Updated draft 2 after fixing typos and partial rewriting.

        Dhruv Kumar made changes -
        Description Among the unsupervised learning methods available in the current sequential implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is an attractive candidate for a parallel, MapReduce implementation. Although slower than the Viterbi training algorithm, the BW is more numerically stable and provides guaranteed discovery of the Maximum Likelihood Estimator (MLE). In Viterbi training, the MLE is approximated in order to reduce computation time.

        A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm.

        BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, the k-means clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wp-content/uploads/turin-1998.pdf.

        Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality.

        --------- Proposal Begin ---------

        Proposal Title: Implementation of the Baum-Welch Algorithm for parallel
        Hidden Markov Model training on Map-Reduce.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov
        Model as it is numerically stable and provides a guaranteed discovery of
        the Maximum Likelihood Estimator in the presence of incomplete data.
        Currently, Apache Mahout has a sequential implementation of the
        Baum-Welch which cannot be scaled to train over large data-sets. This
        project proposes to extend the sequential implementation of the
        Baum-Welch to a parallel, distributed version using the Map Reduce
        programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMM) are widely used as a probabilistic inference
        tool for applications generating temporal or spatial sequential data.
        Their relative simplicity of implementation and their ability to
        discover latent domain knowledge have made them very popular in fields
        such as DNA sequence alignment, handwriting analysis, voice recognition,
        computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of
        unobservable, hidden states emitting symbols from a finite observable
        vocabulary set O according to a probabilistic model Theta. The
        parameters of the model Theta are defined by the tuple (A, B, Pi) where
        A is a stochastic transition matrix of the hidden states of size |S| X
        |S|. The elements a_(i,j) of A specify the probability of transitioning
        from a state i to state j. Matrix B is a size |S| X |O| stochastic
        symbol emission matrix whose elements b_(s, o) provide the probability
        that a symbol o will be emitted from the hidden state s. The elements
        pi_(s) of the |S| length vector Pi determine the probability that the
        system starts in the hidden state s. The transitions of hidden states
        are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of
        the observation sequence, determine the probability that the model
        generated the observed sequence. This is useful for determining the
        quality of the model and is solved using the so called Forward
        algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation
        sequence, determine the hidden state sequence which generated the
        observed sequence. This can be viewed as an inference problem where the
        model and observed sequence are used to predict the value of the
        unobservable random variables. The backward algorithm, also known as the
        Viterbi decoding algorithm is used for predicting the hidden state
        sequence.

        3. Training: Given the set of hidden states S, the set of observation
        vocabulary O and the observation sequence, determine the parameters (A,
        B, Pi) of the model Theta. This problem can be viewed as a statistical
        machine learning problem of model fitting to a large set of training
        data. The Baum-Welch (BW) algorithm (also called the Forward-Backward
        algorithm) and the Viterbi training algorithm are commonly used for
        model fitting.

        Since most problems modeled by HMMs have a modest number of hidden and
        observable states, the sequential versions of the Forward and the
        Viterbi algorithms (currently implemented in Mahout) are sufficient for
        the evaluation and decoding purposes. However, often the training data
        is so large that a single compute node is incapable of handling it.
        Attempts to prune the training data can lead to lower quality model
        fitting and may miss out on crucial domain knowledge because of inferior
        posterior probabilities. Currently, Mahout only supports sequential HMM
        trainers--Viterbi and Baum-Welch--which are incapable of scaling to large
        data sets.

        This project proposes to extend Mahout's current sequential
        implementation of the Baum Welch HMM trainer to a scalable, distributed
        case. The distributed version of the BW will use the sequential
        implementations of the Forward and the Backward algorithms in each
        iteration, hence a lot of existing HMM training code will be reused.

        Among the two sequential HMM training methods, the Baum-Welch (BW) or
        the Forward-Backward algorithm is superior and a better candidate for a
        parallel implementation for two reasons. (1) The BW is more numerically
        stable and provides guaranteed discovery of the Maximum Likelihood Estimator
        (MLE) (albeit a local maximum), unlike Viterbi training where the MLE is
        approximated in order to reduce computation time. (2) The BW belongs to
        the general class of Expectation Maximization (EM) algorithms which
        naturally fit into the Map Reduce framework [2]. Specifically, the
        parallel implementation of the BW algorithm on Map Reduce has been
        elaborated at great length in [3] and will be followed in the remainder
        of this proposal.

        The BW EM algorithm iteratively refines the model's parameters and
        consists of two distinct steps in each iteration--Expectation (E) and
        Maximization (M). In the distributed case, the E step is computed by the
        mappers and the reducers, while the M is handled by the reducers.
        Starting from an initial Theta^(0), in each iteration i, the model
        parameter tuple Theta^i is input to the algorithm, and the end result
        Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a
        user specified convergence condition expressed as a fixpoint or when the
        number of iterations exceeds a user defined value.

        The E step computes the posterior probability of each latent variable
        for each observed variable, weighted by the relative frequency of the
        observed variable in the input split. The mappers process independent
        training instances and emit expected state transition and emission
        counts using the forward-backward algorithm. The reducers finish E by
        aggregating the expected counts. The input to a mapper consists of (k,
        v_o) pairs where k is a unique key and v_o is a string of observed
        symbols of length n. For each training instance, the mappers emit the
        same set of keys corresponding to the following three optimization
        problems to be solved during the M: (1) expected number of times a
        hidden state is reached, (2) the number of times each observable symbol
        is generated by each hidden state and, (3) the number of transitions
        between each pair of states in the hidden state space.

        The M step computes the updated Theta^(i+1) from the values generated
        during the E part. This involves aggregating the values obtained in the
        E step for each key corresponding to one of the optimization problems.
        The aggregation summarizes the statistics necessary to compute a subset
        of the parameters for the next EM iteration. The optimal parameters for
        the next iteration are arrived at by computing the relative frequency of
        each event with respect to its expected count at the current iteration.
        The optimal parameters emitted by each reducer are written to HDFS
        and are fed to the mappers in the next iteration.

        The project can be subdivided into distinct tasks of programming,
        testing and documenting the driver, mapper, reducer and the combiner. A
        list of milestones with their associated deliverables is given below.

        Timeline: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): Pre-coding tasks. Open communication with
        my mentor, refine the project's plan and requirements, understand the
        code styling requirements, and expand my knowledge of Hadoop and Mahout's
        internals. Thoroughly familiarize myself with the classes within the
        classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer
        and math packages.

        May 23 - June 3 (2 weeks): Driver. Implement and test the class
        HmmDriver by extending the AbstractJob class and by reusing the code
        from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Mapper. Implement and test the class
        HmmMapper. The HmmMapper class will include setup() and map() methods.
        The setup() method will read in the HmmModel and the parameter values
        obtained from the previous iteration. The map() method will call the
        HmmAlgorithms.backwardAlgorithm() and the
        HmmAlgorithms.forwardAlgorithm() to compute the E step partially.

        July 1 - July 15 (2 weeks): Reducer. Implement and test the class
        HmmReducer. The reducer will complete the E step by summing over all the
        occurrences emitted by the mappers for the three optimization problems.
        Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.

        July 15 - July 29 (2 weeks): Combiner. Implement and test the class
        HmmCombiner. The combiner will reduce the network traffic and improve
        efficiency by aggregating the values for each of the three keys
        corresponding to each of the optimization problems for the M stage. Look
        at the possibility of code reuse from KMeansCombiner.

        July 29 - August 15 (2 weeks): Give an example demonstrating the new
        parallel BW algorithm by employing the parts-of-speech tagger data set
        also used by the sequential BW. Tidy up code and fix loose ends, conduct
        final tests, finish any remaining documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical
        and Computer Engineering from the University of Massachusetts Amherst.
        Working under the guidance of Prof. Wayne Burleson, as part of my
        Master's research work, I have applied the theory of Markov Decision
        Process (MDP) to increase the duration of service of mobile computers.
        This semester I am involved with two course projects involving machine
        learning over large data sets. In a Bioinformatics class I am mining the
        RCSB Protein Data Bank to learn the dependence of side chain geometry on
        the protein's secondary structure. In another project for the Online
        Social Networks class, I am building an online recommendation system by
        using MDP theory and recasting the problem as an instance of
        sequential optimal control.

        I owe much to the open source community as all my research experiments
        have only been possible due to the freely available Linux distributions,
        open source performance analyzers, scripting languages such as Python
        and the suite of GNU tools. I have found Apache Mahout's developer
        community vibrant, helpful and welcoming. If selected, I feel that the
        GSOC 2011 project will be a great learning experience for me from both a
        technical and professional standpoint and allow me to contribute within
        my modest means to the overall spirit of open coding.

        References:

        [1] A tutorial on hidden markov models and selected
        applications in speech recognition by Lawrence R. Rabiner. In
        Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang
        K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
        Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
        Dyer. Morgan & Claypool 2010.


        --------- Proposal End -------------
        Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the Baum-Welch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.

        3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.

        In general, the quality of HMM training can be improved by employing large training vectors, but Mahout currently only supports sequential versions of HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons:
        (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
        (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2].

        Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the Expectation-Maximization algorithm and will be followed for implementation in this project.

        The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.

        Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hash-map:
        (1) Expected number of times a hidden state is reached (Pi).
        (2) Number of times each observable symbol is generated by each hidden state (B).
        (3) Number of transitions between each pair of states in the hidden state space (A).

        The M step computes the updated Theta^(i+1) from the values generated during the E step. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and fed to the mappers in the next iteration.

        The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner, with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverables and high-level implementation details is given below.

        Time-line: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): This is the pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Thoroughly familiarize myself with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the existing Mahout code.

        May 23 - June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.

        July 1 - July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.

        July 15 - July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.

        July 29 - August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class, I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system using MDPs.
        I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming.

        References:

        [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.

        [4] http://flexcrfs.sourceforge.net/#Case_Study
        Hide
        Dhruv Kumar added a comment -

        Added proposal here based on Shannon's feedback.

        Show
        Dhruv Kumar added a comment - Added proposal here based on Shannon's feedback.
        Dhruv Kumar made changes -
        Field Original Value New Value
        Description Among the unsupervised learning methods available in the current sequential implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is an attractive candidate for a parallel, MapReduce implementation. Although slower than the Viterbi training algorithm, the BW is more numerically stable and provides guaranteed discovery of the Maximum Likelihood Estimator (MLE). In Viterbi training, the MLE is approximated in order to reduce computation time.

        A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm.

        BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, the k-means clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wp-content/uploads/turin-1998.pdf.

        Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality.
        Among the unsupervised learning methods available in the current sequential implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is an attractive candidate for a parallel, MapReduce implementation. Although slower than the Viterbi training algorithm, the BW is more numerically stable and provides guaranteed discovery of the Maximum Likelihood Estimator (MLE). In Viterbi training, the MLE is approximated in order to reduce computation time.

        A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm.

        BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, the k-means clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wp-content/uploads/turin-1998.pdf.

        Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality.

        --------- Proposal Begin ---------

        Proposal Title: Implementation of the Baum-Welch Algorithm for parallel
        Hidden Markov Model training on Map-Reduce.

        Student Name: Dhruv Kumar

        Student E-mail: dkumar@ecs.umass.edu

        Organization/Project: Apache Mahout

        Assigned Mentor:

        Proposal Abstract:

        The Baum-Welch algorithm is commonly used for training a Hidden Markov
        Model as it is numerically stable and provides a guaranteed discovery of
        the Maximum Likelihood Estimator in the presence of incomplete data.
        Currently, Apache Mahout has a sequential implementation of the
        Baum-Welch which cannot be scaled to train over large data-sets. This
        project proposes to extend the sequential implementation of the
        Baum-Welch to a parallel, distributed version using the Map Reduce
        programming framework to allow scalable Hidden Markov Model training.

        Detailed Description:

        Hidden Markov Models (HMM) are widely used as a probabilistic inference
        tool for applications generating temporal or spatial sequential data.
        Their relative simplicity of implementation and their ability to
        discover latent domain knowledge have made them very popular in fields
        such as DNA sequence alignment, handwriting analysis, voice recognition,
        computer vision and parts-of-speech tagging.

        A HMM is defined as a tuple (S, O, Theta) where S is a finite set of
        unobservable, hidden states emitting symbols from a finite observable
        vocabulary set O according to a probabilistic model Theta. The
        parameters of the model Theta are defined by the tuple (A, B, Pi) where
        A is a stochastic transition matrix of the hidden states of size |S| X
        |S|. The elements a_(i,j) of A specify the probability of transitioning
        from a state i to state j. Matrix B is a size |S| X |O| stochastic
        symbol emission matrix whose elements b_(s, o) provide the probability
        that a symbol o will be emitted from the hidden state s. The elements
        pi_(s) of the |S| length vector Pi determine the probability that the
        system starts in the hidden state s. The transitions of hidden states
        are unobservable and follow the Markov property of memorylessness.

        Rabiner [1] defined three main problems for HMMs:

        1. Evaluation: Given the complete model (S, O, Theta) and a subset of
        the observation sequence, determine the probability that the model
        generated the observed sequence. This is useful for determining the
        quality of the model and is solved using the so called Forward
        algorithm.

        2. Decoding: Given the complete model (S, O, Theta) and an observation
        sequence, determine the hidden state sequence which generated the
        observed sequence. This can be viewed as an inference problem where the
        model and observed sequence are used to predict the value of the
        unobservable random variables. The backward algorithm, also known as the
        Viterbi decoding algorithm is used for predicting the hidden state
        sequence.

        3. Training: Given the set of hidden states S, the set of observation
        vocabulary O and the observation sequence, determine the parameters (A,
        B, Pi) of the model Theta. This problem can be viewed as a statistical
        machine learning problem of model fitting to a large set of training
        data. The Baum-Welch (BW) algorithm (also called the Forward-Backward
        algorithm) and the Viterbi training algorithm are commonly used for
        model fitting.

        Since most problems modeled by HMMs have a modest number of hidden and
        observable states, the sequential versions of the Forward and the
        Viterbi algorithms (currently implemented in Mahout) are sufficient for
        the evaluation and decoding purposes. However, often the training data
        is so large that a single compute node is incapable of handling it.
        Attempts to prune the training data can lead to lower quality model
        fitting and may miss out on crucial domain knowledge because of inferior
        posterior probabilities. Currently, Mahout only supports sequential HMM
        trainers--Viterbi and Baum-Welch--which are incapable of scaling to large
        data sets.

        This project proposes to extend Mahout's current sequential
        implementation of the Baum Welch HMM trainer to a scalable, distributed
        case. The distributed version of the BW will use the sequential
        implementations of the Forward and the Backward algorithms in each
        iteration, hence a lot of existing HMM training code will be reused.

        Of the two sequential HMM training methods, the Baum-Welch (BW) or
        Forward-Backward algorithm is the better candidate for a parallel
        implementation for two reasons. (1) The BW is more numerically stable
        and guarantees discovery of a Maximum Likelihood Estimator (MLE), albeit
        a local maximum, unlike Viterbi training where the MLE is approximated
        in order to reduce computation time. (2) The BW belongs to the general
        class of Expectation Maximization (EM) algorithms which naturally fit
        into the Map-Reduce framework [2]. Specifically, the parallel
        implementation of the BW algorithm on Map-Reduce has been elaborated at
        length in [3] and will be followed in the remainder of this proposal.

        The BW EM algorithm iteratively refines the model's parameters and
        consists of two distinct steps in each iteration--Expectation (E) and
        Maximization (M). In the distributed case, the E step is computed by the
        mappers and the reducers, while the M step is handled by the reducers.
        Starting from an initial Theta^(0), in each iteration i, the model
        parameter tuple Theta^(i) is input to the algorithm, and the end result
        Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a
        user-specified convergence condition expressed as a fixed point, or when
        the number of iterations exceeds a user-defined value.
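
        As an illustration of the fixed-point test, the driver could compare
        successive parameter estimates and stop when they no longer change by
        more than a user supplied epsilon. The sketch below checks only the
        transition matrix and is illustrative; the method name, the use of the
        maximum absolute difference, and the epsilon parameter are assumptions,
        and the emission matrix and Pi would be checked the same way:

        // Fixed-point convergence test on the transition matrix: returns true
        // when no entry of A changed by more than epsilon between iterations.
        static boolean converged(double[][] previousA, double[][] currentA, double epsilon) {
          double maxDiff = 0.0;
          for (int i = 0; i < previousA.length; i++) {
            for (int j = 0; j < previousA[i].length; j++) {
              maxDiff = Math.max(maxDiff, Math.abs(previousA[i][j] - currentA[i][j]));
            }
          }
          return maxDiff < epsilon;
        }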

        The E step computes the posterior probability of each latent variable
        for each observed variable, weighted by the relative frequency of the
        observed variable in the input split. The mappers process independent
        training instances and emit expected state transition and emission
        counts using the forward-backward algorithm. The reducers finish the E
        step by aggregating the expected counts. The input to a mapper consists
        of (k, v_o) pairs where k is a unique key and v_o is a string of
        observed symbols of length n. For each training instance, the mappers
        emit the same set of keys corresponding to the following three
        optimization problems to be solved during the M step: (1) the expected
        number of times each hidden state is reached, (2) the expected number of
        times each observable symbol is generated by each hidden state and (3)
        the expected number of transitions between each pair of states in the
        hidden state space.
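
        To illustrate quantity (3), the expected transition counts for a single
        training instance can be computed from the forward (alpha) and backward
        (beta) variables as sketched below. This is an unscaled, plain Java
        sketch for clarity only; a production implementation would scale alpha
        and beta (or work in log space) and would also accumulate the expected
        initial-state and emission counts in the same pass:

        // Expected number of i -> j transitions in one observation sequence,
        // computed with the unscaled forward-backward recursions.
        static double[][] expectedTransitionCounts(double[][] A, double[][] B,
                                                   double[] Pi, int[] obs) {
          int n = A.length;
          int T = obs.length;
          double[][] alpha = new double[T][n];
          double[][] beta = new double[T][n];
          for (int s = 0; s < n; s++) {
            alpha[0][s] = Pi[s] * B[s][obs[0]];
            beta[T - 1][s] = 1.0;
          }
          for (int t = 1; t < T; t++) {                  // forward pass
            for (int j = 0; j < n; j++) {
              for (int i = 0; i < n; i++) {
                alpha[t][j] += alpha[t - 1][i] * A[i][j];
              }
              alpha[t][j] *= B[j][obs[t]];
            }
          }
          for (int t = T - 2; t >= 0; t--) {             // backward pass
            for (int i = 0; i < n; i++) {
              for (int j = 0; j < n; j++) {
                beta[t][i] += A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j];
              }
            }
          }
          double prob = 0.0;                             // P(obs | model)
          for (int s = 0; s < n; s++) {
            prob += alpha[T - 1][s];
          }
          double[][] xi = new double[n][n];              // expected counts
          for (int t = 0; t < T - 1; t++) {
            for (int i = 0; i < n; i++) {
              for (int j = 0; j < n; j++) {
                xi[i][j] += alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob;
              }
            }
          }
          return xi;
        }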

        The M step computes the updated Theta^(i+1) from the values generated
        during the E step. This involves aggregating the values obtained in the
        E step for each key corresponding to one of the optimization problems.
        The aggregation summarizes the statistics necessary to compute a subset
        of the parameters for the next EM iteration. The optimal parameters for
        the next iteration are arrived at by computing the relative frequency of
        each event with respect to its expected count at the current iteration.
        The optimal parameters emitted by each reducer are written to HDFS and
        are fed to the mappers in the next iteration.
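
        As a sketch of this relative-frequency computation (plain Java, outside
        any Map-Reduce plumbing, with the method name chosen for illustration),
        the aggregated expected transition counts are normalized row by row to
        give the transition matrix for the next iteration; the emission matrix
        and Pi are re-estimated analogously from their own expected counts:

        // M step sketch: each row of expected i -> j transition counts is
        // normalized by its row sum to yield the re-estimated matrix A.
        static double[][] reestimateTransitions(double[][] expectedTransitionCounts) {
          int n = expectedTransitionCounts.length;
          double[][] newA = new double[n][n];
          for (int i = 0; i < n; i++) {
            double rowSum = 0.0;
            for (int j = 0; j < n; j++) {
              rowSum += expectedTransitionCounts[i][j];
            }
            for (int j = 0; j < n; j++) {
              newA[i][j] = rowSum > 0.0 ? expectedTransitionCounts[i][j] / rowSum : 0.0;
            }
          }
          return newA;
        }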

        The project can be subdivided into distinct tasks of programming,
        testing and documenting the driver, mapper, reducer and the combiner. A
        list of milestones with their associated deliverables is given below.

        Timeline: April 26 - Aug 15.

        Milestones:

        April 26 - May 22 (4 weeks): Pre-coding tasks. Open communication with
        my mentor, refine the project's plan and requirements, understand the
        code styling requirements, expand my knowledge of Hadoop's and Mahout's
        internals. Thoroughly familiarize myself with the classes within the
        classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer
        and math packages.

        May 23 - June 3 (2 weeks): Driver. Implement and test the class
        HmmDriver by extending the AbstractJob class and by reusing the code
        from the KMeansDriver class.

        June 3 - July 1 (4 weeks): Mapper. Implement and test the class
        HmmMapper. The HmmMapper class will include setup() and map() methods.
        The setup() method will read in the HmmModel and the parameter values
        obtained from the previous iteration. The map() method will call the
        HmmAlgorithms.backwardAlgorithm() and the
        HmmAlgorithms.forwardAlgorithm() to compute the E step partially.

        July 1 - July 15 (2 weeks): Reducer. Implement and test the class
        HmmReducer. The reducer will complete the E step by summing over all the
        occurrences emitted by the mappers for the three optimization problems.
        Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.

        July 15 - July 29 (2 weeks): Combiner. Implement and test the class
        HmmCombiner. The combiner will reduce network traffic and improve
        efficiency by aggregating the values for each of the three keys
        corresponding to the optimization problems for the M step (a minimal
        sketch is given after the milestones). Look at the possibility of code
        reuse from KMeansCombiner.

        July 29 - August 15 (2 weeks): Give an example demonstrating the new
        parallel BW algorithm by employing the parts-of-speech tagger data set
        also used by the sequential BW. Tidy up code and fix loose ends, conduct
        final tests, finish any remaining documentation.
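
        As promised in the combiner milestone above, the following is a minimal
        sketch of what the partial aggregation could look like. The class name,
        the choice of Text keys and Mahout VectorWritable values, and the
        overall key layout are assumptions for illustration, not the final
        HmmCombiner design; the reducer's aggregation step would be
        structurally identical but global:

        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.math.VectorWritable;

        // Combiner sketch: partially sums the expected-count vectors emitted
        // by one mapper for each of the three keys, reducing the volume of
        // data shuffled to the reducers.
        public class HmmCombinerSketch
            extends Reducer<Text, VectorWritable, Text, VectorWritable> {

          @Override
          protected void reduce(Text key, Iterable<VectorWritable> values, Context context)
              throws IOException, InterruptedException {
            Vector sum = null;
            for (VectorWritable value : values) {
              // Clone the first vector because Hadoop may reuse the value
              // object across iterations; plus() returns a new vector.
              sum = (sum == null) ? value.get().clone() : sum.plus(value.get());
            }
            if (sum != null) {
              context.write(key, new VectorWritable(sum));
            }
          }
        }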

        Additional Information:

        I am in the final stages of finishing my Master's degree in Electrical
        and Computer Engineering from the University of Massachusetts Amherst.
        Working under the guidance of Prof. Wayne Burleson, as part of my
        Master's research work, I have applied the theory of Markov Decision
        Processes (MDPs) to increase the duration of service of mobile computers.
        This semester I am involved with two course projects involving machine
        learning over large data sets. In a Bioinformatics class I am mining the
        RCSB Protein Data Bank to learn the dependence of side chain geometry on
        the protein's secondary structure. In another project for the Online
        Social Networks class, I am building an online recommendation system by
        using the MDP theory and recasting the problem to be an instance of
        sequential optimal control.

        I owe much to the open source community as all my research experiments
        have only been possible due to the freely available Linux distributions,
        open source performance analyzers, scripting languages such as Python
        and the suite of GNU tools. I have found Apache Mahout's developer
        community vibrant, helpful and welcoming. If selected, I feel that the
        GSOC 2011 project will be a great learning experience for me from both a
        technical and professional standpoint and allow me to contribute within
        my modest means to the overall spirit of open coding.

        References:

        [1] A Tutorial on Hidden Markov Models and Selected Applications in
        Speech Recognition by Lawrence R. Rabiner. In Proceedings of the IEEE,
        Vol. 77 (1989), pp. 257-286.

        [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang
        K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
        Olukotun. In NIPS (2006), pp. 281-288.

        [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
        Dyer. Morgan & Claypool 2010.


        --------- Proposal End -------------
        Robin Anil added a comment -

        You can discuss on the group and update the JIRA with concrete plans

        Dhruv Kumar added a comment -

        How are the GSoC proposals discussed in Mahout?

        I was wondering if I should continue with the proposal here by editing the JIRA or email it to the dev list.

        Ted Dunning added a comment -

        Sounds like a good project, especially if it helps the community learn about issues for code reuse
        in EM algorithms.

        Dhruv Kumar added a comment -

        As suggested by Ted, I'm creating this JIRA issue to foster feedback. Like I mentioned on the dev-list, while I'm working on this issue for a Bioinformatics class project, I'd be happy to extend it for a GSoC 2011 proposal.

        Dhruv Kumar created issue -

          People

          • Assignee: Grant Ingersoll
          • Reporter: Dhruv Kumar
          • Votes: 2
          • Watchers: 13
