Details

Type: Task

Status: Closed

Priority: Major

Resolution: Won't Fix

Affects Version/s: 0.4, 0.5

Fix Version/s: 0.10.0

Component/s: Classification

Labels:
Description
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar
Student Email: dkumar@ecs.umass.edu
Organization/Project: Apache Mahout
Assigned Mentor:
Proposal Abstract:
The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's BaumWelch to a parallel, distributed version using the MapReduce programming framework for enhanced model fitting over large data sets.
Detailed Description:
Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging.
A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.
Rabiner [1] defined three main problems for HMMs:
1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm.
2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence.
3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons:
(1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.
(2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2], such as the existing Map Reduce implementation of kmeans in Mahout.
Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project.
The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value.
Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap:
(1) Expected number of times a hidden state is reached (Pi).
(2) Number of times each observable symbol is generated by each hidden state (B).
(3) Number of transitions between each pair of states in the hidden state space (A).
The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration.
The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since kmeans is also an EM algorithm, particular attention will be paid to its code at each step for possible reuse.
A list of milestones, associated deliverable and high level implementation details is given below.
Timeline: April 26  Aug 15.
Milestones:
April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages.
May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class.
June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially.
July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review.
July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class.
July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation.
Additional Information:
I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner.
I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning.
References:
[1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286.
[2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288.
[3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.
[4] http://flexcrfs.sourceforge.net/#Case_Study
[5] http://www.rcsb.org/pdb/home/home.do
[6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
[7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Status  Resolved [ 5 ]  Closed [ 6 ] 
Fix Version/s  0.10.0 [ 12329709 ] 
Fix Version/s  Backlog [ 12318886 ] 
Status  Patch Available [ 10002 ]  Resolved [ 5 ] 
Resolution  Won't Fix [ 2 ] 
Fix Version/s  Backlog [ 12318886 ]  
Fix Version/s  0.9 [ 12324577 ] 
Fix Version/s  0.9 [ 12324577 ]  
Fix Version/s  0.8 [ 12320153 ] 
Attachment  MAHOUT627.patch [ 12544947 ] 
Attachment  screenshot.png [ 12526870 ] 
Fix Version/s  0.8 [ 12320153 ]  
Fix Version/s  0.7 [ 12319261 ] 
Fix Version/s  0.7 [ 12319261 ]  
Fix Version/s  0.6 [ 12316364 ] 
Attachment  MAHOUT627.patch [ 12507850 ] 
Attachment  MAHOUT627.patch [ 12506598 ] 
Attachment  MAHOUT627.patch [ 12493862 ] 
Attachment  MAHOUT627.patch [ 12491473 ] 
Attachment  MAHOUT627.patch [ 12491218 ] 
Attachment  MAHOUT627.patch [ 12490564 ] 
Status  Open [ 1 ]  Patch Available [ 10002 ] 
Attachment  MAHOUT627.patch [ 12486400 ] 
Attachment  MAHOUT627.patch [ 12486067 ] 
Attachment  MAHOUT627.patch [ 12484498 ] 
Attachment  MAHOUT627.patch [ 12483899 ] 
Status  Patch Available [ 10002 ]  Open [ 1 ] 
Status  Open [ 1 ]  Patch Available [ 10002 ] 
Assignee  Grant Ingersoll [ gsingers ]  
Fix Version/s  0.6 [ 12316364 ]  
Affects Version/s  0.5 [ 12315255 ] 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's BaumWelch to a parallel, distributed version using the MapReduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2], such as the existing Map Reduce implementation of kmeans in Mahout. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since kmeans is also an EM algorithm, particular attention will paid to its code at each step for possible reuse. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's BaumWelch to a parallel, distributed version using the MapReduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2], such as the existing Map Reduce implementation of kmeans in Mahout. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since kmeans is also an EM algorithm, particular attention will be paid to its code at each step for possible reuse. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's BaumWelch to a parallel, distributed version using the MapReduce programming framework for enhanced model fitting over large data sets. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2], such as the existing Map Reduce implementation of kmeans in Mahout. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. Since kmeans is also an EM algorithm, particular attention will paid to its code at each step for possible reuse. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model when used for prediction. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and as an effect, constraints generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and constrains generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model because of its superior numerical stability and its ability to guarantee the discovery of a locally maximum, Maximum Likelihood Estimator, in the presence of incomplete training data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This restriction reduces the quality of training and as an effect, constraints generalization of the learned model in production environments. This project proposes to extend Mahout's sequential implementation of the BaumWelch to a parallel, distributed version using the MapReduce programming framework to allow training at a large scale for enhanced model fitting. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Relative simplicity of implementation, combined with their ability to discover latent domain knowledge have made them very popular in diverse fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition, computer vision, language translation and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for evaluating the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is superior, accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum, Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the MapReduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and associated documentation. After joining the Apache Mahout's developer mailing list a few weeks ago, I have found the community extremely vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and will also allow me to contribute within my modest means to the overall spirit of open source programming and Machine Learning. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] Probabilistic inference for solving discrete and continuous state Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): This is the precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. Also, midterm review. July 15  July 29 (2 weeks): Work on Combiner. Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Description 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): This is the precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Espectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project for the Online Social Networks class, I am building an online recommendation system using the MDP. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study 
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): This is the precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach used in [6]. In another project for the Online Social Networks class, I am using reinforcement learning to build an online recommendation system by reformulating MDP optimal policy search as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable, distributed manner. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study [5] http://www.rcsb.org/pdb/home/home.do [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics. 2010 Jun 5. [7] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov Decision Processes. In ICML, 2006. 
Summary  Parallelization of BaumWelch Algorithm for HMM Training  BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training. 
Issue Type  New Feature [ 2 ]  Task [ 3 ] 
Priority  Minor [ 4 ]  Major [ 3 ] 
Labels  gsoc gsoc2011 mahoutgsoc11 
Description 
Among the unsupervised learning methods available in the current sequential implementation of HMM training ( A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm. BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://wwwcs.stanford.edu/~ang/papers/nips06mapreducemulticore.pdf. Also, the kmeans clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wpcontent/uploads/turin1998.pdf. Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality.  Proposal Begin  Proposal Title: Implementation of the BaumWelch Algorithm for parallel Hidden Markov Model training on MapReduce. Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large datasets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMM) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. Since most problems modeled by HMMs have a modest number of hidden and observable states, the sequential versions of the Forward and the Viterbi algorithms (currently implemented in Mahout) are sufficient for the evaluation and decoding purposes. However, often the training data is so large that a single compute node is incapable of handling it. Attempts to prune the training data can lead to lower quality model fitting and may miss out on crucial domain knowldege because of inferior posterior probabilities. Currently, Mahout only supports sequential HMM trainersViterbi and Baum Welch which are incapable of scaling to large data sets. This project proposes to extend Mahout's current sequential implementation of the Baum Welch HMM trainer to a scalable, distributed case. The distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms in each iteration, hence a lot of exisitng HMM training code will be reused. Among the two sequential HMM training methods, the BaumWelch (BW) or the ForwardBackward algorithm is superiand a better candidate for a parallel implementation for two reasons. (1) The BW is more numerically stable and provides guaranteed discovery of Maximum Likelihood Estimator (MLE) (albiet a local maximum) unlike Viterbi training where the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] and will be followed in the remainder of this proposal. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation (E) and Maximization (M). In the distributed case, the E step is computed by the mappers and the reducers, while the M is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. The E step computes the posterior probability of each latent varaiable for each observed variable, weighed by the relative frequency of the observered variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish E by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observerd symbols of length n. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the M: (1) expected number of times a hidden state is reached, (2) the number of times each observable symbol is generated by each hidden state and, (3) the number of transitions between each pair of states in the hidden state space. The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values obtained in the E step for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner. A list of milestones with their associated deliverables is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Pre coding tasks. Open communication with my mentor, refine the project's plan and requirements, understand the code styling requirements, expand the knowledge on Hadoop and Mahout's internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Driver. Implement and test the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Mapper. Implement and test the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() to compute the E step partially. July 1  July 15 (2 weeks): Reducer. Implement and test the class HmmReducer. The reducer will complete the E step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Combiner. Implement and test the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the M stage. Look at the possibility of code reuse from KMeansCombiner. July 29  August 15 (2 weeks): Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW. Tidy up code and fix loose ends, conduct final tests, finish any remaining documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project for the Online Social Networks class, I am building an online recommendation system by using the MDP theory and recasting the problem to be an instance of sequential optimal control. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, open source performance analyzers, scripting languages such as Python and the suite of GNU tools. I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and allow me to contribute within my modest means to the overall spirit of open coding. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.  Proposal End  
Proposal Title: BaumWelch Algorithm on MapReduce for Parallel Hidden Markov Model Training.
Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. In general, the quality of HMM training can be improved by employing large training vectors but currently, Mahout only supports sequential versions of HMM trainers which are incapable of scaling. Among the Viterbi and the BaumWelch training methods, the BaumWelch algorithm is slower but more accurate, and a better candidate for a parallel implementation for two reasons: (1) The BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Hence, this project proposes to extend Mahout's current sequential implementation of the BaumWelch HMM trainer to a scalable, distributed case. Since the distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] by viewing it as a specific case of the ExpectationMaximization algorithm and will be followed for implementation in this project. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. Expectation computes the posterior probability of each latent variable for each observed variable, weighed by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish Expectation by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the Maximization, and their values in a hashmap: (1) Expected number of times a hidden state is reached (Pi). (2) Number of times each observable symbol is generated by each hidden state (B). (3) Number of transitions between each pair of states in the hidden state space (A). The M step computes the updated Theta(i+1) from the values generated during the E part. This involves aggregating the values (as hashmaps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split between them. For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, associated deliverable and high level implementation details is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): This is the precoding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the exisitng Mahout code. May 23  June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and complete the Expectation step partially. July 1  July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Espectation step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner class. July 29  August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project for the Online Social Networks class, I am building an online recommendation system using the MDP. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and also allow me to contribute within my modest means to the overall spirit of open source programming. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010. [4] http://flexcrfs.sourceforge.net/#Case_Study 
Field  Original Value  New Value 

Description 
Among the unsupervised learning methods available in the current sequential implementation of HMM training ( A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm. BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://wwwcs.stanford.edu/~ang/papers/nips06mapreducemulticore.pdf. Also, the kmeans clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wpcontent/uploads/turin1998.pdf. Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality. 
Among the unsupervised learning methods available in the current sequential implementation of HMM training ( A parallel, MapReduce implementation of BW will allow scalable model learning over large data sets. The resulting model can be used for prediction using the current sequential implementation of the Viterbi decoding algorithm. BW is a general case of the Expectation Maximization (EM) algorithm and it is shown that all EM algorithms are candidates for a MapReduce implementation: http://wwwcs.stanford.edu/~ang/papers/nips06mapreducemulticore.pdf. Also, the kmeans clustering is an EM algorithm and has an existing MapReduce implementation in Mahout which hints at the possibility of code reuse to aid parallel BW development. The other possibility for a parallel implementation would be to use the notion of matrix probability as shown by Turin: http://tagh.de/tom/wpcontent/uploads/turin1998.pdf. Although non trivial, a parallel BW implementation would greatly add to the existing set of Mahout's HMM functionality.  Proposal Begin  Proposal Title: Implementation of the BaumWelch Algorithm for parallel Hidden Markov Model training on MapReduce. Student Name: Dhruv Kumar Student Email: dkumar@ecs.umass.edu Organization/Project: Apache Mahout Assigned Mentor: Proposal Abstract: The BaumWelch algorithm is commonly used for training a Hidden Markov Model as it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of the BaumWelch which cannot be scaled to train over large datasets. This project proposes to extend the sequential implementation of the BaumWelch to a parallel, distributed version using the Map Reduce programming framework to allow scalable Hidden Markov Model training. Detailed Description: Hidden Markov Models (HMM) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and partsofspeech tagging. A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic transition matrix of the hidden states of size S X S. The elements a_(i,j) of A specify the probability of transitioning from a state i to state j. Matrix B is a size S X O stochastic symbol emission matrix whose elements b_(s, o) provide the probability that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the S length vector Pi determine the probability that the system starts in the hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness. Rabiner [1] defined three main problems for HMMs: 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model and is solved using the so called Forward algorithm. 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the value of the unobservable random variables. The backward algorithm, also known as the Viterbi decoding algorithm is used for predicting the hidden state sequence. 3. Training: Given the set of hidden states S, the set of observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as a statistical machine learning problem of model fitting to a large set of training data. The BaumWelch (BW) algorithm (also called the ForwardBackward algorithm) and the Viterbi training algorithm are commonly used for model fitting. Since most problems modeled by HMMs have a modest number of hidden and observable states, the sequential versions of the Forward and the Viterbi algorithms (currently implemented in Mahout) are sufficient for the evaluation and decoding purposes. However, often the training data is so large that a single compute node is incapable of handling it. Attempts to prune the training data can lead to lower quality model fitting and may miss out on crucial domain knowldege because of inferior posterior probabilities. Currently, Mahout only supports sequential HMM trainersViterbi and Baum Welch which are incapable of scaling to large data sets. This project proposes to extend Mahout's current sequential implementation of the Baum Welch HMM trainer to a scalable, distributed case. The distributed version of the BW will use the sequential implementations of the Forward and the Backward algorithms in each iteration, hence a lot of exisitng HMM training code will be reused. Among the two sequential HMM training methods, the BaumWelch (BW) or the ForwardBackward algorithm is superiand a better candidate for a parallel implementation for two reasons. (1) The BW is more numerically stable and provides guaranteed discovery of Maximum Likelihood Estimator (MLE) (albiet a local maximum) unlike Viterbi training where the MLE is approximated in order to reduce computation time. (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which naturally fit into the Map Reduce framework [2]. Specifically, the parallel implementation of the BW algorithm on Map Reduce has been elaborated at great length in [3] and will be followed in the remainder of this proposal. The BW EM algorithm iteratively refines the model's parameters and consists of two distinct steps in each iterationExpectation (E) and Maximization (M). In the distributed case, the E step is computed by the mappers and the reducers, while the M is handled by the reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint or when the number of iterations exceeds a user defined value. The E step computes the posterior probability of each latent varaiable for each observed variable, weighed by the relative frequency of the observered variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forwardbackward algorithm. The reducers finish E by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique key and v_o is a string of observerd symbols of length n. For each training instance, the mappers emit the same set of keys corresponding to the following three optimization problems to be solved during the M: (1) expected number of times a hidden state is reached, (2) the number of times each observable symbol is generated by each hidden state and, (3) the number of transitions between each pair of states in the hidden state space. The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values obtained in the E step for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived by computing the relative frequency of each event with respect to its expected count at the current iteration. The emitted optimal parameters by each reducer are written to the HDFS and are fed to the mappers in the next iteration. The project can be subdivided into distinct tasks of programming, testing and documenting the driver, mapper, reducer and the combiner. A list of milestones with their associated deliverables is given below. Timeline: April 26  Aug 15. Milestones: April 26  May 22 (4 weeks): Pre coding tasks. Open communication with my mentor, refine the project's plan and requirements, understand the code styling requirements, expand the knowledge on Hadoop and Mahout's internals. Thoroughly familiarize with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. May 23  June 3 (2 weeks): Driver. Implement and test the class HmmDriver by extending the AbstractJob class and by reusing the code from the KMeansDriver class. June 3  July 1 (4 weeks): Mapper. Implement and test the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() to compute the E step partially. July 1  July 15 (2 weeks): Reducer. Implement and test the class HmmReducer. The reducer will complete the E step by summing over all the occurences emitted by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if possible. July 15  July 29 (2 weeks): Combiner. Implement and test the class HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to each of the optimization problems for the M stage. Look at the possibility of code reuse from KMeansCombiner. July 29  August 15 (2 weeks): Give an example demonstrating the new parallel BW algorithm by employing the partsofspeech tagger data set also used by the sequential BW. Tidy up code and fix loose ends, conduct final tests, finish any remaining documentation. Additional Information: I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson, as part of my Master's research work, I have applied the theory of Markov Decision Process (MDP) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on the protein's secondary structure. In another project for the Online Social Networks class, I am building an online recommendation system by using the MDP theory and recasting the problem to be an instance of sequential optimal control. I owe much to the open source community as all my research experiments have only been possible due to the freely available Linux distributions, open source performance analyzers, scripting languages such as Python and the suite of GNU tools. I have found the Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSOC 2011 project will be a great learning experience for me from both a technical and professional standpoint and allow me to contribute within my modest means to the overall spirit of open coding. References: [1] A tutorial on hidden markov models and selected applications in speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257286. [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281288. [3] DataIntensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan & Claypool 2010.  Proposal End  