# Proposal for Implementing Hidden Markov Model

## Details

• Type: New Feature
• Status: Closed
• Priority: Minor
• Resolution: Fixed
• Affects Version/s: 0.4
• Fix Version/s:
• Component/s: None
• Labels:
None

## Description

#### Overview

This is a project proposal for a summer-term university project to write a (sequential) HMM implementation for Mahout. Five students will work on this project as part of a course mentored by Isabel Drost.

#### Abstract:

Hidden Markov Models are used in multiple areas of Machine Learning, such as speech recognition, handwritten letter recognition or natural language processing. A Hidden Markov Model (HMM) is a statistical model of a process consisting of two (in our case discrete) random variables O and Y, which change their state sequentially. The variable Y with states

{y_1, ... , y_n}

is called the "hidden variable", since its state is not directly observable. The state of Y changes sequentially with a so called - in our case first-order - Markov Property. This means, that the state change probability of Y only depends on its current state and does not change in time. Formally we write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). The variable O with states

{o_1, ... , o_m}

is called the "observable variable", since its state can be directly observed. O does not have a Markov Property, but its state propability depends statically on the current state of Y.

Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A[i,j]=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "observation matrix" containing the observation probabilties such that B[i,j]= P(O=o_i|Y=y_j).

Rabiner [1] defined three main problems for HMM models:

1. Evaluation: Given a sequence O of observations and a model M, what is the probability P(O|M) that sequence O was generated by model M. The Evaluation problem can be efficiently solved using the Forward algorithm
2. Decoding: Given a sequence O of observations and a model M, what is the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to generate this sequence. The Decoding problem can be efficiently sovled using the Viterbi algorithm.
3. Learning: Given a sequence O of observations, what is the most likely model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem can be efficiently solved using the Baum-Welch algorithm.

The target of each milestone is defined as the implementation for the given goals, the respective documentation and unit tests for the implementation.

#### Timeline

Mid of May 2010 - Mid of July 2010

#### Milestones

I) Define an HMM class based on Apache Mahout Math package offering interfaces to set model parameters, perform consistency checks, perform output prediction.
1 week from May 18th till May 25th.

II) Write sequential implementations of forward (cf. problem 1 [1]) and backward algorithm.
2 weeks from May 25th till June 8th.

III) Write a sequential implementation of Viterbi algorithm (cf. problem 2 [1]), based on existing forward algorithm implementation.
2 weeks from June 8th till June 22nd

IV) Have a running sequential implementation of Baum-Welch algorithm for model parameter learning (application II [ref]), based on existing forward/backward algorithm implementation.
2 weeks from June 8th till June 22nd

V) Provide a usage example of HMM implementation, demonstrating all three problems.
2 weeks from June 22nd till July 6th

VI) Finalize documentation and implemenation, clean up open ends.
1 week from July 6th till July 13th

#### References:

[1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.

## Attachments

1. MAHOUT-396.diff
112 kB
Isabel Drost-Fromm
2. MAHOUT-396.patch
108 kB
Max Heimel
3. MAHOUT-396.diff
106 kB
Max Heimel
4. MAHOUT-396.diff
46 kB
Max Heimel
5. MAHOUT-396.diff
43 kB
Max Heimel
6. MAHOUT-396.diff
28 kB
Max Heimel
7. MAHOUT-396.diff
22 kB
Max Heimel

## Activity

Hide
Ted Dunning added a comment -

Great stuff.

Hopefully it will be possible to adapt this to a parallel implementation. Since the Baum-Welch algorithm is an instance of an EM algorithm map-reduce implementation should be simple, much as it is with k-means. Moreover, much of the code in a map-reduce implementation would be shared with a sequential version.

Show
Ted Dunning added a comment - Great stuff. Hopefully it will be possible to adapt this to a parallel implementation. Since the Baum-Welch algorithm is an instance of an EM algorithm map-reduce implementation should be simple, much as it is with k-means. Moreover, much of the code in a map-reduce implementation would be shared with a sequential version.
Hide
Max Heimel added a comment -

This patch adds the base HMM model class and a prediction mechanism to predict a sequence of output states from A model. Additionally, the stubs for further implementation are added. We still have to add unit tests for the HMM base class, which will be done with the next update (adding forward/backward algorithm and further helper methods).

Show
Max Heimel added a comment - This patch adds the base HMM model class and a prediction mechanism to predict a sequence of output states from A model. Additionally, the stubs for further implementation are added. We still have to add unit tests for the HMM base class, which will be done with the next update (adding forward/backward algorithm and further helper methods).
Hide
Max Heimel added a comment -

renamed patch file to stick with naming conventions

Show
Max Heimel added a comment - renamed patch file to stick with naming conventions
Hide
Robin Anil added a comment -

Few comments on the code tyle. See Mahout core classes to see the code and naming conventions

1. package org.apache.mahout.sequenceLearning.hmm;
remove capitalization and change to package org.apache.mahout.classifier.sequencelearning.hmm;

2. Use the eclipse code style sheet.

3. java.util.Vector<Integer> outputSequence
dont use Java.vector, kinda confuses with the Mahout vector. For integer arrays use the super-fast o.a.m.math.list.IntArrayList

4. Before you put the final patch, put the Apache License too.

Show
Robin Anil added a comment - Few comments on the code tyle. See Mahout core classes to see the code and naming conventions 1. package org.apache.mahout.sequenceLearning.hmm; remove capitalization and change to package org.apache.mahout.classifier.sequencelearning.hmm; 2. Use the eclipse code style sheet. 3. java.util.Vector<Integer> outputSequence dont use Java.vector, kinda confuses with the Mahout vector. For integer arrays use the super-fast o.a.m.math.list.IntArrayList 4. Before you put the final patch, put the Apache License too.
Hide
Benson Margulies added a comment -

Why inside 'classifier'? Are you defining all sequence problems as classification?

Show
Benson Margulies added a comment - Why inside 'classifier'? Are you defining all sequence problems as classification?
Hide
Grant Ingersoll added a comment -

Would be great if you could fill in the HMM section under https://cwiki.apache.org/MAHOUT/algorithms.html

Show
Grant Ingersoll added a comment - Would be great if you could fill in the HMM section under https://cwiki.apache.org/MAHOUT/algorithms.html
Hide
Max Heimel added a comment -

Worked in the suggestions made by Robin Anil (move to classifier/sequencelearning/hmm, use eclipe code style sheet, add apache licese, use IntArrayList instead of java.util.vector<Integer>).

Show
Max Heimel added a comment - Worked in the suggestions made by Robin Anil (move to classifier/sequencelearning/hmm, use eclipe code style sheet, add apache licese, use IntArrayList instead of java.util.vector<Integer>).
Hide
Max Heimel added a comment -

With this patch, the HMM implementation contains the forward and backward algorithm plus unit tests. There might however still be interface changes at how the HMMAlgorithms Object is handled...

Show
Max Heimel added a comment - With this patch, the HMM implementation contains the forward and backward algorithm plus unit tests. There might however still be interface changes at how the HMMAlgorithms Object is handled...
Hide
Max Heimel added a comment -

Patch now contains implementation & unit tests of Viterbi algorithm for HMM decoding.

Show
Max Heimel added a comment - Patch now contains implementation & unit tests of Viterbi algorithm for HMM decoding.
Hide
Max Heimel added a comment -

I just submitted the latest HMM patch, which contains implementations for the forward/backward and Viterbi algorithm. This allows solving the problems of decoding and evaluatiing a Hidden Markov model. The next step on the HMM Agenda will be the Baum-Welch algorithm to tackle the learning problem.

The open to-dos are:
1) Implement the Baum-Welch algorithm and write unit tests for the implementation. (I hope to finish this task by next week... )
2) Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefullness (where to find good data sets?)
3) Implement methods for serializing/deserializing an HMMModel (to file? to string? both?)
4) Write logarithmically scaled versions of the Hmm related algorithms to better handle numeric issues with very low probabiliites.
5) Think about adding interfaces to the HmmEvaluator methods, that accept state names instead of state numbers to provide more user convenience in using the HMM implementation.

Do you have any further remarks, comments, suggestions?

Show
Max Heimel added a comment - I just submitted the latest HMM patch, which contains implementations for the forward/backward and Viterbi algorithm. This allows solving the problems of decoding and evaluatiing a Hidden Markov model. The next step on the HMM Agenda will be the Baum-Welch algorithm to tackle the learning problem. The open to-dos are: 1) Implement the Baum-Welch algorithm and write unit tests for the implementation. (I hope to finish this task by next week... ) 2) Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefullness (where to find good data sets?) 3) Implement methods for serializing/deserializing an HMMModel (to file? to string? both?) 4) Write logarithmically scaled versions of the Hmm related algorithms to better handle numeric issues with very low probabiliites. 5) Think about adding interfaces to the HmmEvaluator methods, that accept state names instead of state numbers to provide more user convenience in using the HMM implementation. Do you have any further remarks, comments, suggestions?
Hide
Qiuyan Xu added a comment -

Marc and I will do the 2. task:
Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefullness

Show
Qiuyan Xu added a comment - Marc and I will do the 2. task: Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefullness
Hide
Isabel Drost-Fromm added a comment -

Assigning to Max Heimel who is doing most of the implementation and integration work so far.

Show
Isabel Drost-Fromm added a comment - Assigning to Max Heimel who is doing most of the implementation and integration work so far.
Hide
Max Heimel added a comment -

Patch containing the full HMM implementation. This includes implementations of forward, backward, Viterbi algorithm and three learning algorithms (supervised, Viterbi, Baum-Welch). All algorithms are available in a normal and log-scaled variant. The HMM model can now be exported to JSON.
This patch also contains a small demo application that implements a POS tagger using the HMM implementation. Using the test/training data sets from http://flexcrfs.sourceforge.net/#Case_Study this demo-application achieves a tagging accuracy of 94% by applying supervised learning. The learning and tagging process takes less than a second on a 2,6Ghz AMD dual core processor with 5Gb of RAM running Ubuntu 10.04.

Show
Max Heimel added a comment - Patch containing the full HMM implementation. This includes implementations of forward, backward, Viterbi algorithm and three learning algorithms (supervised, Viterbi, Baum-Welch). All algorithms are available in a normal and log-scaled variant. The HMM model can now be exported to JSON. This patch also contains a small demo application that implements a POS tagger using the HMM implementation. Using the test/training data sets from http://flexcrfs.sourceforge.net/#Case_Study this demo-application achieves a tagging accuracy of 94% by applying supervised learning. The learning and tagging process takes less than a second on a 2,6Ghz AMD dual core processor with 5Gb of RAM running Ubuntu 10.04.
Hide
Max Heimel added a comment -

Patch containing the full HMM implementation. This includes implementations of forward, backward, Viterbi algorithm and three learning algorithms (supervised, Viterbi, Baum-Welch). All algorithms are available in a normal and log-scaled variant. The HMM model can now be exported to JSON.
This patch also contains a small demo application that implements a POS tagger using the HMM implementation. Using the test/training data sets from http://flexcrfs.sourceforge.net/#Case_Study this demo-application achieves a tagging accuracy of 94% by applying supervised learning. The learning and tagging process takes less than a second on a 2,6Ghz AMD dual core processor with 5Gb of RAM running Ubuntu 10.04.

Show
Max Heimel added a comment - Patch containing the full HMM implementation. This includes implementations of forward, backward, Viterbi algorithm and three learning algorithms (supervised, Viterbi, Baum-Welch). All algorithms are available in a normal and log-scaled variant. The HMM model can now be exported to JSON. This patch also contains a small demo application that implements a POS tagger using the HMM implementation. Using the test/training data sets from http://flexcrfs.sourceforge.net/#Case_Study this demo-application achieves a tagging accuracy of 94% by applying supervised learning. The learning and tagging process takes less than a second on a 2,6Ghz AMD dual core processor with 5Gb of RAM running Ubuntu 10.04.
Hide
Max Heimel added a comment -

The latest patch contains the finished implementation and test suite of sequential Hidden Markov Models for Apache Mahout and should be ready for review.

The implementation now offers all the required functionality (initialization, evaluation, training, loading/storing) and should work fairly efficient. All algorithms are implemented to optionally use log-likelihoods to trade off computation time for increased numerical stability in case of long output sequences. Models can be serialized and deserialized from Json. There are three training algorithms tackling different usage scenarios (supervised, Viterbi, Baum-Welch) available.

I included a small demo-program (org.apache.mahout.classifier.sequencelearning.hmm.example.PosTagger.java) that demonstrates usage of the model class by implementing a simple PartOfSpeech-tagger using the training/test data sets from http://flexcrfs.sourceforge.net/#Case_Study. Using simple supervised learning on the training data set & assuming unknown words to be of type PNP (proper noun present), the tagger reaches a test accuracy of 94%. The program output with timings on my Athlon 2.6Ghz dual core with 4GB of RAM running Ubuntu 10.04 is:
"Using URL: http://www.jaist.ac.jp/~hieuxuan/flexcrfs/CoNLL2000-NP/train.txt
Reading and parsing training data file ... done in 22.479 seconds!
Read 211727 lines containing 8936 sentences with a total of 19122 distinct words and 43 distinct POS tags.
Training HMM model ... done in 0.719 seconds!
Reading and parsing test data file ... done in 3.189 seconds!
Read 47377 lines containing 2012 sentences.
POS tagging test data file ... done in 0.475 seconds!
Tagged the test file with an error rate of: 0.060176879076345065"

There are still some open ends that need to be looked into. The major points are:
1) Redesigning HmmAlgorithms to handle SparseMatrix/SparseVector more efficiently (via nonZeroIterator)

2) Serializing and Deserializing of a model is currently possible using JSON. However, for large models this becomes highly inefficient. E.g. serializing the model created in the PosTagging example results in an 18MB JSO file that takes much longer to deserialze than a reconstruction from the training data takes. It would thus probably be a good idea to look into binary serialization/deserialization.

3) Convergence check for Viterbi/Baum-Welch: currently the convergenc check uses an "ad-hoc" matrix difference. Since BW is an EM algorithm, it would probably be mathematically more sound to switch to using the model likelihood for the convergence check.

4) Parallelization: The traditional algorithms (forward,backward,viterbi,baum-welch) are probably hard to parallelize using M/R - there is some prior work regarding parallelization, but I haven't quite looked into this yet. A more suitable approach to HMM parallelization would be to train in parallel on multiple sequences (e.g. one per mapper) and then merge the resulting HMMs in the reducer step.

Show
Hide
Qiuyan Xu added a comment -

We've made another test over the same training data and test data with a statistic approach. For each word which also exists in the training data, we assign it with the corresponding Pos tag, that appeared most frequently. For all unknown words, we assign it with the type NNP. The precision turned out to be 91.7%.

Show
Qiuyan Xu added a comment - We've made another test over the same training data and test data with a statistic approach. For each word which also exists in the training data, we assign it with the corresponding Pos tag, that appeared most frequently. For all unknown words, we assign it with the type NNP. The precision turned out to be 91.7%.
Hide
Isabel Drost-Fromm added a comment -

Patch applies cleanly with "-p1" (was generated from a git clone of Mahout), builds, all tests green after applying it.

Some comments after taking an initial look at the code:

• Please don't use "System.out.println" anywhere in the code - use Loggers instead.
• I still see quite a few int Arrays in the example code. Could you please provide some explanation of why you did not use the o.a.m.math.list.IntArrayList as proposed by Robin?

Rest of the code looks to me otherwise - though I am to be considered highly subjective in that matter. Would be glad if a second Mahout developer had time to have a look.

Just for the record: The current implementation is sequential-only. During the past few weeks I have come accross a few publications that might be interesting for follow-up work: "Scaling the iHMM: Parallelization versus Hadoop". In addition there seem the have been works in the direction of spectral learning algorithms for HMMs that might be interesting: "Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola)" and "A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang)"

Show
Hide
Isabel Drost-Fromm added a comment -

Hmpf - please add a "good" between: "Rest of the code looks" and "to me otherwise".

Show
Isabel Drost-Fromm added a comment - Hmpf - please add a "good" between: "Rest of the code looks" and "to me otherwise".
Hide
Max Heimel added a comment -

Sorry for the long delay, I have some exams coming up, so I had to focus on learning

Here are my comments reagarding your review (I will comment on the paralleization papers, once I read through them

• I have fixed the issue with using System.out by replacing it by calls to LOG.info( .. ) in the PosTagger example.
• I have fixed some build errors within the unit tests that must have come up since you reviewed the code.
• I moved the PosTagger example to mahout-examples subproject
• Regarding the IntArrayList: There actually was a problem I encountered which made me switch to int[] - however I cannot recall it. Shame on me However, since int[] only gets used in places where the array size is guaranteed not to change, and since you can easily get an int[] from an IntArray, I actually don't see an advantage of using IntArray over int[]. If I am assessing this wrong, please let me know and I will look into replacing the int[] with IntArray objects.

I have attached the latest patch.

Show
Max Heimel added a comment - Sorry for the long delay, I have some exams coming up, so I had to focus on learning Here are my comments reagarding your review (I will comment on the paralleization papers, once I read through them I have fixed the issue with using System.out by replacing it by calls to LOG.info( .. ) in the PosTagger example. I have fixed some build errors within the unit tests that must have come up since you reviewed the code. I moved the PosTagger example to mahout-examples subproject Regarding the IntArrayList: There actually was a problem I encountered which made me switch to int[] - however I cannot recall it. Shame on me However, since int[] only gets used in places where the array size is guaranteed not to change, and since you can easily get an int[] from an IntArray, I actually don't see an advantage of using IntArray over int[]. If I am assessing this wrong, please let me know and I will look into replacing the int[] with IntArray objects. I have attached the latest patch.
Hide
Max Heimel added a comment -

Latest patch, containing fixes for the comments raised by Isabel during code review.

Show
Max Heimel added a comment - Latest patch, containing fixes for the comments raised by Isabel during code review.
Hide
Isabel Drost-Fromm added a comment -

Patch should be applied with -p1 set.

One more sweep over the patch to remove style issues: Fixed indent, a few variable names, some typos in the documentation, added some documentation where missing, removed unused imports/variables.

Max, please be so kind to double-check that in the course of code cleanup nothing broke and especially that documentation is still correct. I marked one particular method with a TODO that had an unclear comment.

Sean, as you have done so much work in recent weeks to get our code base clean - it would be really great, if you could double-check that I neither missed a problem nor introduced new ones - tried to stick with our checkstyle coding conventions, findbugs configuration and the additional checks that intellij has built-in.

Show
Isabel Drost-Fromm added a comment - Patch should be applied with -p1 set. One more sweep over the patch to remove style issues: Fixed indent, a few variable names, some typos in the documentation, added some documentation where missing, removed unused imports/variables. Max, please be so kind to double-check that in the course of code cleanup nothing broke and especially that documentation is still correct. I marked one particular method with a TODO that had an unclear comment. Sean, as you have done so much work in recent weeks to get our code base clean - it would be really great, if you could double-check that I neither missed a problem nor introduced new ones - tried to stick with our checkstyle coding conventions, findbugs configuration and the additional checks that intellij has built-in.
Hide
Ted Dunning added a comment -

Isabel,

Is this ready to check in?

Show
Ted Dunning added a comment - Isabel, Is this ready to check in?
Hide
Isabel Drost-Fromm added a comment -

Changes committed.

Show
Isabel Drost-Fromm added a comment - Changes committed.
Hide
Isabel Drost-Fromm added a comment -

Thanks to Max Heimel, Marc Hofer, Qiuyan Xu and Van Long Nguyen for contributing the patch.

Show
Isabel Drost-Fromm added a comment - Thanks to Max Heimel, Marc Hofer, Qiuyan Xu and Van Long Nguyen for contributing the patch.
Hide

Integrated in Mahout-Quality #319 (See https://hudson.apache.org/hudson/job/Mahout-Quality/319/)
MAHOUT-396 - add HMM support for sequence classification. Thanks to Max
Heimel, Marc Hofer, Qiuyan Xu and Van Long Nguyen for contributing the
patch.

Show
Hudson added a comment - Integrated in Mahout-Quality #319 (See https://hudson.apache.org/hudson/job/Mahout-Quality/319/ ) MAHOUT-396 - add HMM support for sequence classification. Thanks to Max Heimel, Marc Hofer, Qiuyan Xu and Van Long Nguyen for contributing the patch.

## People

• Assignee:
Max Heimel
Reporter:
Max Heimel