Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Won't Fix

Affects Version/s: 0.5

Fix Version/s: None

Component/s: None
Description
Proposal Title: Parallel Viterbi algorithm for HMM
Student Name: Sergey Bartunov
Student Email: sbos.net@gmail.com
Organization/Project: Apache Mahout
Assigned Mentor:
Proposal Abstract:
The Viterbi Algorithm is an evaluating algorithm for Hidden Markov Model [1]. It estimates the most likely sequence of hidden states called "Viterbi path" from given sequence of observed states. Viterbi algorithm is also used in Viterbi training which is less numerical stable than BaumWelch algorithm but faster and can be adjusted to have near the same accuracy [2]. Hidden Markov Model is widely used for speech and gesture recognition, partofspeech tagging, bioinformatics etc. At the moment Apache Mahout contains only sequential HMM functionality, and this project is intended to extend it by implementing MapReduce version of Viterbi algorithm which would make Mahout able to evaluate HMM on big amounts of data in parallel mode.
Detailed Description:
The Viterbi algorithms is a quite common dynamic programming algorithm, it's well described in many sources such as [3], [4], [5]. As being popular and needed, Viterbi algorithm already have parallel versions which became a subject of several studies, for example, this paper about implementation for GPU [6]. Some of these papers may contain useful ideas for mapreduce implementation.
There are several possible strategies for parallelization (which one is
better is an open question for the project) I discovered and posted to the mailing list.
See this thread
The most reasonable one is:
Assume we have O  set of oberved state sequences (that's our input data), $O_i$ is ith sequence with length
len($O_i$), K is number of hidden states in our model. {$O_
$} is the observed state of ith sequence at the moment of time t. Let the N be the maximum of len($O_i$).
Let's write all the seqences one above other. We will get the matrix which rows are the input seqeunces. Then take the first L columns of this matrix and name them the "first chunk", then next L columns and so on. Thus we divide our input into the chunks of size L (less or equal to L). L is the parameter of the algorithm. So the nth chunk contain subsequences with $t$ in the iterval [L*n, L*(n+1)).
The key idea is to process each chunk by standard sequential Viterbi algorithm (the existing Mahout code could be reused here) in parallel mode one by one using results of previous chunk processing.
It will require about O((M/P) * K^2) time where M is sum of all sequence lengths if most of input sequences have near the same length (the good case). P here is the number of "cores" / computational nodes. This time complexity means that such strategy scales well.
Here is MapReduce scheme for the good case:
 Map emits (subsequence, probabilites, paths) vector for each subsequence in the chunk, where probabilities is the initial probabilites in case of the first chunk and optimalpath probabilities in all other cases, paths means optimal paths to the first observations of each subsequence.
 Reduce function just performs sequential Viterbi algorithm on these data and returns (probabilities, paths).
 Driver class runs the Map then Reduce for the first chunk, then Map and Reduce for the second, and so on. Then provide the results (probably using another mapreduce combination).
Probably, it's possible to use lazy viterbi [7,8] instead of the standard Viterbi for chunk processing.
Let's focus now on all other cases when first T chunks contain the subsequences with near the same length and all other NT chunks do not. Well, then they could be processed using the next technique (which is strategy number 2 in the list):
At first, denote $V_
{t,k}$ as the probabilty that optimal hidden state path goes through hidden state $k$ at the moment $t$. We need such probabilities to select the most likely (or optimal) path so we need to compute $V_{t,k}$ for every $t$ and $k$ (that is the part of standard Viterbi). I omit here the $i$ index of the sequence, because each seqence is the separate task.
1) process each chunk separately. $t$ lie in the [L*n, L*(n+1)) interval, where n is the number of chunk and to compute Actually to do this, we need to compute $V_
{t,k}$ we need to know $V_
{L*n1,k1}$ for each $k1$. But this value belong to the chunk the previous number (n1) which is processed separately. We need to obtain the max and argmax of $V_
{L*n1, k1}$ for each k1.2) Since we don't know the max's and argmax's let's just assume that optimal path goes through the k1=1 then through k2=2 and so on up to the K, and $V_{L*n1, k1}
$ = 1 (or 0 if we use log(prob) adding instead of prob multiplying) and argmax is k1=1, 2 ... K. Later we may multiply the $V_
{L*(n+1)1, k1}$ by the precise probability max{ $V_
{L*n1, k1}$ } (or add the log of it). Thus we get K^2 possibly optimal paths instead of K.
3) After we process all the chunks we start "connecting" phase.
The first of these NT chunks could be processed by using usual Viterbi processing since we know the results for the previous "good" chunk. All others need to be connected each to other starting from connecting second this first then 3th with 4th and so on. During connection process we throw away all nonoptimal paths (know we know which are optimal and which are not) and leave just K possibly optimal paths.
To handle such difficult cases the Driver class should know the sequence lengths to recognize which MapReduce combination it should use. The resulting time complexity for difficult case is
O((T*L / N) / P) * K^2 + (((TN) * L / N)) / P * K^3). The first term of this sum is time for computing the first T chunks using simple case approach, and the second term is time for computing the rest using difficult case approach. When T > N overall time tend to O((M/P) * K^2). T * L / N is about total length of subsequences in the first T chunks and (((TN) * L / N)) is total length of the rest.
The only problem not discussed here is how to deal with obtained most likely paths. Path length is equal to length of a sequence, so it's nood a good idea to emit optimal path in Map stage. They could be stored separately in the files or HBase since they are required only in getting the results to the user.
This implementation should work (or be compatible) with org.apache.mahout.classifier.sequencelearning.hmm.HmmModel
As you see, Viterbi algorithm use very common dynamic approach. It computes the next computation layer by combining values the previous, that's all we need to know how to implement it in the MapReduce and it shares the mentioned difficulties with all other such dynamic algorithms, so the implentation may be highly resuable.
Timeline:
1. 1 week. Discover Mahout internals, sequential HMM code and code best practices. Think on problem with path storing. I'll work on this stage actually before GSoC starts but I leave here 1 week just to be sure.
2. 5 days. Write the chunk division routines.
3. 5 days. Write the SimpleMapper class for simple good case of data.
4. 1 week. Write the SimpleReducer class for simple good case.
5. 1 week. Write the Driver class for simple good case which will use SimpleMapper and SimpleReducer.
6. 1 week. Collect debug dataset, write tests for the code and find possible problems, fix them.
7. 10 days. Write the HardReducer class for difficult bad case and rewrite Driver class to handle such cases. Perform tests, ensure that everything works.
7. 10 days. Try to get a large dataset to test performance, scalability etc, write and perform such a test (here a community help might be needed).
If there are any problems discuss and analyze them, then fix.
8. 1 week. Try to use some optimized Viterbi implentation (i.e. Lazy Viterbi) in chunk processing phase, measure speed improvement especially for difficult cases.
9. 1 week. Write documentation for all classes, description of algorithm in
wiki and complete example of usage.
Additional Information:
The project relates to this proposal which is about implementing parallel BaumWelch traning algorithm for HMM.
My motivation for this project is that I need such a functionality as a regular Mahout user since I face HMM often in my research work. Also I'm very interested in what kinds or classes of algorithms could be efficiently implemented in MapReduce, I also think that this is important question for the community and such an experience is significant.
The important detail is that I have exams in the university until 16th June. After that I will be ready to work on the project.
References:
[1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257286. doi:10.1109/5.18626.
[2] Adjusted Viterbi Training. A proof of concept.
[3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
[4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
[5] http://www.kanungo.com/software/hmmtut.pdf
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
[7] http://www.scribd.com/doc/48568001/LazyViterbiSlides
[8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Sorry for the kind of delay, I was sick and could not work on the project. Now I'm catching up on the lost time heatedly. Here is the repository with changes I made while working on the project: https://github.com/sbos/mahout
Here's the notes on what I'd done:
1) during this time I had two exams relevant to the project  machine learning and optimization methods. While preparing for them I thought about my project and found one important mistake  the O((T*L / N) / P) * K^2 + (((TN) * L / N)) / P * K^3) estimation is absolutely wrong!! It means, that if we perform "difficult processing" on chunks of total length N using P reducers, we will get O (N * K^3 / P), so the algorithm scales linearly. And if we set P=2N for example, we should get subN time. Actually, it's not true. No algorithm could process the entire sequence without just doing something on each element. So the time estimation for any such algorithm is max(O(N), O(...)). That means, that "difficult" case implementation is not reasonable and will just occupy the machines.
2) at the same time I'd found this paper http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.8361 about paralleling Belief Propagation. It's a more general case analog of Viterbi for loopless Markov Random Fields. Here I found the strict proof of minimum running time and some algorithm for approximate segmentation called Residual Splash which scales linearly and allows to tradeoff between speed and accuracy. I think it will be very cool thing if there will be way to implement it in mapreduce instead of "difficult case". I'll think on it in background.
3) I'd implement a chunk processing utility (PrepareChunk class) in a way other Mahout tools implemented. It takes a set of input files/directories, splits them and writes into SequenceFiles, one for the each chunk processing phase.
4) I tried to initiate a communication with Dhruv to make our projects compatible at interface level and to share some experience in HMM, I hope it will be ok.
Here's the detailed chunk processing scheme now:
1) inputs are split into chunks as [(input_name, chunk_number), observations].
2) for each chunk number map the chunk data and the initial probability data which is formatted as [(input_name, chunk_number), probabilities]. Both observations and probabilities are instances of GenericViterbiData class to make them able to appear in one reduce flow.
3) reducer receives the (input_name, chunk_number), [observations, probabilities] and performs the forward algorithm on the observations with provided initial probabilities (which are taken the previous chunks) and emits (input_name, chunk_number+1), resulting probabilities. This output is saved in the input directory of the next chunk input data. Also it stores the optimal path matrix in the sequence files with the same keys somewhere.
4) after the last chunk was processed by forward algorithm, we may perform the backward algorithm by the same manner collecting the optimal path.
This scheme is much more effective than naive approach I described in the proposal. It reduces I/O and makes the algorithm be able to handle amounts of data that do not fit into memory. The only requirement is that chunk of desired size must fit.