Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-652

[GSoC Proposal] Parallel Viterbi algorithm for HMM




      Proposal Title: Parallel Viterbi algorithm for HMM

      Student Name: Sergey Bartunov

      Student E-mail: sbos.net@gmail.com

      Organization/Project: Apache Mahout

      Assigned Mentor:

      Proposal Abstract:

      The Viterbi Algorithm is an evaluating algorithm for Hidden Markov Model [1]. It estimates the most likely sequence of hidden states called "Viterbi path" from given sequence of observed states. Viterbi algorithm is also used in Viterbi training which is less numerical stable than Baum-Welch algorithm but faster and can be adjusted to have near the same accuracy [2]. Hidden Markov Model is widely used for speech and gesture recognition, part-of-speech tagging, bioinformatics etc. At the moment Apache Mahout contains only sequential HMM functionality, and this project is intended to extend it by implementing Map-Reduce version of Viterbi algorithm which would make Mahout able to evaluate HMM on big amounts of data in parallel mode.

      Detailed Description:

      The Viterbi algorithms is a quite common dynamic programming algorithm, it's well described in many sources such as [3], [4], [5]. As being popular and needed, Viterbi algorithm already have parallel versions which became a subject of several studies, for example, this paper about implementation for GPU [6]. Some of these papers may contain useful ideas for map-reduce implementation.

      There are several possible strategies for parallelization (which one is
      better is an open question for the project) I discovered and posted to the mailing list.
      See this thread

      The most reasonable one is:
      Assume we have O - set of oberved state sequences (that's our input data), $O_i$ is i-th sequence with length
      len($O_i$), K is number of hidden states in our model. {$O_

      {i, t}

      $} is the observed state of i-th sequence at the moment of time t. Let the N be the maximum of len($O_i$).

      Let's write all the seqences one above other. We will get the matrix which rows are the input seqeunces. Then take the first L columns of this matrix and name them the "first chunk", then next L columns and so on. Thus we divide our input into the chunks of size L (less or equal to L). L is the parameter of the algorithm. So the n-th chunk contain subsequences with $t$ in the iterval [L*n, L*(n+1)).

      The key idea is to process each chunk by standard sequential Viterbi algorithm (the existing Mahout code could be reused here) in parallel mode one by one using results of previous chunk processing.

      It will require about O((M/P) * K^2) time where M is sum of all sequence lengths if most of input sequences have near the same length (the good case). P here is the number of "cores" / computational nodes. This time complexity means that such strategy scales well.

      Here is Map-Reduce scheme for the good case:

      • Map emits (subsequence, probabilites, paths) vector for each subsequence in the chunk, where probabilities is the initial probabilites in case of the first chunk and optimal-path probabilities in all other cases, paths means optimal paths to the first observations of each subsequence.
      • Reduce function just performs sequential Viterbi algorithm on these data and returns (probabilities, paths).
      • Driver class runs the Map then Reduce for the first chunk, then Map and Reduce for the second, and so on. Then provide the results (probably using another map-reduce combination).

      Probably, it's possible to use lazy viterbi [7,8] instead of the standard Viterbi for chunk processing.

      Let's focus now on all other cases when first T chunks contain the subsequences with near the same length and all other N-T chunks do not. Well, then they could be processed using the next technique (which is strategy number 2 in the list):

      At first, denote $V_

      {t,k}$ as the probabilty that optimal hidden state path goes through hidden state $k$ at the moment $t$. We need such probabilities to select the most likely (or optimal) path so we need to compute $V_{t,k}

      $ for every $t$ and $k$ (that is the part of standard Viterbi). I omit here the $i$ index of the sequence, because each seqence is the separate task.

      1) process each chunk separately. $t$ lie in the [L*n, L*(n+1)) interval, where n is the number of chunk and to compute Actually to do this, we need to compute $V_


      $ we need to know $V_


      $ for each $k1$. But this value belong to the chunk the previous number (n-1) which is processed separately. We need to obtain the max and arg-max of $V_

      {L*n-1, k1}$ for each k1.
      2) Since we don't know the max's and arg-max's let's just assume that optimal path goes through the k1=1 then through k2=2 and so on up to the K, and $V_{L*n-1, k1}

      $ = 1 (or 0 if we use log(prob) adding instead of prob multiplying) and argmax is k1=1, 2 ... K. Later we may multiply the $V_

      {L*(n+1)-1, k1}

      $ by the precise probability max{ $V_

      {L*n-1, k1}

      $ } (or add the log of it). Thus we get K^2 possibly optimal paths instead of K.
      3) After we process all the chunks we start "connecting" phase.

      The first of these N-T chunks could be processed by using usual Viterbi processing since we know the results for the previous "good" chunk. All others need to be connected each to other starting from connecting second this first then 3th with 4th and so on. During connection process we throw away all non-optimal paths (know we know which are optimal and which are not) and leave just K possibly optimal paths.

      To handle such difficult cases the Driver class should know the sequence lengths to recognize which Map-Reduce combination it should use. The resulting time complexity for difficult case is
      O((T*L / N) / P) * K^2 + (((T-N) * L / N)) / P * K^3). The first term of this sum is time for computing the first T chunks using simple case approach, and the second term is time for computing the rest using difficult case approach. When T -> N overall time tend to O((M/P) * K^2). T * L / N is about total length of subsequences in the first T chunks and (((T-N) * L / N)) is total length of the rest.

      The only problem not discussed here is how to deal with obtained most likely paths. Path length is equal to length of a sequence, so it's nood a good idea to emit optimal path in Map stage. They could be stored separately in the files or HBase since they are required only in getting the results to the user.

      This implementation should work (or be compatible) with org.apache.mahout.classifier.sequencelearning.hmm.HmmModel

      As you see, Viterbi algorithm use very common dynamic approach. It computes the next computation layer by combining values the previous, that's all we need to know how to implement it in the Map-Reduce and it shares the mentioned difficulties with all other such dynamic algorithms, so the implentation may be highly resuable.

      1. 1 week. Discover Mahout internals, sequential HMM code and code best practices. Think on problem with path storing. I'll work on this stage actually before GSoC starts but I leave here 1 week just to be sure.
      2. 5 days. Write the chunk division routines.
      3. 5 days. Write the SimpleMapper class for simple good case of data.
      4. 1 week. Write the SimpleReducer class for simple good case.
      5. 1 week. Write the Driver class for simple good case which will use SimpleMapper and SimpleReducer.
      6. 1 week. Collect debug dataset, write tests for the code and find possible problems, fix them.
      7. 10 days. Write the HardReducer class for difficult bad case and rewrite Driver class to handle such cases. Perform tests, ensure that everything works.
      7. 10 days. Try to get a large dataset to test performance, scalability etc, write and perform such a test (here a community help might be needed).
      If there are any problems discuss and analyze them, then fix.
      8. 1 week. Try to use some optimized Viterbi implentation (i.e. Lazy Viterbi) in chunk processing phase, measure speed improvement especially for difficult cases.
      9. 1 week. Write documentation for all classes, description of algorithm in
      wiki and complete example of usage.

      Additional Information:

      The project relates to this proposal which is about implementing parallel Baum-Welch traning algorithm for HMM.

      My motivation for this project is that I need such a functionality as a regular Mahout user since I face HMM often in my research work. Also I'm very interested in what kinds or classes of algorithms could be efficiently implemented in Map-Reduce, I also think that this is important question for the community and such an experience is significant.

      The important detail is that I have exams in the university until 16th June. After that I will be ready to work on the project.


      [1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
      [2] Adjusted Viterbi Training. A proof of concept.
      [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=
      [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
      [5] http://www.kanungo.com/software/hmmtut.pdf
      [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
      [7] http://www.scribd.com/doc/48568001/Lazy-Viterbi-Slides
      [8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367


        1. GSoC performance.csv
          0.2 kB
          Sergey Bartunov
        2. parallel-viterbi.patch
          101 kB
          Sergey Bartunov



            isabel Isabel Drost-Fromm
            sbos Sergey Bartunov
            0 Vote for this issue
            2 Start watching this issue