Mahout
  1. Mahout
  2. MAHOUT-396

Proposal for Implementing Hidden Markov Model

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: None
    • Labels:
      None

      Description

      Overview

      This is a project proposal for a summer-term university project to write a (sequential) HMM implementation for Mahout. Five students will work on this project as part of a course mentored by Isabel Drost.

      Abstract:

      Hidden Markov Models are used in multiple areas of Machine Learning, such as speech recognition, handwritten letter recognition or natural language processing. A Hidden Markov Model (HMM) is a statistical model of a process consisting of two (in our case discrete) random variables O and Y, which change their state sequentially. The variable Y with states

      {y_1, ... , y_n}

      is called the "hidden variable", since its state is not directly observable. The state of Y changes sequentially with a so called - in our case first-order - Markov Property. This means, that the state change probability of Y only depends on its current state and does not change in time. Formally we write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). The variable O with states

      {o_1, ... , o_m}

      is called the "observable variable", since its state can be directly observed. O does not have a Markov Property, but its state propability depends statically on the current state of Y.

      Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A[i,j]=P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "observation matrix" containing the observation probabilties such that B[i,j]= P(O=o_i|Y=y_j).

      Rabiner [1] defined three main problems for HMM models:

      1. Evaluation: Given a sequence O of observations and a model M, what is the probability P(O|M) that sequence O was generated by model M. The Evaluation problem can be efficiently solved using the Forward algorithm
      2. Decoding: Given a sequence O of observations and a model M, what is the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to generate this sequence. The Decoding problem can be efficiently sovled using the Viterbi algorithm.
      3. Learning: Given a sequence O of observations, what is the most likely model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem can be efficiently solved using the Baum-Welch algorithm.

      The target of each milestone is defined as the implementation for the given goals, the respective documentation and unit tests for the implementation.

      Timeline

      Mid of May 2010 - Mid of July 2010

      Milestones

      I) Define an HMM class based on Apache Mahout Math package offering interfaces to set model parameters, perform consistency checks, perform output prediction.
      1 week from May 18th till May 25th.

      II) Write sequential implementations of forward (cf. problem 1 [1]) and backward algorithm.
      2 weeks from May 25th till June 8th.

      III) Write a sequential implementation of Viterbi algorithm (cf. problem 2 [1]), based on existing forward algorithm implementation.
      2 weeks from June 8th till June 22nd

      IV) Have a running sequential implementation of Baum-Welch algorithm for model parameter learning (application II [ref]), based on existing forward/backward algorithm implementation.
      2 weeks from June 8th till June 22nd

      V) Provide a usage example of HMM implementation, demonstrating all three problems.
      2 weeks from June 22nd till July 6th

      VI) Finalize documentation and implemenation, clean up open ends.
      1 week from July 6th till July 13th

      References:

      [1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.

      Potential test data sets:
      http://www.cnts.ua.ac.be/conll2000/chunking/
      http://archive.ics.uci.edu/ml/datasets/Character+Trajectories

      1. MAHOUT-396.patch
        108 kB
        Max Heimel
      2. MAHOUT-396.diff
        22 kB
        Max Heimel
      3. MAHOUT-396.diff
        28 kB
        Max Heimel
      4. MAHOUT-396.diff
        43 kB
        Max Heimel
      5. MAHOUT-396.diff
        46 kB
        Max Heimel
      6. MAHOUT-396.diff
        106 kB
        Max Heimel
      7. MAHOUT-396.diff
        112 kB
        Isabel Drost-Fromm

        Activity

          People

          • Assignee:
            Max Heimel
            Reporter:
            Max Heimel
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development