[GSoC Proposal] Parallel Viterbi algorithm for HMM

Details

• Type: New Feature
• Status: Closed
• Priority: Major
• Resolution: Won't Fix
• Affects Version/s: 0.5
• Fix Version/s: None
• Component/s: None
• Labels:

Description

Proposal Title: Parallel Viterbi algorithm for HMM

Student Name: Sergey Bartunov

Student E-mail: sbos.net@gmail.com

Organization/Project: Apache Mahout

Assigned Mentor:

Proposal Abstract:

The Viterbi algorithm is a decoding algorithm for Hidden Markov Models [1]. It estimates the most likely sequence of hidden states, called the "Viterbi path", from a given sequence of observed states. The Viterbi algorithm is also used in Viterbi training, which is less numerically stable than the Baum-Welch algorithm but faster, and can be adjusted to reach nearly the same accuracy [2]. Hidden Markov Models are widely used for speech and gesture recognition, part-of-speech tagging, bioinformatics, etc. At the moment Apache Mahout contains only sequential HMM functionality, and this project is intended to extend it by implementing a Map-Reduce version of the Viterbi algorithm, which would make Mahout able to evaluate HMMs on large amounts of data in parallel.

Detailed Description:

The Viterbi algorithm is a fairly common dynamic programming algorithm; it is well described in many sources such as [3], [4], [5]. Being popular and widely needed, the Viterbi algorithm already has parallel versions, which have been the subject of several studies, for example a paper about a GPU implementation [6]. Some of these papers may contain useful ideas for a map-reduce implementation.

There are several possible strategies for parallelization (which one is better is an open question for the project) that I considered and posted to the mailing list.

The most reasonable one is:
Assume we have O, the set of observed state sequences (our input data); $O_i$ is the i-th sequence, with length len($O_i$), and K is the number of hidden states in our model. $O_{i,t}$ is the observed state of the i-th sequence at time t. Let N be the maximum of len($O_i$).

Let's write all the sequences one above the other. We get a matrix whose rows are the input sequences. Then take the first L columns of this matrix and call them the "first chunk", then the next L columns, and so on. Thus we divide our input into chunks of width at most L (the last chunk may be narrower). L is a parameter of the algorithm. So the n-th chunk contains the subsequences with $t$ in the interval [L*n, L*(n+1)).
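The chunk division described above can be sketched in a few lines (a Python sketch for illustration only; the function name and the toy sequences are made up, not part of the actual patch):

```python
def split_into_chunks(sequences, L):
    """Split each observed-state sequence into column chunks of width L.

    Returns a list of chunks; chunk n holds, for every sequence that is
    long enough, the slice of observations with t in [L*n, L*(n+1)).
    """
    max_len = max(len(seq) for seq in sequences)
    num_chunks = (max_len + L - 1) // L  # ceil(max_len / L)
    chunks = []
    for n in range(num_chunks):
        chunk = {i: seq[L * n : L * (n + 1)]
                 for i, seq in enumerate(sequences)
                 if len(seq) > L * n}
        chunks.append(chunk)
    return chunks

# Example: three sequences of unequal length, chunk width L = 3.
# Shorter sequences simply drop out of later chunks.
chunks = split_into_chunks([[0, 1, 0, 1, 1], [1, 1], [0, 0, 1, 0, 1, 1, 0]], 3)
```

Note that sequences of unequal length leave "ragged" later chunks, which is exactly what motivates the good/difficult case distinction below.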

The key idea is to process the chunks one by one, in order, handling each chunk's subsequences in parallel with the standard sequential Viterbi algorithm (the existing Mahout code could be reused here), using the results of the previous chunk's processing.

This takes about O((M/P) * K^2) time, where M is the sum of all sequence lengths and P is the number of "cores" / computational nodes, provided most input sequences have nearly the same length (the good case). This time complexity means that the strategy scales well.

Here is Map-Reduce scheme for the good case:

• Map emits a (subsequence, probabilities, paths) vector for each subsequence in the chunk, where the probabilities are the initial probabilities for the first chunk and the optimal-path probabilities in all other cases, and the paths are the optimal paths to the first observations of each subsequence.
• Reduce simply performs the sequential Viterbi algorithm on these data and returns (probabilities, paths).
• The Driver class runs Map then Reduce for the first chunk, then Map and Reduce for the second, and so on, and finally delivers the results (probably using another map-reduce combination).
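A minimal single-process sketch of this scheme (Python rather than Mahout's Java, with a made-up toy model; the reduce step corresponds to `viterbi_chunk` and the loop at the bottom plays the role of the Driver class):

```python
import math

def viterbi_chunk(obs, log_pi, paths, log_A, log_B):
    """Sequential Viterbi over one chunk (the Reduce step of the scheme).

    log_pi[k] -- best log-probability of a path ending in hidden state k
                 just before this chunk (from the previous chunk's output)
    paths[k]  -- the corresponding optimal path prefix
    """
    K = len(log_A)
    for o in obs:
        new_pi, new_paths = [], []
        for j in range(K):
            # best predecessor state for hidden state j at this step
            i = max(range(K), key=lambda i: log_pi[i] + log_A[i][j])
            new_pi.append(log_pi[i] + log_A[i][j] + log_B[j][o])
            new_paths.append(paths[i] + [j])
        log_pi, paths = new_pi, new_paths
    return log_pi, paths

# Toy 2-state model (hypothetical numbers), log space to avoid underflow.
log = math.log
A = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]   # transition matrix
B = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]   # emission matrix
obs = [0, 0, 1, 1, 0, 0]

# Seed with the initial probabilities and the first observation.
log_pi = [log(0.5) + B[k][obs[0]] for k in range(2)]
paths = [[k] for k in range(2)]

# "Driver": run the chunks in order (L = 2), feeding each chunk the
# probabilities and paths produced by the previous one.
rest = obs[1:]
for n in range(0, len(rest), 2):
    log_pi, paths = viterbi_chunk(rest[n:n + 2], log_pi, paths, A, B)

best = paths[max(range(2), key=lambda k: log_pi[k])]
# best is the Viterbi path; chunking does not change the result.
```

Running the same observations in one big chunk gives the identical path, which is the property the chunked scheme relies on.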

It may be possible to use the Lazy Viterbi algorithm [7, 8] instead of standard Viterbi for chunk processing.

Let's now consider all the other cases, where the first T chunks contain subsequences of nearly the same length and the remaining N-T chunks do not. These can be processed using the following technique (strategy number 2 in the list):

First, denote by $V_{t,k}$ the probability that the optimal hidden state path goes through hidden state $k$ at time $t$. We need these probabilities to select the most likely (optimal) path, so we must compute $V_{t,k}$ for every $t$ and $k$ (that is part of standard Viterbi). I omit the sequence index $i$ here, because each sequence is a separate task.

1) Process each chunk separately. Within chunk n, $t$ lies in the interval [L*n, L*(n+1)). To compute $V_{t,k}$ we need to know $V_{L*n-1,k_1}$ for each $k_1$, but these values belong to the previous chunk (number n-1), which is processed separately. We would need the max and arg-max of $V_{L*n-1,k_1}$ over $k_1$.
2) Since we don't know these maxima and arg-maxima, let's just assume in turn that the optimal path goes through $k_1 = 1$, then $k_1 = 2$, and so on up to K, setting $V_{L*n-1,k_1}$ = 1 (or 0 if we add log-probabilities instead of multiplying probabilities). Later we can multiply $V_{L*(n+1)-1,k}$ by the precise probability max over $V_{L*n-1,k_1}$ (or add its log). Thus we get K^2 possibly optimal paths instead of K.
3) After we have processed all the chunks, we start the "connecting" phase.

The first of these N-T chunks can be processed by the usual Viterbi procedure, since we know the results for the previous "good" chunk. All the others have to be connected to each other, starting by connecting the second to the first, then the third, then the fourth, and so on. During the connection process we throw away all non-optimal paths (now we know which are optimal and which are not) and keep just K possibly optimal paths.
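A toy sketch of this difficult-case technique (Python, illustrative names and model, not the patch's Java; `chunk_candidates` implements the K^2-candidate trick from step 2 and `connect` the connecting phase from step 3):

```python
import math

def viterbi_step(o, log_pi, paths, log_A, log_B):
    """One standard Viterbi step (shared helper for this sketch)."""
    K = len(log_A)
    new_pi, new_paths = [], []
    for j in range(K):
        i = max(range(K), key=lambda i: log_pi[i] + log_A[i][j])
        new_pi.append(log_pi[i] + log_A[i][j] + log_B[j][o])
        new_paths.append(paths[i] + [j])
    return new_pi, new_paths

def chunk_candidates(obs, log_A, log_B):
    """Process a chunk with unknown incoming probabilities.

    For each assumed previous state k1, set V_{start-1,k1} = 0 in log space
    (i.e. probability 1) and run Viterbi over the chunk, keeping K candidate
    endings: K*K possibly optimal paths overall, as in step 2 above.
    Returns cand[k1] = (log_pi, paths) for each assumed start k1.
    """
    K = len(log_A)
    cand = []
    for k1 in range(K):
        log_pi = [log_A[k1][j] + log_B[j][obs[0]] for j in range(K)]
        paths = [[j] for j in range(K)]
        for o in obs[1:]:
            log_pi, paths = viterbi_step(o, log_pi, paths, log_A, log_B)
        cand.append((log_pi, paths))
    return cand

def connect(prev_logp, prev_paths, cand):
    """Connecting phase: fold in the precise probabilities of the previous
    chunk's endings, discarding non-optimal candidates (K paths remain)."""
    K = len(prev_logp)
    out_logp, out_paths = [], []
    for k in range(K):
        k1 = max(range(K), key=lambda i: prev_logp[i] + cand[i][0][k])
        out_logp.append(prev_logp[k1] + cand[k1][0][k])
        out_paths.append(prev_paths[k1] + cand[k1][1][k])
    return out_logp, out_paths

# Toy 2-state model; split obs into two chunks.
log = math.log
A = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
B = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
obs = [0, 0, 1, 1, 0, 0]

# Chunk 0 is processed normally (its initial probabilities are known).
log_pi = [log(0.5) + B[k][obs[0]] for k in range(2)]
paths = [[k] for k in range(2)]
for o in obs[1:3]:
    log_pi, paths = viterbi_step(o, log_pi, paths, A, B)

# Chunk 1 is processed with the K^2-candidate trick, then connected.
cand = chunk_candidates(obs[3:], A, B)
log_pi, paths = connect(log_pi, paths, cand)
best = paths[max(range(2), key=lambda k: log_pi[k])]
```

The connected result matches what a single end-to-end Viterbi run would produce on the same toy model.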

To handle such difficult cases the Driver class should know the sequence lengths, to recognize which Map-Reduce combination it should use. The resulting time complexity for the difficult case is O((T*L/N)/P * K^2 + ((N-T)*L/N)/P * K^3). The first term of this sum is the time for computing the first T chunks using the simple-case approach, and the second term is the time for computing the rest using the difficult-case approach. As T -> N the overall time tends to O((M/P) * K^2). T*L/N is roughly the total length of the subsequences in the first T chunks, and (N-T)*L/N is the total length of the rest.

The only problem not discussed here is how to deal with the obtained most likely paths. A path's length equals the length of its sequence, so it's not a good idea to emit the optimal path in the Map stage. The paths could be stored separately in files or in HBase, since they are required only when returning the results to the user.

This implementation should work (or be compatible) with org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.

As you can see, the Viterbi algorithm uses a very common dynamic programming approach: it computes each layer by combining values from the previous one, which is all we need to know to implement it in Map-Reduce. It also shares the mentioned difficulties with all other such dynamic programming algorithms, so the implementation may be highly reusable.

Timeline:
1. 1 week. Explore Mahout internals, the sequential HMM code, and coding best practices. Think about the path-storage problem. I'll actually start on this stage before GSoC begins, but I leave one week here just to be sure.
2. 5 days. Write the chunk division routines.
3. 5 days. Write the SimpleMapper class for the simple good case of data.
4. 1 week. Write the SimpleReducer class for the simple good case.
5. 1 week. Write the Driver class for the simple good case, using SimpleMapper and SimpleReducer.
6. 1 week. Collect a debug dataset, write tests for the code, find possible problems, and fix them.
7. 10 days. Write the HardReducer class for the difficult bad case and rewrite the Driver class to handle such cases. Perform tests, ensure that everything works.
8. 10 days. Try to get a large dataset to test performance, scalability, etc.; write and perform such a test (community help might be needed here). If there are any problems, discuss and analyze them, then fix.
9. 1 week. Try to use an optimized Viterbi implementation (e.g. Lazy Viterbi) in the chunk processing phase and measure the speed improvement, especially for difficult cases.
10. 1 week. Write documentation for all classes, a description of the algorithm in the wiki, and a complete usage example.

This project is related to the proposal about implementing a parallel Baum-Welch training algorithm for HMMs.

My motivation for this project is that I need this functionality as a regular Mahout user, since I often encounter HMMs in my research work. I am also very interested in which kinds or classes of algorithms can be implemented efficiently in Map-Reduce; I think this is an important question for the community, and such experience is valuable.

One important detail is that I have university exams until June 16th. After that I will be ready to work on the project.

References:

Attachments

1. GSoC performance.csv
0.2 kB
Sergey Bartunov
2. parallel-viterbi.patch
101 kB
Sergey Bartunov

Activity

Sergey Bartunov added a comment -

Hi! Sorry for such a late reply; for some reason I missed the notification email.
As I mentioned in my comment above, I've done everything related to Isabel's notes and was waiting for her further instructions. So I'm not sure why this is in the backlog (probably it should be closed instead, see below).
From the standpoint of my current knowledge and experience I have to conclude that this work was a bit crazy, and rather experimentally confirmed that (at least exact) inference in sequence-like graphical models is not efficient in MapReduce (at least with a Viterbi-like approach), for many reasons. I had included these doubts here in the issue and in my reports to Isabel (and you may look at the scalability test results attached above).
So as to whether there is some work left or not: in my opinion I should give it up.
On the other hand, there were still many positive aspects to this project, at least for me, because I learned many things; but also a bit for the community, since I told all of my colleagues and friends about Mahout, and one of my ex-employers even adopted Mahout's SVD in his project because of my advertising. Probably that is another important part of GSoC.
Still, my patch contains one really useful thing, the Online Viterbi algorithm, and I think it's worth committing. I could create another issue for this patch if anyone thinks that's reasonable.

Sean Owen added a comment -

I guess this timed out? I don't know what ever happened to this, but it seems not to have progressed in 8 months.

Sergey Bartunov added a comment - - edited

Here are the wiki pages with documentation: https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Viterbi and https://cwiki.apache.org/confluence/display/MAHOUT/Online+Viterbi
Is there anything else I should complete for this patch?

Isabel Drost-Fromm added a comment -

Though the patch is already in quite good shape, especially after the latest modifications, I'm moving this to Backlog in an effort to remove issues the next release depends on.

Sergey Bartunov added a comment -

Thanks for the notes, Isabel. I've addressed most of them, but I'd like to discuss some.

ObservedSequenceWritable: I don't want to add a toString method, because I don't have a clear understanding of what it should print. By default it could just visualize the content of the underlying sequence, but if the sequence is very long, it's not a good idea to construct a string of about the same size. I just don't like such naive toStrings, but if you think it would be ok, it's not a problem for me. After the first N elements of the sequence it could just print "..." and the total size, but that looks like debug information rather than a string representation.

ForwardViterbiReducer: well, these functions are wrappers for the sequential Viterbi API, which I found not so clear and nice. I think the code looks much clearer if a programmer can just read "ok, here he's getting the initial probabilities". I also implemented my own "Forward" algorithm, because the sequential API obviously stores all the probabilities unnecessarily (we actually need only the last ones). But the tests for the sequential code depend on these probabilities, so I left the sequential code unchanged and wrote my own memory-optimized procedure.

HmmOnlineViterbi: Yes, it's possible to move the internal classes outside the main class, but they are needed only inside the HmmOnlineViterbi class. Should they be available anywhere else?

Isabel Drost-Fromm added a comment -

> I did not find any integration of your work into our mahout script - would be really nice to have.

Found it just now - sorry for the noise...

Isabel Drost-Fromm added a comment -

Just for the record: Tests pass successfully as well.

Isabel Drost-Fromm added a comment -

Would be great if you could add this information to our wiki. I think the HMM page https://cwiki.apache.org/MAHOUT/hidden-markov-models.html would be the best place to put this - maybe also add some information on when to prefer the parallel implementation over the sequential one (rough rule of thumb sufficient).

Patch still applies cleanly, compiles, tests still running.

Overall a very useful patch that is in good shape already. More detailed comments below.

Some comments that need a bit more work to further improve your patch:

• I did not find any integration of your work into our mahout script - would be really nice to have.
• I only found an integration test. When I think about having to change your code at any point in the future, it would be great if you could also provide more fine-grained tests that check smaller units of work. Did you consider using MRUnit?

• For most of your classes I'd prefer some more documentation both on the package-, class- and method level. Not sure if people not familiar with your code can easily understand what's going on.
• Please provide a list of all configuration options you use - either in javaDoc or even in the above wiki page.
• In cases where you catch an exception (e.g. PrepareChunks) in a map/reduce job it might make sense to at least increment a counter for each exception that is swallowed.
• You use log.info quite often, even within loops. As a general rule of thumb, to keep the amount of logging to the minimum needed to spot problems, I prefer to see an info statement at most once per method.
• Please consider re-using the default options provided by Mahout e.g. for input and output paths (IIRC org.apache.mahout.common.commandline)

PrepareChunks:

• Please wrap anything between opening readers/writers etc. in a try{}finally{} block and close the streams in the finally block. Using org.apache.commons.io.IOUtils.closeQuietly might prove useful.
• Don't add code to the patch that is commented out.

ViterbiDataWritable

• There is an array of type Class named classes - either give it a descriptive name or at least document the general strategy (or reference the upstream writable strategy that you adopt)

ObservedSequenceWritable

• NullArgumentException - I'd prefer using guava's Precondition concept here for consistency with existing code.
• Maybe take a look at the facilities provided by guava that make it easier to write hashCode/equals and friends?

ForwardViterbiReducer

• I do not really understand why it has so many static functions?

ViterbiEvaluator

• I'd rather not see .* imports

HmmOnlineViterbi

• You are referencing a thesis in the java doc comment - please complete the reference (e.g. title and year are missing)
• The class seems pretty long - maybe factor tree and node into their own separate classes?

OnlineViterbiReducer:

• I did not find more detailed documentation on the expected input/output key/values - might help others to re-use
Sergey Bartunov added a comment -

I sent the instructions for running my code to Isabel some time ago, but probably someone else would like to run it, so here they are:

1) You need an HmmModel file serialized using LossyHmmSerializer (I'll attach some to this page) and several text files with sequences of observed variables containing numbers from 0 to K-1 (if the model has K output states), delimited by spaces. You may generate these files using the "bin/mahout hmmpredict" command. You may also obtain serialized model files by training a model using "bin/mahout baumwelch".
2) After that you have to divide the data into chunks by running "bin/mahout hmmchunks". This command is also used at the end for writing the decoded data to text files. The input and output format for the map-reduce code is based on sequence files.
3) Run "bin/mahout pviterbi" for parallel Viterbi, or "bin/mahout poviterbi" for parallel Online Viterbi.
4) Decode the results with "bin/mahout hmmchunks -u".

Sergey Bartunov added a comment -

The patch was generated against the trunk of Mahout's git repository.

Sergey Bartunov added a comment -

Uploaded a new patch with minor changes, and a performance chart obtained from project testing.
The test data consist of 8 100-MB sequence files, each divided into 5 chunks of 20 MB.
The data were generated from a 2-state model which requires only constant space for decoding. Both Viterbi and Online Viterbi were used to process the data in several modes.
Online Viterbi was also tested with a special "offline" model which requires O(K*N) space for decoding, but this test failed due to insufficient memory, which confirms that O(N*K) space is required. Both the "online" and "offline" models can be found in the unit tests for the Online Viterbi code.

Testing was performed on a somewhat unusual 5-node cluster (1 master node, 4 workers) with replication set to all nodes; this and some other questionable configuration choices could affect the testing, but it does show the scaling.

Sergey Bartunov added a comment -

Performance chart

Sergey Bartunov added a comment -

The last patch contains the finished parallel Viterbi functionality, with tests and documentation, and also an implementation of the Online Viterbi algorithm (see http://www.dcs.fmph.uniba.sk/diplomovky/obhajene/getfile.php/main.pdf?id=143&fid=289&type=application%2Fpdf) in both sequential and parallel versions.

Online Viterbi is able to output decoded hidden states as soon as possible (which means it could be used for decoding continuous streams), and for some concrete models it can do so using only constant space. Another advantage of the algorithm is that it doesn't require a backward pass when run in map-reduce, so there are half as many map-reduce jobs, which means less I/O and better performance.

It seems that this is the first implementation of the algorithm in open-source software, which is nice.

Sergey Bartunov added a comment -

Work on the parallel Viterbi functionality is almost finished; only the tests remain to be completed, and probably some more detailed documentation.

Sergey Bartunov added a comment -

git diff --no-prefix -U5 origin/trunk > viterbi.patch

Sergey Bartunov added a comment -

The backward pass implementation is complete. It requires slightly more careful saving of the resulting sequences, and the whole code should be refactored a little before proposing the patch. All the changes are in the github repository.

I could not run the performance test yet, because the cluster I wanted to use is experiencing some technical issues. I hope to perform the test a little later.

Grant Ingersoll added a comment -

Awesome! How goes the testing?

Sergey Bartunov added a comment -

The forward pass for Viterbi is complete, and it's under massive testing right now!

It's implemented in ParallelViterbiDriver. It consists of sequentially run map-reduce jobs, each taking the output of the previous chunk-processing phase as input. The main work is in the ForwardViterbiReducer class, which calculates the backpointer sub-array for the current (sequence, chunk), stores it into a SequenceFile (for the backward pass), and also emits the calculated probabilities.

I chose the SequenceFile format as the primary format.

Sergey Bartunov added a comment -

Sorry for the delay; I was sick and could not work on the project. Now I'm catching up on the lost time intensively. Here is the repository with the changes I made while working on the project: https://github.com/sbos/mahout

Here are some notes on what I've done:
1) During this time I had two exams relevant to the project: machine learning and optimization methods. While preparing for them I thought about my project and found one important mistake: the O((T*L/N)/P * K^2 + ((N-T)*L/N)/P * K^3) estimate is absolutely wrong!! It implies that if we perform "difficult processing" on chunks of total length N using P reducers, we get O(N * K^3 / P), so the algorithm scales linearly; and if we set P=2N, for example, we should get sub-N time. Actually, that's not true: no algorithm can process the entire sequence without doing at least something with each element. So the time estimate for any such algorithm is max(O(N), O(...)). That means the "difficult" case implementation is not reasonable and would just occupy the machines.
2) At the same time I found this paper http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.8361 about parallelizing Belief Propagation, a more general analog of Viterbi for loop-free Markov Random Fields. There I found a strict proof of the minimum running time and an algorithm for approximate segmentation called Residual Splash, which scales linearly and allows a trade-off between speed and accuracy. I think it would be very cool to implement it in map-reduce instead of the "difficult case". I'll think about it in the background.
3) I implemented a chunk processing utility (the PrepareChunk class) in the way other Mahout tools are implemented. It takes a set of input files/directories, splits them, and writes them into SequenceFiles, one for each chunk processing phase.
4) I tried to initiate communication with Dhruv to make our projects compatible at the interface level and to share some HMM experience; I hope it will work out.

Here's the detailed chunk processing scheme now:
1) The inputs are split into chunks as [(input_name, chunk_number), observations].
2) For each chunk number, map the chunk data and the initial probability data, which is formatted as [(input_name, chunk_number), probabilities]. Both observations and probabilities are instances of the GenericViterbiData class, so that they can appear in one reduce flow.
3) The reducer receives (input_name, chunk_number), [observations, probabilities], performs the forward algorithm on the observations with the provided initial probabilities (taken from the previous chunk), and emits (input_name, chunk_number+1) with the resulting probabilities. This output is saved in the input directory of the next chunk's input data. It also stores the optimal path matrix in sequence files, under the same keys.
4) After the last chunk has been processed by the forward algorithm, we perform the backward algorithm in the same manner, collecting the optimal path.

This scheme is much more effective than the naive approach I described in the proposal. It reduces I/O and makes the algorithm able to handle amounts of data that do not fit into memory; the only requirement is that a chunk of the desired size must fit.
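This forward/backward scheme can be illustrated with a small single-process sketch (Python with a made-up toy model, not the actual Java code; `forward_reduce` stands in for ForwardViterbiReducer, and the `stored` list stands in for the SequenceFiles holding the backpointer sub-arrays):

```python
import math

def forward_reduce(key, obs, log_pi, log_A, log_B):
    """Sketch of step 3): for one (input_name, chunk_number) key, run the
    forward Viterbi recursion over the chunk, keep the backpointer sub-array
    for the backward pass, and emit the probabilities under the next key."""
    name, n = key
    K = len(log_A)
    backptrs = []
    for o in obs:
        step, new_pi = [], []
        for j in range(K):
            # best predecessor state for hidden state j at this step
            i = max(range(K), key=lambda i: log_pi[i] + log_A[i][j])
            step.append(i)
            new_pi.append(log_pi[i] + log_A[i][j] + log_B[j][o])
        backptrs.append(step)
        log_pi = new_pi
    return (name, n + 1), log_pi, backptrs

def backward_pass(final_log_pi, chunks_backptrs):
    """Sketch of step 4): walk the stored backpointer sub-arrays from the
    last chunk to the first, collecting the optimal path."""
    k = max(range(len(final_log_pi)), key=lambda s: final_log_pi[s])
    path = [k]
    for backptrs in reversed(chunks_backptrs):
        for step in reversed(backptrs):
            k = step[k]
            path.append(k)
    return path[::-1]

# Toy 2-state model (hypothetical numbers), one input sequence, chunks of 2.
log = math.log
A = [[log(0.7), log(0.3)], [log(0.4), log(0.6)]]
B = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
obs = [0, 0, 1, 1, 0, 0]

# Seed with the initial probabilities and the first observation.
log_pi = [log(0.5) + B[k][obs[0]] for k in range(2)]

key, stored = ("seq-0", 0), []
rest = obs[1:]
for n in range(0, len(rest), 2):
    key, log_pi, bp = forward_reduce(key, rest[n:n + 2], log_pi, A, B)
    stored.append(bp)

path = backward_pass(log_pi, stored)
```

Only the K probabilities cross chunk boundaries; the (much larger) backpointer matrices stay keyed on disk until the backward pass, which is what keeps memory use bounded by the chunk size.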


People

• Assignee:
Isabel Drost-Fromm
Reporter:
Sergey Bartunov
Votes: 0
Watchers: 2

Dates

• Due:
Created:
Updated:
Resolved: