This is a project proposal for a summer-term university project to write a (sequential) HMM implementation for Mahout. Five students will work on this project as part of a course mentored by Isabel Drost.
Abstract:
Hidden Markov Models are used in many areas of machine learning, such as speech recognition, handwritten letter recognition, and natural language processing. A Hidden Markov Model (HMM) is a statistical model of a process consisting of two (in our case discrete) random variables O and Y, which change their state sequentially. The variable Y with states
{y_1, ... , y_n}
is called the "hidden variable", since its state is not directly observable. The state of Y changes sequentially with a so-called (in our case first-order) Markov property: the state-change probability of Y depends only on its current state and does not change over time. Formally we write: P(Y(t+1)=y_i | Y(0)...Y(t)) = P(Y(t+1)=y_i | Y(t)) = P(Y(2)=y_i | Y(1)). The variable O with states
{o_1, ... , o_m}
is called the "observable variable", since its state can be directly observed. O does not have a Markov property, but its state probability depends solely on the current state of Y.
Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing the initial hidden-state probabilities, A is the n x n "transition matrix" containing the transition probabilities, such that A[i,j] = P(Y(t)=y_i | Y(t-1)=y_j), and B is the m x n "observation matrix" containing the observation probabilities, such that B[i,j] = P(O=o_i | Y=y_j).
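To make the tuple concrete, here is a minimal sketch of such a model as a plain Java data holder (field names and layout are illustrative assumptions, not the eventual Mahout API):

/**
 * Minimal HMM parameter holder, M = (n, m, P, A, B). Illustrative sketch only.
 * Conventions follow the text: A[i][j] = P(Y(t)=y_i | Y(t-1)=y_j) and
 * B[i][j] = P(O=o_i | Y=y_j), so each column of A and B sums to 1.
 */
public class HmmModelSketch {
  final int numHiddenStates;           // n
  final int numObservableStates;       // m
  final double[] initialProbabilities; // P, length n
  final double[][] transitionMatrix;   // A, n x n
  final double[][] observationMatrix;  // B, m x n

  public HmmModelSketch(double[] p, double[][] a, double[][] b) {
    this.initialProbabilities = p;
    this.transitionMatrix = a;
    this.observationMatrix = b;
    this.numHiddenStates = p.length;
    this.numObservableStates = b.length;
  }
}

A consistency check on such a model would verify that P and every column of A and B are non-negative and sum to 1.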
Rabiner [1] defined three main problems for HMMs:
Evaluation: Given a sequence O of observations and a model M, what is the probability P(O|M) that sequence O was generated by model M? The Evaluation problem can be efficiently solved using the Forward algorithm.
Decoding: Given a sequence O of observations and a model M, what is the most likely sequence Y* = argmax_Y P(O|M,Y) of hidden states to have generated this sequence? The Decoding problem can be efficiently solved using the Viterbi algorithm.
Learning: Given a sequence O of observations, what is the most likely model M* = argmax_M P(O|M) to have generated this sequence? The Learning problem can be efficiently solved using the Baum-Welch algorithm.
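To illustrate the Evaluation problem, here is a hedged sketch of the Forward algorithm in plain Java, following the matrix conventions defined above; this is the textbook recursion from Rabiner [1], not the eventual Mahout implementation:

/** Forward algorithm: returns P(O|M) for observation sequence o (state indices). */
static double forward(double[] p, double[][] a, double[][] b, int[] o) {
  int n = p.length;
  double[] alpha = new double[n];
  for (int i = 0; i < n; i++) {
    alpha[i] = p[i] * b[o[0]][i];          // alpha_0(i) = P[i] * B[o_0, i]
  }
  for (int t = 1; t < o.length; t++) {
    double[] next = new double[n];
    for (int i = 0; i < n; i++) {
      double sum = 0.0;
      for (int j = 0; j < n; j++) {
        sum += a[i][j] * alpha[j];         // sum over predecessor states j
      }
      next[i] = b[o[t]][i] * sum;
    }
    alpha = next;
  }
  double result = 0.0;
  for (int i = 0; i < n; i++) {
    result += alpha[i];                    // P(O|M) = sum_i alpha_T(i)
  }
  return result;
}

Replacing the inner sum by a maximum (and recording the maximizing predecessor) turns this into the Viterbi recursion for the Decoding problem.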
The target of each milestone is the implementation of the stated goals, together with the respective documentation and unit tests.
Timeline
Mid-May 2010 to mid-July 2010
Milestones
I) Define an HMM class based on the Apache Mahout Math package, offering interfaces to set model parameters, perform consistency checks, and perform output prediction.
1 week from May 18th till May 25th.
II) Write sequential implementations of the forward (cf. problem 1 [1]) and backward algorithms.
2 weeks from May 25th till June 8th.
III) Write a sequential implementation of the Viterbi algorithm (cf. problem 2 [1]), based on the existing forward algorithm implementation.
2 weeks from June 8th till June 22nd
IV) Have a running sequential implementation of the Baum-Welch algorithm for model parameter learning (cf. problem 3 [1]), based on the existing forward/backward algorithm implementations.
2 weeks from June 8th till June 22nd
V) Provide a usage example of HMM implementation, demonstrating all three problems.
2 weeks from June 22nd till July 6th
VI) Finalize documentation and implementation, tie up loose ends.
1 week from July 6th till July 13th
References:
[1] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
Ted Dunning
added a comment -
Great stuff.
Hopefully it will be possible to adapt this to a parallel implementation. Since the Baum-Welch algorithm is an instance of an EM algorithm, a map-reduce implementation should be simple, much as it is with k-means. Moreover, much of the code in a map-reduce implementation would be shared with a sequential version.
Max Heimel
added a comment - This patch adds the base HMM model class and a prediction mechanism to predict a sequence of output states from a model. Additionally, stubs for the further implementation are added. We still have to add unit tests for the HMM base class; this will be done with the next update (adding the forward/backward algorithms and further helper methods).
Robin Anil
added a comment - A few comments on the code style. See the Mahout core classes for the code and naming conventions:
1. package org.apache.mahout.sequenceLearning.hmm;
remove the capitalization and change it to package org.apache.mahout.classifier.sequencelearning.hmm;
2. Use the Eclipse code style sheet.
3. java.util.Vector<Integer> outputSequence
don't use java.util.Vector; it is easily confused with the Mahout Vector. For integer arrays use the super-fast o.a.m.math.list.IntArrayList.
4. Before you put up the final patch, add the Apache License too.
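A minimal sketch of the swap suggested in point 3; the method names (add, get, size, elements) are assumed from the Colt-derived Mahout collections API and should be double-checked against the actual class:

import org.apache.mahout.math.list.IntArrayList;

// Instead of java.util.Vector<Integer> outputSequence (boxed and synchronized):
IntArrayList outputSequence = new IntArrayList();
outputSequence.add(3);                     // stores primitive ints, no boxing
outputSequence.add(7);
int first = outputSequence.get(0);
int count = outputSequence.size();
int[] backing = outputSequence.elements(); // backing array; may be longer than size()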
Max Heimel
added a comment - Worked in the suggestions made by Robin Anil (moved to classifier/sequencelearning/hmm, used the Eclipse code style sheet, added the Apache license, used IntArrayList instead of java.util.Vector<Integer>).
Max Heimel
added a comment - With this patch, the HMM implementation contains the forward and backward algorithms plus unit tests. There might, however, still be interface changes in how the HmmAlgorithms object is handled...
Max Heimel
added a comment - I just submitted the latest HMM patch, which contains implementations of the forward/backward and Viterbi algorithms. This allows solving the decoding and evaluation problems for a Hidden Markov model. The next step on the HMM agenda will be the Baum-Welch algorithm, to tackle the learning problem.
The open to-dos are:
1) Implement the Baum-Welch algorithm and write unit tests for the implementation. (I hope to finish this task by next week...)
2) Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefulness. (Where to find good data sets?)
3) Implement methods for serializing/deserializing an HmmModel. (To file? To string? Both?)
4) Write logarithmically scaled versions of the HMM-related algorithms to better handle numerical issues with very low probabilities (see the log-sum-exp sketch after this comment).
5) Think about adding interfaces to the HmmEvaluator methods that accept state names instead of state numbers, to make the HMM implementation more convenient to use.
Do you have any further remarks, comments, suggestions?
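Regarding to-do 4, the usual device is to store log-probabilities and combine them with a log-sum-exp helper; a minimal sketch (the helper is hypothetical, not part of the patch):

// Computes log(exp(a) + exp(b)) without underflow; a and b are log-probabilities.
static double logSumExp(double a, double b) {
  if (a == Double.NEGATIVE_INFINITY) { return b; }
  if (b == Double.NEGATIVE_INFINITY) { return a; }
  double max = Math.max(a, b);
  return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
}

// In the log-scaled forward recursion, the sum over predecessor states becomes:
// logAlpha[t][i] = logB[o[t]][i] + logSumExp over j of (logA[i][j] + logAlpha[t-1][j])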
Qiuyan Xu
added a comment - Marc and I will take on the second task:
Find real-world examples that make use of the HMM implementation to demonstrate both usage and usefulness.
Max Heimel
added a comment - Patch containing the full HMM implementation. This includes implementations of the forward, backward, and Viterbi algorithms and three learning algorithms (supervised, Viterbi, Baum-Welch). All algorithms are available in normal and log-scaled variants. The HMM model can now be exported to JSON.
This patch also contains a small demo application that implements a POS tagger using the HMM implementation. Using the test/training data sets from http://flexcrfs.sourceforge.net/#Case_Study, this demo application achieves a tagging accuracy of 94% with supervised learning. The learning and tagging process takes less than a second on a 2.6 GHz AMD dual-core processor with 5 GB of RAM running Ubuntu 10.04.
Max Heimel
added a comment - The latest patch contains the finished implementation and test suite of sequential Hidden Markov Models for Apache Mahout and should be ready for review.
The implementation now offers all the required functionality (initialization, evaluation, training, loading/storing) and should work fairly efficiently. All algorithms can optionally use log-likelihoods, trading computation time for increased numerical stability on long output sequences. Models can be serialized to and deserialized from JSON. Three training algorithms covering different usage scenarios (supervised, Viterbi, Baum-Welch) are available.
I included a small demo program (org.apache.mahout.classifier.sequencelearning.hmm.example.PosTagger.java) that demonstrates usage of the model class by implementing a simple part-of-speech tagger using the training/test data sets from http://flexcrfs.sourceforge.net/#Case_Study. Using simple supervised learning on the training data set and assuming unknown words to be of type NNP (proper noun, singular), the tagger reaches a test accuracy of 94%. The program output with timings on my Athlon 2.6 GHz dual core with 4 GB of RAM running Ubuntu 10.04 is:
"Using URL: http://www.jaist.ac.jp/~hieuxuan/flexcrfs/CoNLL2000-NP/train.txt
Reading and parsing training data file ... done in 22.479 seconds!
Read 211727 lines containing 8936 sentences with a total of 19122 distinct words and 43 distinct POS tags.
Training HMM model ... done in 0.719 seconds!
Reading and parsing test data file ... done in 3.189 seconds!
Read 47377 lines containing 2012 sentences.
POS tagging test data file ... done in 0.475 seconds!
Tagged the test file with an error rate of: 0.060176879076345065"
There are still some open ends that need to be looked into. The major points are:
1) Redesign HmmAlgorithms to handle SparseMatrix/SparseVector more efficiently (via nonZeroIterator).
2) Serializing and deserializing a model is currently possible using JSON. However, for large models this becomes highly inefficient: serializing the model created in the PosTagger example, for instance, results in an 18 MB JSON file that takes longer to deserialize than reconstructing the model from the training data. It would thus probably be a good idea to look into binary serialization/deserialization.
3) Convergence check for Viterbi/Baum-Welch: currently the convergence check uses an "ad-hoc" matrix difference. Since Baum-Welch is an EM algorithm, it would probably be mathematically sounder to use the model likelihood for the convergence check; a sketch follows below.
4) Parallelization: the traditional algorithms (forward, backward, Viterbi, Baum-Welch) are probably hard to parallelize using M/R - there is some prior work on parallelization, but I haven't looked into it closely yet. A more suitable approach to HMM parallelization would be to train in parallel on multiple sequences (e.g. one per mapper) and then merge the resulting HMMs in the reducer step.
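For point 3, a sketch of what a likelihood-based stopping rule could look like; reestimateParameters and computeLogLikelihood are hypothetical placeholders for one Baum-Welch step and the forward-algorithm likelihood:

// EM guarantees that log P(O|M) is non-decreasing across iterations, so stopping
// once the improvement drops below epsilon is a sound convergence criterion.
double previous = Double.NEGATIVE_INFINITY;
for (int iteration = 0; iteration < maxIterations; iteration++) {
  reestimateParameters(model, observations);                  // one EM step (hypothetical)
  double current = computeLogLikelihood(model, observations); // e.g. via the forward algorithm
  if (current - previous < epsilon) {
    break; // converged
  }
  previous = current;
}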
Qiuyan Xu
added a comment - We ran another test over the same training and test data with a purely statistical approach: each word that also occurs in the training data is assigned the POS tag it appeared with most frequently, and all unknown words are assigned the tag NNP. The precision turned out to be 91.7%.
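A hedged sketch of that most-frequent-tag baseline in plain Java (the data layout is an assumption; this is not the actual test code):

import java.util.HashMap;
import java.util.Map;

/** Maps each training word to the POS tag it occurred with most frequently. */
static Map<String, String> buildBaseline(String[][] taggedTraining) {
  // taggedTraining[k] = {word, tag}, one entry per token in the training data
  Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();
  for (String[] pair : taggedTraining) {
    Map<String, Integer> tagCounts = counts.get(pair[0]);
    if (tagCounts == null) {
      tagCounts = new HashMap<String, Integer>();
      counts.put(pair[0], tagCounts);
    }
    Integer seen = tagCounts.get(pair[1]);
    tagCounts.put(pair[1], seen == null ? 1 : seen + 1);
  }
  Map<String, String> baseline = new HashMap<String, String>();
  for (Map.Entry<String, Map<String, Integer>> word : counts.entrySet()) {
    String bestTag = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> tag : word.getValue().entrySet()) {
      if (tag.getValue() > bestCount) {
        bestTag = tag.getKey();
        bestCount = tag.getValue();
      }
    }
    baseline.put(word.getKey(), bestTag);
  }
  return baseline;
}

// Tagging: String tag = baseline.containsKey(word) ? baseline.get(word) : "NNP";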
Isabel Drost-Fromm
added a comment - Patch applies cleanly with "-p1" (it was generated from a git clone of Mahout), builds, and all tests are green after applying it.
Some comments after taking an initial look at the code:
Please don't use "System.out.println" anywhere in the code - use Loggers instead.
I still see quite a few int arrays in the example code. Could you please explain why you did not use the o.a.m.math.list.IntArrayList as proposed by Robin?
The rest of the code looks good to me - though I am to be considered highly subjective in that matter. I would be glad if a second Mahout developer had time to take a look.
Just for the record: the current implementation is sequential-only. During the past few weeks I have come across a few publications that might be interesting for follow-up work: "Scaling the iHMM: Parallelization versus Hadoop". In addition, there seems to have been work in the direction of spectral learning algorithms for HMMs that might be interesting: "Hilbert Space Embeddings of Hidden Markov Models" (L. Song, B. Boots, S. Siddiqi, G. Gordon, A. Smola) and "A spectral algorithm for learning hidden Markov models" (D. Hsu, S. Kakade, T. Zhang).
Max Heimel
added a comment - Sorry for the long delay; I have some exams coming up, so I had to focus on studying.
Here are my comments regarding your review (I will comment on the parallelization papers once I have read through them):
I have fixed the issue with using System.out by replacing it with calls to LOG.info(...) in the PosTagger example.
I have fixed some build errors within the unit tests that must have come up since you reviewed the code.
I moved the PosTagger example to the mahout-examples subproject.
Regarding the IntArrayList: there actually was a problem I encountered that made me switch to int[] - however, I cannot recall it. Shame on me. Since int[] is only used in places where the array size is guaranteed not to change, and since you can easily get an int[] from an IntArrayList, I don't actually see an advantage of IntArrayList over int[]. If I am assessing this wrong, please let me know and I will look into replacing the int[] with IntArrayList objects.
I have attached the latest patch.
Isabel Drost-Fromm
added a comment - Patch should be applied with -p1 set.
One more sweep over the patch to remove style issues: fixed indentation, a few variable names, and some typos in the documentation; added documentation where missing; removed unused imports and variables.
Max, please be so kind as to double-check that nothing broke in the course of the code cleanup and, especially, that the documentation is still correct. I marked one particular method that had an unclear comment with a TODO.
Sean, as you have done so much work in recent weeks to get our code base clean, it would be really great if you could double-check that I neither missed a problem nor introduced new ones. I tried to stick with our checkstyle coding conventions, findbugs configuration, and the additional checks that IntelliJ has built in.
Hudson
added a comment - Integrated in Mahout-Quality #319 (See https://hudson.apache.org/hudson/job/Mahout-Quality/319/ )
MAHOUT-396 - add HMM support for sequence classification. Thanks to Max Heimel, Marc Hofer, Qiuyan Xu and Van Long Nguyen for contributing the patch.
{"report":{"fcp":4104.100000023842,"ttfb":933.1999999284744,"pageVisibility":"visible","entityId":12464627,"key":"jira.project.issue.view-issue","isInitial":true,"threshold":1000,"elementTimings":{},"userDeviceMemory":8,"userDeviceProcessors":16,"apdex":0,"journeyId":"a7edb01c-705b-4d1b-adde-f2303ff752d8","navigationType":0,"readyForUser":4352.299999952316,"redirectCount":0,"resourceLoadedEnd":3951.600000023842,"resourceLoadedStart":940.3999999761581,"resourceTiming":[{"duration":401.39999997615814,"initiatorType":"link","name":"https://issues.apache.org/jira/s/b62489a2eaac59d9b8a093c1a51d034f-CDN/xd97tr/820010/13pdxe5/49fa3aa3d35a2cc689cbf274e66cc41a/_/download/contextbatch/css/_super/batch.css","startTime":940.3999999761581,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":940.3999999761581,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1341.7999999523163,"responseStart":0,"secureConnectionStart":0},{"duration":401.5,"initiatorType":"link","name":"https://issues.apache.org/jira/s/56490edcf9d54e35149505f78cca6a47-CDN/xd97tr/820010/13pdxe5/72cb823bcc50211a60c1ebe830467cae/_/download/contextbatch/css/jira.browse.project,jira.view.issue,project.issue.navigator,atl.general,atl.global,jira.global,jira.general,-_super/batch.css?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&richediton=true&slack-enabled=true","startTime":940.6999999284744,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":940.6999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1342.1999999284744,"responseStart":0,"secureConnectionStart":0},{"duration":1259.3000000715256,"initiatorType":"script","name":"https://issues.apache.org/jira/s/5263129088916436ab9aeb2417075b3f-CDN/xd97tr/820010/13pdxe5/49fa3aa3d35a2cc689cbf274e66cc41a/_/download/contextbatch/js/_super/batch.js?locale=en-UK","startTime":940.7999999523163,"connectEnd":940.7999999523163,"connectStart":940.7999999523163,"domainLookupEnd":940.7999999523163,"domainLookupStart":940.7999999523163,"fetchStart":940.7999999523163,"redirectEnd":0,"redirectStart":0,"requestStart":1345,"responseEnd":2200.100000023842,"responseStart":1486.5,"secureConnectionStart":940.7999999523163},{"duration":2254.1999999284744,"initiatorType":"script","name":"https://issues.apache.org/jira/s/611c208bd094adb71a6f4f3e7f6fff3d-CDN/xd97tr/820010/13pdxe5/72cb823bcc50211a60c1ebe830467cae/_/download/contextbatch/js/jira.browse.project,jira.view.issue,project.issue.navigator,atl.general,atl.global,jira.global,jira.general,-_super/batch.js?agile_global_admin_condition=true&jag=true&jira.create.linked.issue=true&locale=en-UK&richediton=true&slack-enabled=true","startTime":941,"connectEnd":1701.5,"connectStart":1344.7999999523163,"domainLookupEnd":1344.7999999523163,"domainLookupStart":1344.7999999523163,"fetchStart":941,"redirectEnd":0,"redirectStart":0,"requestStart":1703,"responseEnd":3195.1999999284744,"responseStart":1859.6999999284744,"secureConnectionStart":1486.2999999523163},{"duration":885.5,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:calendar-en/jira.webresources:calendar-en.js","startTime":941.1999999284744,"connectEnd":1702.1000000238419,"connectStart":1347.6000000238419,"domainLookupEnd":1347.6000000238419,"domainLookupStart":1347.6000000238419,"fetchStart":941.1999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":1703.1000000238
419,"responseEnd":1826.6999999284744,"responseStart":1824.8999999761581,"secureConnectionStart":1486},{"duration":1041.2999999523163,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:calendar-localisation-moment/jira.webresources:calendar-localisation-moment.js","startTime":941.3999999761581,"connectEnd":1826.6000000238419,"connectStart":1536.7999999523163,"domainLookupEnd":1536.7999999523163,"domainLookupStart":1536.7999999523163,"fetchStart":941.3999999761581,"redirectEnd":0,"redirectStart":0,"requestStart":1826.8999999761581,"responseEnd":1982.6999999284744,"responseStart":1979.5,"secureConnectionStart":1701.5},{"duration":406.2999999523163,"initiatorType":"link","name":"https://issues.apache.org/jira/s/981f587853769311cda7c3b845131a06-CDN/xd97tr/820010/13pdxe5/cb5a5495a038c0744457f25821ba9ee8/_/download/contextbatch/css/jira.global.look-and-feel,-_super/batch.css","startTime":941.5,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":941.5,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1347.7999999523163,"responseStart":0,"secureConnectionStart":0},{"duration":1041.2000000476837,"initiatorType":"script","name":"https://issues.apache.org/jira/rest/api/1.0/shortcuts/820010/5840efff50357da9055d4714dc0713f/shortcuts.js?context=issuenavigation&context=issueaction","startTime":941.6999999284744,"connectEnd":1826.3999999761581,"connectStart":1536.7999999523163,"domainLookupEnd":1536.7999999523163,"domainLookupStart":1536.7999999523163,"fetchStart":941.6999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":1826.8999999761581,"responseEnd":1982.8999999761581,"responseStart":1978.6000000238419,"secureConnectionStart":1701.7999999523163},{"duration":554.3000000715256,"initiatorType":"link","name":"https://issues.apache.org/jira/s/3ac36323ba5e4eb0af2aa7ac7211b4bb-CDN/xd97tr/820010/13pdxe5/efa42a25652b26dfd802540c024826b3/_/download/contextbatch/css/com.atlassian.jira.projects.sidebar.init,-_super,-jira.view.issue,-project.issue.navigator/batch.css?jira.create.linked.issue=true&richediton=true","startTime":991.1999999284744,"connectEnd":0,"connectStart":0,"domainLookupEnd":0,"domainLookupStart":0,"fetchStart":991.1999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":0,"responseEnd":1545.5,"responseStart":0,"secureConnectionStart":0},{"duration":991.3999999761581,"initiatorType":"script","name":"https://issues.apache.org/jira/s/efa8931cd5ac13ed95c56ca8a1dc1967-CDN/xd97tr/820010/13pdxe5/efa42a25652b26dfd802540c024826b3/_/download/contextbatch/js/com.atlassian.jira.projects.sidebar.init,-_super,-jira.view.issue,-project.issue.navigator/batch.js?jira.create.linked.issue=true&locale=en-UK&richediton=true","startTime":991.5,"connectEnd":991.5,"connectStart":991.5,"domainLookupEnd":991.5,"domainLookupStart":991.5,"fetchStart":991.5,"redirectEnd":0,"redirectStart":0,"requestStart":1827,"responseEnd":1982.8999999761581,"responseStart":1980,"secureConnectionStart":991.5},{"duration":1295.0999999046326,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:bigpipe-js/jira.webresources:bigpipe-js.js","startTime":1064.6000000238419,"connectEnd":1064.6000000238419,"connectStart":1064.6000000238419,"domainLookupEnd":1064.6000000238419,"domainLookupStart":1064.6000000238419,"fetchStart":1064.6000000238419,"redirectEnd":0,"redire
ctStart":0,"requestStart":2222.7999999523163,"responseEnd":2359.6999999284744,"responseStart":2358.100000023842,"secureConnectionStart":1064.6000000238419},{"duration":2886.9000000953674,"initiatorType":"script","name":"https://issues.apache.org/jira/s/d41d8cd98f00b204e9800998ecf8427e-CDN/xd97tr/820010/13pdxe5/1.0/_/download/batch/jira.webresources:bigpipe-init/jira.webresources:bigpipe-init.js","startTime":1064.6999999284744,"connectEnd":1064.6999999284744,"connectStart":1064.6999999284744,"domainLookupEnd":1064.6999999284744,"domainLookupStart":1064.6999999284744,"fetchStart":1064.6999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":3811.100000023842,"responseEnd":3951.600000023842,"responseStart":3951,"secureConnectionStart":1064.6999999284744},{"duration":1434.3000000715256,"initiatorType":"xmlhttprequest","name":"https://issues.apache.org/jira/rest/webResources/1.0/resources","startTime":2506.6999999284744,"connectEnd":2506.6999999284744,"connectStart":2506.6999999284744,"domainLookupEnd":2506.6999999284744,"domainLookupStart":2506.6999999284744,"fetchStart":2506.6999999284744,"redirectEnd":0,"redirectStart":0,"requestStart":3823.100000023842,"responseEnd":3941,"responseStart":3940.399999976158,"secureConnectionStart":2506.6999999284744}],"fetchStart":0,"domainLookupStart":540,"domainLookupEnd":558,"connectStart":558,"connectEnd":783,"secureConnectionStart":669,"requestStart":783,"responseStart":933,"responseEnd":1063,"domLoading":939,"domInteractive":4625,"domContentLoadedEventStart":4625,"domContentLoadedEventEnd":4710,"domComplete":5894,"loadEventStart":5894,"loadEventEnd":5897,"userAgent":"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)","marks":[{"name":"bigPipe.sidebar-id.start","time":4442},{"name":"bigPipe.sidebar-id.end","time":4442.899999976158},{"name":"bigPipe.activity-panel-pipe-id.start","time":4443.100000023842},{"name":"bigPipe.activity-panel-pipe-id.end","time":4449.100000023842},{"name":"activityTabFullyLoaded","time":4735}],"measures":[],"correlationId":"580de957970841","effectiveType":"4g","downlink":10,"rtt":0,"serverDuration":137,"dbReadsTimeInMs":3,"dbConnsTimeInMs":13,"applicationHash":"ace47f9899e9ee25d7157d59aa17ab06aee30d3d","experiments":[]}}
Great stuff.
Hopefully it will be possible to adapt this to a parallel implementation. Since the Baum-Welch algorithm is an instance of an EM algorithm map-reduce implementation should be simple, much as it is with k-means. Moreover, much of the code in a map-reduce implementation would be shared with a sequential version.