UIMA-2110

Turn the HMMTagger class into a more generic class for tagging tasks

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3
    • Fix Version/s: 2.3.1Addons
    • Component/s: Sandbox-Tagger
    • Labels:
      None
    • Environment:

      Description

      Despite its name, the org.apache.uima.examples.tagger.HMMTagger class is not
      fully independent of the POS tagging task.
      In addition, it assumes that the feature path to update with the tagging
      result is org.apache.uima.TokenAnnotation:posTag.

      We propose letting users specify, through a parameter, the feature path to set.
      This parameter is optional. If it is left unset, the tagger works as before,
      using org.apache.uima.TokenAnnotation:posTag as the default value.
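
      As a minimal sketch of the optional feature path parameter described above, the
      following plain Java helper splits a "type:feature" string and falls back to the
      default when nothing is configured. The class name FeaturePathParam and the example
      value my.types.Token:lemma are hypothetical. This is not the code of the attached
      patch and it does not use the UIMA API.

```java
// Hypothetical helper for illustration only; not taken from the patch.
final class FeaturePathParam {
    // Default value named in this issue.
    static final String DEFAULT_FEATURE_PATH = "org.apache.uima.TokenAnnotation:posTag";

    final String typeName;
    final String featureName;

    private FeaturePathParam(String typeName, String featureName) {
        this.typeName = typeName;
        this.featureName = featureName;
    }

    /** Parses the configured value; null or blank means "use the default". */
    static FeaturePathParam parse(String configured) {
        String value = (configured == null || configured.trim().isEmpty())
                ? DEFAULT_FEATURE_PATH
                : configured.trim();
        // The feature name follows the last ':'; both parts must be non-empty.
        int sep = value.lastIndexOf(':');
        if (sep <= 0 || sep == value.length() - 1) {
            throw new IllegalArgumentException("Expected <type>:<feature>, got: " + value);
        }
        return new FeaturePathParam(value.substring(0, sep), value.substring(sep + 1));
    }

    public static void main(String[] args) {
        FeaturePathParam byDefault = parse(null);
        System.out.println(byDefault.typeName + " / " + byDefault.featureName);

        // Assumed custom value, e.g. for a lemma model.
        FeaturePathParam custom = parse("my.types.Token:lemma");
        System.out.println(custom.typeName + " / " + custom.featureName);
    }
}
```

      Rejecting a malformed value early, rather than silently using the default, is a
      design choice of this sketch and not something the issue specifies.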

      We also propose adding three optional parameters: InputView, SentenceType and ModelFile.
      Since the HMM learner already lets users specify the view used to train a model,
      we give the tagger the same option. By default, it works on the _InitialView.
      This is quite useful in practice.

      The org.apache.uima.TokenAnnotation type is not the only annotation type that is assumed
      to be present in the CAS. The HMMTagger processes tokens sentence by sentence, and it uses
      org.apache.uima.SentenceAnnotation to select the tokens. The SentenceType parameter
      lets users specify their own sentence annotation type. The default value is
      org.apache.uima.SentenceAnnotation.
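
      The defaults for InputView and SentenceType can be sketched as follows. The
      parameter names and default values come from this issue. The TaggerSettings class,
      the plain Map standing in for the annotator configuration, and the example values
      are assumptions made for illustration and are not the patch's code.

```java
// Hypothetical settings holder; a java.util.Map stands in for the real configuration.
final class TaggerSettings {
    static final String DEFAULT_INPUT_VIEW = "_InitialView";
    static final String DEFAULT_SENTENCE_TYPE = "org.apache.uima.SentenceAnnotation";

    final String inputView;
    final String sentenceType;

    private TaggerSettings(String inputView, String sentenceType) {
        this.inputView = inputView;
        this.sentenceType = sentenceType;
    }

    /** Reads both optional parameters, applying the defaults when they are unset. */
    static TaggerSettings from(java.util.Map<String, String> params) {
        return new TaggerSettings(
                orDefault(params.get("InputView"), DEFAULT_INPUT_VIEW),
                orDefault(params.get("SentenceType"), DEFAULT_SENTENCE_TYPE));
    }

    private static String orDefault(String value, String fallback) {
        return (value == null || value.trim().isEmpty()) ? fallback : value.trim();
    }

    public static void main(String[] args) {
        java.util.Map<String, String> params = new java.util.HashMap<String, String>();
        // Only the sentence type is overridden; the view keeps its default.
        params.put("SentenceType", "my.types.Sentence"); // assumed custom type name
        TaggerSettings settings = from(params);
        System.out.println(settings.inputView + " / " + settings.sentenceType);
    }
}
```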

      The ModelFile parameter is an alternative to the resource declaration for specifying a model.
      If left empty, it is ignored. Otherwise it takes precedence over the resource declaration.
      When it is specified, multiple deployment of the tagger is not possible, but in practice it may be easier for users to configure a parameter through Eclipse.
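
      The precedence rule for ModelFile can be sketched like this. The ModelSource class
      and the "file:" / "resource:" prefixes are hypothetical and only make the chosen
      source visible. How the patch actually loads the model is not shown in this issue.

```java
// Hypothetical resolver illustrating "ModelFile wins over the resource declaration".
final class ModelSource {
    /**
     * Returns a label for the model source to use.
     * A non-empty ModelFile value takes precedence; an empty one is ignored.
     */
    static String resolve(String modelFileParam, String declaredResource) {
        if (modelFileParam != null && !modelFileParam.trim().isEmpty()) {
            return "file:" + modelFileParam.trim();
        }
        if (declaredResource != null && !declaredResource.trim().isEmpty()) {
            return "resource:" + declaredResource.trim();
        }
        // Failing here is an assumption of this sketch.
        throw new IllegalStateException("No model configured");
    }

    public static void main(String[] args) {
        // Both paths below are made-up examples.
        System.out.println(resolve("", "models/declared.dat"));
        System.out.println(resolve("/tmp/custom.dat", "models/declared.dat"));
    }
}
```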

      Two distinct patches will be provided, one for the class and the other for the descriptor.

      A future improvement of the class might make it possible to create new annotations, not only to update existing ones.
      A future improvement of the descriptor might separate what belongs to the generic tagger from what is specific to the POS tagger.

      1. AMoreGenericHMMTaggerDesc.patch
        11 kB
        Nicolas Hernandez
      2. AMoreGenericHMMTaggerSrcClass.patch
        9 kB
        Nicolas Hernandez
      3. UIMA2110updated.patch
        14 kB
        Tommaso Teofili

        Activity

        Marshall Schor made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Tommaso Teofili made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Tommaso Teofili made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Tommaso Teofili added a comment -

        Missing license header in the new descriptor.

        Tommaso Teofili made changes -
        Fix Version/s 2.3.1Addons [ 12316093 ]
        Tommaso Teofili made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Tommaso Teofili [ teofili ]
        Resolution Fixed [ 1 ]
        Tommaso Teofili added a comment -

        Thanks Nicolas for contributing this

        Tommaso Teofili added a comment -

        I tested it and verified that there is no regression when this new version is used the "old way" too.
        I'm going to commit it shortly.

        Tommaso Teofili made changes -
        Attachment UIMA2110updated.patch [ 12485127 ]
        Tommaso Teofili added a comment -

        I updated the patch and the tests run correctly. Now I am going to test this patch in a running system.

        Nicolas Hernandez added a comment -

        More information about the process used to build the models can be found here: http://enicolashernandez.blogspot.com/2011/05/construire-des-modelisations-du-french.html
        (incidentally, it is in French)

        By the way, what about the current submission? Would it have been better to split the submission of the various generic parameters? For example, on the one hand, the ones that handle the view, the sentence type and the feature path of the annotation produced by tagging, and on the other hand the mechanism for managing models by parameter.

        Let me know.

        Tommaso Teofili made changes -
        Fix Version/s 2.3.1 [ 12314751 ]
        Fix Version/s 2.3.1Addons [ 12316093 ]
        Tommaso Teofili added a comment -

        Thanks Nicolas for the detailed explanation. I am looking forward to reading more about how you created the resources. In the meantime I will test the basic usage of the HMMTagger and the advanced training capability on top of the attached patches.

        Nicolas Hernandez added a comment -

        Hi Tommaso,

        Yes, we actually used the HMMTagger to train some models. We used the French Treebank (FTB) for that: http://www.llf.cnrs.fr/Gens/Abeille/French-Treebank-fr.php

        We obtained French models for tagging POS, morphological information and also lemmas.
        And it works fine! (An HMM tagger is probably not a judicious choice for predicting lemmas, but we tested it since the processing chain and the data were available.) The FTB offers some secondary information that we have not tested yet.

        For a user who knows that the task can be solved by an HMM but does not want to know how, the HMM trainer and tagger are really easy to use. For all other cases, ClearTk is probably a better solution, but it requires development skills and takes more time to get into.
        Indeed, the current HMM trainer implementation uses only a few features (n-grams, suffixes, lowercase/uppercase text in some configurations), whereas ClearTk offers many more configurable features.

        About the resources we produced: so far, the license attached to the FTB is unclear regarding the distribution of the models trained on it. We are not sure that we can release them under the Apache License. Our attempt to obtain this right from the authors of the corpus has not succeeded yet.

        Soon I will post on my blog the procedure to create the resources, so that anyone will be able to do it themselves. I used a couple of nice AEs: one that turns any XML structure into CAS annotations, and one that maps any annotation to another depending on some constraint declarations. The latter is already released under the Apache license; the former will be quite soon.
        I will also release the models in accordance with the corpus license, which allows use of the corpus for research purposes.

        Tommaso Teofili made changes -
        Fix Version/s 2.3.1Addons [ 12316093 ]
        Tommaso Teofili added a comment -

        Hi Nicolas,
        this is a nice improvement which gives HMMTagger a wider range of applications. Did you try to train or use models for tasks other than PoS tagging with this patch? I'd be curious to know what tests have been done and what the results were.

        Nicolas Hernandez made changes -
        Attachment AMoreGenericHMMTaggerDesc.patch [ 12475625 ]
        Nicolas Hernandez added a comment -

        A patch for the HMMTagger descriptor (with the new parameter definitions)

        Nicolas Hernandez made changes -
        Field Original Value New Value
        Attachment AMoreGenericHMMTaggerSrcClass.patch [ 12475624 ]
        Nicolas Hernandez added a comment -

        A patch to make HMMTagger.java more generic

        Nicolas Hernandez created issue -

          People

          • Assignee:
            Tommaso Teofili
            Reporter:
            Nicolas Hernandez
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              1.5h
              Remaining:
              1.5h
              Logged:
              Not Specified
