  1. Stanbol
  2. STANBOL-733 Stanbol NLP processing
  3. STANBOL-739

Migrate the Celi Lemmatizer Engine to use the AnalyzedText contentPart

    Details

      Description

      The CELI Lemmatizer enhancement engine currently writes its results directly to the metadata of the ContentItem. As the new AnalyzedText content part is much better suited to representing this data, the engine should be adapted to use the new content part.

      1. myPatch.diff
        94 kB
        Alessio Bosca

        Activity

        alessio.bosca Alessio Bosca added a comment -

        Dear Rupert,

        thanks for the good news. I have not forgotten the integration of the
        SentimentAnalysis service into a new engine. I've been busy with another
        project, but I'm planning to work on it as soon as possible (probably
        next week). Sorry for the delay!

        Bests
        Alessio


        *************************************
        Alessio Bosca, Ph.D.
        CELI s.r.l.
        Via San Quintino 31
        10121 Torino
        Tel. +39 011.562.71.15
        Fax +39 011.506.40.86
        http://www.celi.it
        *************************************

        rwesten Rupert Westenthaler added a comment -

        With http://svn.apache.org/viewvc?rev=1412121&view=rev this is considered fixed. There is still a lot of room for improvement, and the MorphoFeatures will most likely undergo more revisions (especially once a second lemmatizer is implemented), but those things should be done in new issues.

        The current version is fully functional. Thanks to Alessio for all the help, not only with this engine but also with defining the morpho features for STANBOL-734!

        rwesten Rupert Westenthaler added a comment -

        With revision 1400003 [1] there is a first version of a CELI lemmatizer engine that writes its results to the AnalyzedText ContentPart (STANBOL-734): CeliAnalyzedTextLemmatizerEngine.

        The CeliLemmatizerEnhancementEngine with completeMorphoAnalysis enabled still adds the same morpho-related triples to the Stanbol enhancement RDF graph (as contributed by the patch). This preserves backward compatibility for users currently using the CELI lemmatizer engine. However, adding word-level NLP results to the Enhancement metadata will be deprecated as soon as STANBOL-733 is reintegrated with the branch.

        In addition to that:

        • The API changes Alessio implemented in CeliMorphoFeatures were moved over to the enhancer.nlp.morpho.MorphoFeatures class. CeliMorphoFeatures now extends this class.
        • The CeliTagSetRegistry API is now based on Tags and no longer allows direct access to the TagSets. In addition, it does not support unmapped Tags.
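
        As a rough illustration of the second point, a Tag-based registry might look like the sketch below. All names here (Category, PosTag, TagRegistry) are hypothetical and are not the actual Stanbol or CELI classes. The sketch only shows the idea: callers look up Tags, never the underlying TagSet, and a Tag cannot exist without a mapped category.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical coarse categories; NOT the actual Stanbol LexicalCategory enum. */
enum Category { NOUN, VERB, ADJECTIVE }

/** A tag string together with the category it is mapped to. */
final class PosTag {
    private final String tag;
    private final Category category;

    PosTag(String tag, Category category) {
        // Unmapped tags are rejected: every tag must carry a category.
        if (tag == null || category == null) {
            throw new IllegalArgumentException("tag and category are required");
        }
        this.tag = tag;
        this.category = category;
    }

    String getTag() { return tag; }
    Category getCategory() { return category; }
}

/** Hypothetical registry that exposes Tags only, never the backing collection. */
final class TagRegistry {
    private final Map<String, PosTag> tags = new HashMap<>();

    void register(PosTag tag) {
        tags.put(tag.getTag(), tag);
    }

    /** Returns the Tag registered for the given string, or empty if unknown. */
    Optional<PosTag> lookup(String tag) {
        return Optional.ofNullable(tags.get(tag));
    }
}
```

        A caller would register tags once and then resolve the strings returned by the service via lookup(..), handling the empty case explicitly.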

        [1] http://svn.apache.org/viewvc?rev=1400003&view=rev

        rwesten Rupert Westenthaler added a comment - - edited
        Regarding "verbal moods"

        The provided patch aligns verbal moods with a selected set of olia:MorphosyntacticCategory sub-classes (all under olia:Verb). However, those sub-classes are already mapped to LexicalCategories.

        On the other hand, there is an olia:MoodFeature that appears to represent verb moods. However, its sub-classes do not match the current members of the VerbMood enumeration.

        Because of that, I will leave the current VerbMood enum as it is for now, but I would like to understand this better.

        rwesten Rupert Westenthaler added a comment -

        Big thanks for the patch and sorry for the delay, but I had to work on other Stanbol things. I have successfully applied your patch to the trunk and plan to work on it in the coming days.

        Regarding the Olia POS property (and other similar things):

        I have already discussed this with Sebastian Hellmann. The suggestion was to add AnnotationProperties to the String Ontology that allow direct linking from a Word to the LexicalCategory (e.g. "string:lexicalCategory" or "string:posClass").

        Here is the detailed description:

        In OWL ontologies, properties cannot link to classes (only to instances). Because of that, LexicalCategories are specified in OLIA as classes, while the "Tag"s of POS TagSets are modelled as instances (of the POS classes). The String ontology defines the olialink property, which can be used to link to the "Tag".

        Such a link is useful if you assume that the consumer of the RDF graph uses an OWL reasoner with the OLIA, String, and mapping ontologies for the used POS TagSet loaded, but it is not very meaningful for users who lack this kind of infrastructure.

        Because of that, I discussed with Sebastian Hellmann the addition of an owl:AnnotationProperty to the String Ontology that would allow linking a Word directly to the POS classes defined by OLIA (entries of the LexicalCategory enumeration). AnnotationProperties can be used for such things because they MUST be ignored by any OWL reasoner.
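
        To make the proposal concrete, the sketch below builds the kind of triple such an annotation property would produce, serialized as an N-Triples line. It is only an illustration: the "posClass" property is the name suggested above and does not necessarily exist, the String ontology namespace is an assumption, and PosLinkSketch is a hypothetical helper, not part of Stanbol.

```java
/** Hypothetical helper; illustrates the proposed direct Word -> POS class link. */
final class PosLinkSketch {
    // Namespace of the OLiA reference model.
    static final String OLIA_NS = "http://purl.org/olia/olia.owl#";
    // ASSUMPTION: namespace of the NLP2RDF String ontology.
    static final String STRING_NS = "http://nlp2rdf.lod2.eu/schema/string/";

    private PosLinkSketch() { }

    /** Serializes one triple of three IRIs as an N-Triples line. */
    static String triple(String subject, String predicate, String object) {
        return "<" + subject + "> <" + predicate + "> <" + object + "> .";
    }

    /**
     * Links a word directly to an OLiA POS class using the proposed
     * (hypothetical) annotation property string:posClass.
     */
    static String posClassTriple(String wordUri, String oliaClassLocalName) {
        return triple(wordUri, STRING_NS + "posClass", OLIA_NS + oliaClassLocalName);
    }
}
```

        A consumer without an OWL reasoner could then read the POS class of a word with a plain triple lookup, without having to resolve Tag instances through a mapping ontology.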

        Regarding "LexicalCategory":

        I will probably add some additional categories while adding support for the hierarchical structure defined by the ontology to the enumeration (see the enumeration for Tenses as an example). Another possibility would be to define a second (hierarchical) enumeration with all POS tags defined by OLIA and map those to the ones currently defined in the LexicalCategory enumeration. This would make things easier for components for which the granularity of the current LexicalCategories is sufficient.
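
        A hierarchical enumeration of this kind could be sketched as follows. The enum PosCategory and its members are hypothetical and chosen only for illustration; they are not the actual LexicalCategory members or the OLIA class hierarchy.

```java
/** Hypothetical hierarchical POS enumeration; members are illustrative only. */
enum PosCategory {
    NOUN(null),
    PROPER_NOUN(NOUN),
    COMMON_NOUN(NOUN),
    VERB(null),
    AUXILIARY_VERB(VERB);

    // Parent in the hierarchy, or null for a top-level category.
    private final PosCategory parent;

    PosCategory(PosCategory parent) {
        this.parent = parent;
    }

    PosCategory getParent() {
        return parent;
    }

    /** Walks up the hierarchy to the top-level (coarse) category. */
    PosCategory root() {
        PosCategory current = this;
        while (current.parent != null) {
            current = current.parent;
        }
        return current;
    }

    /** True if this category equals the other one or is one of its descendants. */
    boolean isA(PosCategory other) {
        for (PosCategory c = this; c != null; c = c.parent) {
            if (c == other) {
                return true;
            }
        }
        return false;
    }
}
```

        Components that only need coarse categories could call root(), while components that need the fine-grained tags could use the members directly.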

        best
        Rupert

        alessio.bosca Alessio Bosca added a comment -

        The changes in this patch include:

        Lemmatizer engine behaviour: I replaced the generic
        hasMorphologicalFeature property with specific ones (hasGender, hasNumber,
        hasTense, etc.) taken from the Olia ontology.

        Olia lacks a specific property for the part of speech (POS), and since
        the other morphological properties in Olia (hasGender, hasNumber, etc.)
        require a POS class as their domain, I decided to model the POS annotation with an isA relation.

        I changed the test on the full morpho analysis so that it checks for specific features (lemma, POS, gender, number) of a known input (the Italian word "casa", meaning house).

        I couldn't find anything more standard for the lemma, so I kept the custom hasLemma property used so far.
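
        The switch from one generic property to feature-specific ones can be pictured with the sketch below. It is not the code from the patch: MorphoProperty and MorphoTriples are hypothetical names, the value names passed in by the caller are illustrative, and only the property names hasGender, hasNumber and hasTense are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical enum of feature-specific properties named in the description. */
enum MorphoProperty {
    GENDER("hasGender"),
    NUMBER("hasNumber"),
    TENSE("hasTense");

    // Namespace of the OLiA reference model.
    static final String OLIA_NS = "http://purl.org/olia/olia.owl#";

    private final String localName;

    MorphoProperty(String localName) {
        this.localName = localName;
    }

    String uri() {
        return OLIA_NS + localName;
    }
}

/** Hypothetical helper that emits one N-Triples line per morphological feature. */
final class MorphoTriples {
    private MorphoTriples() { }

    static List<String> describe(String wordUri, Map<MorphoProperty, String> features) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<MorphoProperty, String> e : features.entrySet()) {
            // The value is given as a local name in the OLiA namespace (illustrative).
            lines.add("<" + wordUri + "> <" + e.getKey().uri() + "> <"
                    + MorphoProperty.OLIA_NS + e.getValue() + "> .");
        }
        return lines;
    }
}
```

        For the test word "casa", a caller might pass a map such as GENDER to "Feminine" and NUMBER to "Singular", assuming those are the values the service returns.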

        The changes in the code are:

        Changes in nlp.pos

        - LexicalCategory: added Numeral, Clitic, ProperNoun (from Olia)

        Changes in nlp.morpho

        - Case: corrected typo (nstrumentel -> Instrumental)
        - Added enums for features: Person, VerbMood
        - Renamed Number enum to NumberFeature
        - Added Tag classes for the morpho feature enums (Gender, Tense, Person, ...)

        Changes in celi package

        Test
        - Modified validateMorphoFeatureProperty in the Lemmatizer test. Added a TERM
          constant to use as input for the full morpho analysis test

        Src
        - Added CeliMorphoFeatures, which groups the morphological features managed by
          the CELI engine; renamed and updated CeliTagsetRegistry


          People

          • Assignee:
            rwesten Rupert Westenthaler
            Reporter:
            rwesten Rupert Westenthaler