Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: enhancer-0.10.0
    • Component/s: Enhancer
    • Labels:
      None

      Description

      This issue covers the NLP processing components as discussed in http://markmail.org/message/qxusiup3mim2lhpx

      Goals
      =====

      1. provide a modular infrastructure for NLP-related things

      Many tasks in NLP can be computationally intensive, and there is no
      "one-size-fits-all" NLP approach when analysing text. Therefore, we wanted an
      NLP infrastructure that can be configured and wired together as needed for the
      specific use case, with several specialised modules that can build upon each
      other but many of which are optional.

      2. provide a unified data model for representing NLP text annotations

      In many scenarios, it will be necessary to implement custom engines building on
      the results of a previous "generic" analysis of the text (e.g. POS tagging and
      chunking). For example, in one project we identify so-called "noun phrases",
      use a lemmatizer to build the base form, then convert this to the singular
      nominative form to obtain a grammatically correct label for use in a tag
      cloud. Most of this builds on generic NLP functionality, but the last step is
      very specific to the use case.

      Therefore, we also wanted to implement a generic NLP data model that allows
      representing text annotations attached to individual words or to spans of
      words.
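
      The idea of annotations attached to words and to spans of words can be
      illustrated with a minimal sketch. It is not the data model contributed in
      the patch: the class names (SpanAnnotationSketch, AnnotatedText, Span), the
      annotation keys ("pos", "lemma", "chunk", "label") and the sample sentence
      are all hypothetical and chosen only for illustration. The sketch assumes
      that a generic analysis has already produced tokens and a noun phrase chunk,
      and shows a use-case-specific step that derives a label from them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; these are NOT the Stanbol NLP classes. */
public class SpanAnnotationSketch {

    /** A character range of the analysed text carrying key/value annotations. */
    static final class Span {
        private final String source;
        private final int start; // inclusive character offset
        private final int end;   // exclusive character offset
        private final Map<String, Object> annotations = new LinkedHashMap<>();

        Span(String source, int start, int end) {
            if (start < 0 || end > source.length() || start >= end) {
                throw new IllegalArgumentException("invalid span [" + start + "," + end + ")");
            }
            this.source = source;
            this.start = start;
            this.end = end;
        }

        String getText() { return source.substring(start, end); }

        /** Attaches an annotation and returns this span to allow chaining. */
        Span annotate(String key, Object value) {
            annotations.put(key, value);
            return this;
        }

        Object getAnnotation(String key) { return annotations.get(key); }

        /** True if the other span lies completely inside this one. */
        boolean encloses(Span other) {
            return start <= other.start && other.end <= end;
        }
    }

    /** The text plus all spans added by earlier processing steps. */
    static final class AnnotatedText {
        private final String text;
        private final List<Span> spans = new ArrayList<>();

        AnnotatedText(String text) { this.text = text; }

        Span addSpan(int start, int end) {
            Span span = new Span(text, start, end);
            spans.add(span);
            return span;
        }

        /** Spans lying within the given span, excluding the span itself. */
        List<Span> within(Span outer) {
            List<Span> result = new ArrayList<>();
            for (Span span : spans) {
                if (span != outer && outer.encloses(span)) {
                    result.add(span);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        AnnotatedText text = new AnnotatedText("The red car stopped");

        // Results of a "generic" analysis (values written by hand here).
        text.addSpan(0, 3).annotate("pos", "DT");
        text.addSpan(4, 7).annotate("pos", "JJ");
        text.addSpan(8, 11).annotate("pos", "NN").annotate("lemma", "car");
        Span nounPhrase = text.addSpan(0, 11).annotate("chunk", "NP");

        // Use-case-specific step: label the noun phrase with the lemma of its noun.
        for (Span token : text.within(nounPhrase)) {
            if ("NN".equals(token.getAnnotation("pos"))) {
                nounPhrase.annotate("label", token.getAnnotation("lemma"));
            }
        }
        System.out.println(nounPhrase.getText() + " -> " + nounPhrase.getAnnotation("label"));
    }
}
```

      Keeping annotations as key/value pairs on character ranges lets a later,
      optional step read what earlier steps wrote without depending on how those
      steps were implemented.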

      1. srfgkmt-stanbol-nlp.zip
        139 kB
        Sebastian Schaffert

        Issue Links

          Activity

          Rupert Westenthaler added a comment -

          The remaining sub-tasks were converted into separate issues (all of type "new feature"). All core functionality is resolved, so this issue can be resolved as well.

          Documentation is available at

          http://stanbol.apache.org/docs/trunk/components/enhancer/nlp/

          Rupert Westenthaler added a comment -

          Status update:

          The patch provided by Sebastian Schaffert was applied with revision 1387488 [1]. I also added the data files used by the contributed Engines to {stanbol}/data. The German noun phrase chunker was added to the o.a.s.data.opennlp.lang.de module. For the sentiment-related data files, new modules and a sentiment bundlelist were created. I also added a special launcher (nlp-launcher) intended for testing developments in the nlp-processing branch.

          In a second commit [2] I slightly changed the default configuration of the Engines so that they can use ConfigurationPolicy.OPTIONAL - meaning that an instance of those Engines is active by default. Also a "nlp-processing" chain configuration was added to the default launcher.

          The nlp-processing branch is now in a state where early adopters can start testing it. I will continue to work on the adaptation of the CELI Lemmatizer Engine (STANBOL-739) and the usage of the nlp-processing results by the KeywordLinkingEngine (STANBOL-740).

          [1] http://svn.apache.org/viewvc?rev=1387488&view=rev
          [2] http://svn.apache.org/viewvc?rev=1387596&view=rev

          Sebastian Schaffert added a comment -

          Here is the correct link to the dropbox folder with the data files: https://www.dropbox.com/sh/lrke4vs4em2n7c4/M9PKoyl-ye/stanbol

          Sebastian Schaffert added a comment -

          A patch containing NLP enhancement engines for Apache Stanbol addressing the goals mentioned in the issue. The patch excludes all data files; they can be found at https://www.dropbox.com/home/Public/stanbol


            People

            • Assignee:
              Unassigned
              Reporter:
              Rupert Westenthaler
            • Votes:
              0
              Watchers:
              2
